CN114880436A - Text processing method and device - Google Patents


Info

Publication number
CN114880436A
Authority
CN
China
Prior art keywords
text
written
written language
language text
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210590875.6A
Other languages
Chinese (zh)
Inventor
弓源
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202210590875.6A
Publication of CN114880436A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application provides a text processing method and a text processing device. The text processing method comprises: acquiring a target spoken language text; inputting the target spoken language text into a text classification model for classification processing to obtain a predicted text type corresponding to the target spoken language text; determining a rewriting model corresponding to the target spoken language text according to the predicted text type; and inputting the target spoken language text into the rewriting model for written language rewriting to obtain the written language text corresponding to the target spoken language text.

Description

Text processing method and device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text processing method and apparatus.
Background
Artificial Intelligence (AI) refers to the ability of an engineered (i.e., designed and manufactured) system to perceive its environment and to acquire, process, apply, and represent knowledge. Key artificial intelligence technologies include machine learning, knowledge graphs, natural language processing, computer vision, human-computer interaction, biometric recognition, and virtual reality/augmented reality. Natural Language Processing (NLP) refers to using a computer to operate on and process the form, sound, and meaning of natural language, that is, to input, output, recognize, analyze, understand, and generate characters, words, sentences, and passages.
Text generation is an important research direction in natural language processing; machine translation, abstract generation, and text style migration are among its important tasks. Rewriting spoken language text into written language text is another such task, with important applications in daily work and life. For example, in scenarios that involve analyzing spoken texts, such as analyzing recording transcripts, summarizing conference speech text, and converting transcripts into important written documents, the quality of the conversion from spoken text to written language text matters greatly. However, rewriting spoken text into written language text remains a great challenge, because spoken texts vary widely in quality and the results of text generation are highly uncertain and can be incoherent.
Disclosure of Invention
In view of this, embodiments of the present application provide a text processing method to solve technical defects in the prior art. The embodiment of the application also provides a text processing device, a computing device and a computer readable storage medium.
According to a first aspect of embodiments of the present application, there is provided a text processing method, including:
acquiring a target spoken language text;
inputting the target spoken language text into a text classification model for classification processing to obtain a predicted text type corresponding to the target spoken language text;
determining a rewriting model corresponding to the target spoken language text according to the predicted text type;
and inputting the target spoken language text into the rewriting model for written language rewriting to obtain the written language text corresponding to the target spoken language text.
According to a second aspect of embodiments of the present application, there is provided a text processing apparatus including:
an acquisition module configured to acquire a target spoken language text;
a classification module configured to input the target spoken language text into a text classification model for classification processing to obtain a predicted text type corresponding to the target spoken language text;
a selection module configured to determine a rewriting model corresponding to the target spoken language text according to the predicted text type;
and a processing module configured to input the target spoken language text into the rewriting model for written language rewriting to obtain the written language text corresponding to the target spoken language text.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is for storing computer-executable instructions that when executed by the processor implement the steps of the text processing method.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the text processing method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the text processing method.
According to the text processing method provided by the present application, after the target spoken language text is acquired, it is first input into a text classification model for classification processing to obtain a predicted text type, which represents the semantic expression definition (clarity) of the target spoken language text. A rewriting model suited to a target spoken language text of that semantic definition is then determined according to the predicted text type and used to perform written language rewriting, yielding the written language text corresponding to the target spoken language text. Because the text type determined by the text classification model selects the rewriting model that processes the target spoken language text in the current scenario, the target spoken language text can be processed accurately, and pairing the text classification model with the rewriting model effectively improves rewriting efficiency.
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a diagram illustrating a text processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of a text processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating sample corpora constructed in a text processing method according to an embodiment of the present application;
FIG. 5 is a processing flow chart of a text processing method applied to an actual scene according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit and scope of the present application; the present application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application is intended to encompass any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if," as used herein, may be interpreted as "responsive to a determination," depending on the context.
First, the terms involved in one or more embodiments of the present application are explained.
Seq2Seq (Sequence to Sequence) model: a family of machine learning methods for natural language processing, commonly used in machine translation, image captioning, dialogue models, text summarization, and the like.
Transformer model: a deep learning model that uses an attention mechanism to weight the importance of each part of the input data differently; it is widely applied to natural language processing tasks.
Text classification: assigning a text to one or more categories within a given classification system.
Natural Language Generation (NLG, Natural Language Generation): as part of natural language processing, natural language text is generated from a machine expression system such as a knowledge base or logical form.
Text style migration: text in one stylistic form is transcribed to produce text in another stylistic form.
Abstract generation: the process of compressing and summarizing a long text to form a short text that conveys its general meaning.
Machine translation: the process of using a computer to convert one natural language (the source language) into another natural language (the target language).
Entity: a word or phrase in the text that refers to something with a specific meaning.
Part-of-speech tagging: the process of determining the grammatical category (part of speech) of each word in a given sentence and labeling it; it is a basic task in Natural Language Processing (NLP).
Syntactic analysis: one of the key underlying technologies in Natural Language Processing (NLP); its basic task is to determine the syntactic structure of a sentence or the dependency relationships between the words in the sentence.
In the present application, a text processing method is provided, and the present application relates to a text processing apparatus, a computing device and a computer readable storage medium, which are described in detail in the following embodiments one by one.
Fig. 1 illustrates a block diagram of a computing device 100 provided according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, which enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
In practical application, rewriting spoken text into written language text has important applications in daily work and life. In the prior art, the spoken text can be rewritten into written language text manually; alternatively, rule-based rewriting can be used to rewrite and replace the spoken expressions that the rules can handle; in addition, the spoken text can be converted directly into written language text by means of text translation.
Manual rewriting consumes a large amount of labor, and the quality and results of the text conversion are inconsistent. Rule-based rewriting can only handle a limited number of spoken words and fixed text forms, and the rewriting rules are complex to process. Text translation can achieve a certain degree of conversion, but on the whole it is not suited to the task of converting spoken language into written language.
Therefore, to rewrite spoken text into written language text accurately, the spoken text can be rewritten using a rewriting model trained in advance. However, because spoken texts vary in quality, processing all of them with a single uniform rewriting model may not achieve an accurate rewriting effect. An effective solution to the above problems is therefore needed.
Referring to FIG. 2, FIG. 2 is a schematic diagram illustrating a text processing method according to an embodiment of the present application. After the target spoken language text is obtained, it is input into a text classification model, which is obtained by training an initial text classification model on pre-collected text corpus semantic definition annotation data. The text classification model classifies the input target spoken language text and outputs the predicted text type corresponding to it. The rewriting model corresponding to the target spoken language text is then determined according to the predicted text type; the rewriting model is either a written language rewriting model or a written language conversion model. If the predicted text type is a standard text type, the rewriting model is the written language rewriting model; if the predicted text type is a fuzzy text type, the rewriting model is the written language conversion model. The rewriting model is obtained by model training on a sample corpus constructed from aligned spoken text and written language text data. Finally, the rewriting model performs written language rewriting on the input target spoken language text and outputs the corresponding written language text.
In the present application, the target spoken language text is classified according to its quality, and a rewriting model suited to it is selected to perform the rewriting, so that written language rewriting is more targeted and its accuracy is improved.
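The classify-then-route flow described above can be sketched as follows. This is only an illustrative sketch: `classify_text`, `written_language_rewrite`, `written_language_convert`, and the filler-word heuristic are hypothetical stand-ins for the trained models and are not taken from the application.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the trained text classification model.
# The filler-word heuristic is an assumption made only for illustration.
def classify_text(text: str) -> str:
    fillers = {"um", "uh", "like"}
    words = [w.strip(",.!?") for w in text.lower().split()]
    filler_ratio = sum(w in fillers for w in words) / max(len(words), 1)
    return "fuzzy" if filler_ratio > 0.2 else "standard"

# Hypothetical stand-ins for the two trained rewriting models.
def written_language_rewrite(text: str) -> str:
    return "[rewritten] " + text

def written_language_convert(text: str) -> str:
    return "[converted] " + text

# Predicted text type -> rewriting model, as in the overview of FIG. 2.
REWRITING_MODELS: Dict[str, Callable[[str], str]] = {
    "standard": written_language_rewrite,
    "fuzzy": written_language_convert,
}

def process(text: str) -> str:
    """Classify the spoken text, pick the matching model, and rewrite."""
    predicted_type = classify_text(text)
    rewriting_model = REWRITING_MODELS[predicted_type]
    return rewriting_model(text)
```

Keeping the type-to-model mapping in a dictionary means a further text type could be added without changing `process`.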
Fig. 3 shows a flowchart of a text processing method according to an embodiment of the present application, which specifically includes the following steps:
Step 302: Acquire a target spoken language text.
The target spoken language text is the spoken language text to be rewritten. In practical applications, the target spoken text may be spoken text in any field, such as spoken text in the medical field, spoken text in the chemical field, spoken text in the sales field, spoken text in the daily life field, spoken text in the travel field, and the like. In addition, the number of texts of the target spoken text may be one or more.
In this embodiment, the text processing method is described by taking the obtained target spoken language text TST as an example, and the processing processes of other target spoken language texts may refer to the same or similar descriptions in this embodiment, which is not limited herein.
Step 304: Classify the target spoken language text to obtain a text type corresponding to the target spoken language text.
Specifically, the target spoken language texts to be rewritten may differ in quality. If written language rewriting is performed on the target spoken language text directly, the rewriting quality cannot be guaranteed.
Further, the target spoken language text is classified to obtain a text type corresponding to the target spoken language text, and the method is specifically implemented as follows:
inputting the target spoken language text into a text classification model for classification processing to obtain a text type corresponding to the target spoken language text; the training of the text classification model comprises the following steps:
acquiring a sample spoken language text and a semantic definition label corresponding to the sample spoken language text;
constructing a training sample pair based on the sample spoken language text and the semantic definition label;
and performing model training on the initial text classification model through the training sample pair until the text classification model meeting the classification training stopping condition is obtained.
The text classification model is a model trained in advance for classifying the target spoken language text. It can be a binary classification model that recognizes the semantic definition (clarity) of the target spoken language text and classifies it as a standard text type or a fuzzy text type. The standard text type means that the semantic expression of the target spoken language text is clear; the fuzzy text type means that its semantic expression is fuzzy. The semantic definition label is a label annotated according to the semantic definition of the sample spoken language text.
Furthermore, the target spoken language text may also be text that contains no semantic information, and rewriting such text into written language is meaningless. Therefore, a three-class classification model can be used to classify the target spoken language text as a standard text type, a fuzzy text type, or an invalid text type.
In practical application, if the text classification model is a binary classification model, the semantic definition tag includes: a standard text type and a fuzzy text type. If the text classification model is a three-classification model, the semantic definition label comprises: a standard text type, a fuzzy text type, and an invalid text type.
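As a minimal sketch of how training sample pairs might be built from sample spoken language texts and their semantic definition labels, the following pairs each text with an integer label id. The label strings and the function name `build_training_pairs` are assumptions chosen for illustration.

```python
# Label sets for the binary and three-class settings (names are illustrative).
BINARY_LABELS = ["standard", "fuzzy"]
THREE_CLASS_LABELS = ["standard", "fuzzy", "invalid"]

def build_training_pairs(texts, labels, label_set):
    """Pair each sample spoken text with the integer id of its label."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    label_to_id = {name: i for i, name in enumerate(label_set)}
    pairs = []
    for text, label in zip(texts, labels):
        if label not in label_to_id:
            raise ValueError(f"label {label!r} is not in the label set")
        pairs.append((text, label_to_id[label]))
    return pairs
```

Passing `THREE_CLASS_LABELS` instead of `BINARY_LABELS` is the only change needed to move from the binary setting to the three-class setting.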
It should be noted that the semantic definition labels of the sample spoken language texts need to be annotated in advance; the sample spoken language texts and their semantic definition labels are then used as text corpus semantic definition annotation data to train the initial text classification model. The initial text classification model is a text classification model to be trained, constructed in advance from a CNN (convolutional neural network), an RNN (recurrent neural network), an LSTM (long short-term memory network), FastText, TextCNN, a HAN model, or the like.
In practical application, the loss function used to calculate the model loss value can be a 0-1 loss function, an absolute value loss function, a square loss function, a cross entropy loss function, or the like. The 0-1 loss function is taken as an example here, as shown in formula 1:
$$L(Y, f(x)) = \begin{cases} 1, & Y \neq f(x) \\ 0, & Y = f(x) \end{cases} \qquad (1)$$
Here, L represents the loss value, f(x) represents the predicted text type, and Y represents the sample text type. The present application does not limit the choice of loss function, which depends on the practical application.
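The 0-1 loss named here assigns a loss of 0 when the predicted text type equals the sample text type and 1 otherwise. A minimal sketch, with the function names and label strings chosen only for illustration:

```python
def zero_one_loss(y_true, y_pred):
    """Return 0 if the prediction matches the sample label, else 1."""
    return 0 if y_true == y_pred else 1

def mean_zero_one_loss(y_true_batch, y_pred_batch):
    """Average 0-1 loss over a batch, i.e. the error rate."""
    if len(y_true_batch) != len(y_pred_batch):
        raise ValueError("batches must have the same length")
    losses = [zero_one_loss(t, p) for t, p in zip(y_true_batch, y_pred_batch)]
    return sum(losses) / len(losses)
```

Because the 0-1 loss is piecewise constant, it gives no useful gradient; gradient-based training would typically rely on one of the other listed options, such as the cross entropy loss.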
After the model loss value is calculated, the model parameters of the initial text classification model can be adjusted according to the model loss value, and the initial text classification model continues to be trained on the next batch of text corpus semantic definition annotation data until a classification training stopping condition is reached. Specifically, the classification training stopping condition can be that the model loss value is smaller than a preset threshold, that the number of training iterations reaches a preset number, or the like, and is not limited herein.
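The two stopping conditions mentioned above (loss below a preset threshold, or a preset number of iterations reached) can be sketched as a generic loop. The `train_step` callable, which is assumed to update the model on one batch and return the loss, and the default values are hypothetical.

```python
def train_until_stop(train_step, batches, loss_threshold=0.05, max_iterations=1000):
    """Run train_step over batches until either stopping condition is met.

    Returns the number of iterations performed and the last loss value.
    """
    iteration = 0
    loss = float("inf")
    for batch in batches:
        if iteration >= max_iterations:
            break  # preset number of iterations reached
        loss = train_step(batch)  # assumed to adjust parameters and return the loss
        iteration += 1
        if loss < loss_threshold:
            break  # loss fell below the preset threshold
    return iteration, loss
```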
In conclusion, the semantic definition classification is carried out on the target spoken language text through the text classification model trained in advance, so that the semantic quality of the target spoken language text can be effectively identified, the target spoken language texts of different types can be reasonably rewritten, and the quality of written language rewriting is guaranteed.
Step 306: In the case that the text type is a standard text type, select a corresponding written language rewriting model according to the standard text type.
Specifically, after the text type corresponding to the target spoken language text is determined, if the semantic expression of the target spoken language text is relatively clear (that is, the text type is a standard text type), a relatively complex rewriting mode can be adopted for written language rewriting while still preserving the semantics. Therefore, for a target spoken language text of the standard text type, a written language rewriting model capable of relatively complex rewriting can be selected to process it.
The written language rewriting model is a model for rewriting a spoken language text into a written language text. Specifically, the written language rewriting model may be constructed based on a Seq2Seq model, and both the encoder and the decoder in the Seq2Seq model may be constructed using a Transformer model.
Following the above example, the target spoken language text TST is input into the text classification model, and the text type output by the text classification model for TST is the standard text type; TST is then input into the written language rewriting model, and the target written language text TLT1 output by the written language rewriting model is obtained.
Step 308: Input the target spoken language text into the written language rewriting model for processing to obtain a target written language text corresponding to the target spoken language text.
The written language rewriting model is trained on written language text together with spoken language text obtained by back-translating and converting that written language text.
Specifically, once the written language rewriting model has been selected, it is used to rewrite the target spoken language text, yielding the target written language text generated by the rewriting.
In specific implementation, to ensure the accuracy of the written language rewriting and to avoid uncontrollable text generation results, a character-level mask operation can be adopted in the rewriting process. The character-level mask operation ensures that the generated rewriting result comes mainly from the input text. In the embodiment of the present application, the written language rewriting model comprises an encoding layer and a decoding layer, and the target spoken language text is rewritten by the written language rewriting model as follows:
carrying out sentence splitting processing on the target spoken language text to obtain a sentence sequence contained in the target spoken language text;
sequentially inputting the spoken sentence units in the sentence sequence into the encoding layer of the written language rewriting model for encoding to obtain the sentence feature vector and the vocabulary vector corresponding to each spoken sentence unit, wherein the vocabulary vector is obtained by mapping the spoken sentence unit against the vocabulary;
and calculating a vector product between the sentence feature vector and the vocabulary vector, inputting the vector product into the decoding layer of the written language rewriting model for decoding, and obtaining a target written language text corresponding to the target spoken language text.
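The sentence splitting in the first of these steps could be sketched as below. The set of sentence-ending punctuation marks and the regular expression are assumptions; the application does not specify how sentences are split.

```python
import re

# One run of non-terminator characters followed by any trailing terminators.
# The terminator set (Chinese and Western marks) is an assumption.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")

def split_sentences(text):
    """Split text into a sentence sequence, preserving the original order."""
    parts = (part.strip() for part in _SENTENCE.findall(text))
    return [part for part in parts if part]
```

A pattern this simple also splits at decimal points and abbreviations, so a production system would likely need a more careful splitter.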
The encoding layer is the layer of the written language rewriting model that converts input information into another representation for processing within the model. Correspondingly, the sentence feature vector is the vector representation obtained after a spoken sentence unit is encoded. The decoding layer is the layer of the written language rewriting model that converts the sentence feature vector into a decoding vector; in practical application, after the decoder outputs the decoding vector, the decoding vector is input to the output layer to obtain the target written language text output by the output layer.
The vocabulary is a list of characters/words. Specifically, the vocabulary may be generated by counting the frequency of the characters/words that appear in the sample corpus during training of the written language rewriting model (for example, adding the characters/words whose frequency in the training sample corpus is greater than a threshold to the vocabulary), may be carried by the model itself, or may be generated in other ways. The sentence sequence is the sequence formed by arranging the spoken sentences contained in the target spoken language text in the order in which they appear in that text. Accordingly, a spoken sentence unit is a spoken sentence included in the sentence sequence.
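The frequency-threshold way of generating the vocabulary can be sketched as follows, at character level. The function name, the default threshold, and the choice to sort the result are illustrative assumptions.

```python
from collections import Counter

def build_vocabulary(corpus, threshold=1):
    """Collect characters whose frequency in the corpus exceeds the threshold."""
    counts = Counter(
        ch for sentence in corpus for ch in sentence if not ch.isspace()
    )
    # Sorting gives every character a stable position in the vocabulary.
    return sorted(ch for ch, n in counts.items() if n > threshold)
```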
In specific implementation, mapping a spoken sentence unit against the vocabulary means matching the characters/words in the spoken sentence against the characters/words in the vocabulary. If a character/word in the vocabulary is hit by a character/word in the spoken sentence, the vector bit corresponding to the hit character/word is set to 1, and the vector bits corresponding to the characters/words that are not hit are set to 0, which yields the vocabulary vector. For example, suppose the vocabulary includes 5000 characters and the 4 characters in spoken sentence 1 are mapped against it, where the 1st character maps to the 3rd character in the vocabulary, the 2nd character maps to the 6th, the 3rd character maps to the 9th, and the 4th character maps to the 5th; the resulting vocabulary vector is then 00101100100……0.
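The mapping in this example can be reproduced with a small multi-hot encoding. In the sketch below the vocabulary is shortened to 11 hypothetical tokens (`t1` to `t11`) so that the whole vector is visible; the token names and the function name are assumptions.

```python
def vocabulary_vector(sentence_tokens, vocabulary):
    """Set bit i to 1 if vocabulary[i] occurs in the sentence, else 0."""
    position = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in sentence_tokens:
        i = position.get(token)
        if i is not None:  # tokens outside the vocabulary are ignored
            vector[i] = 1
    return vector

# Hypothetical 11-entry vocabulary; the sentence hits entries 3, 6, 9 and 5.
vocabulary = [f"t{i}" for i in range(1, 12)]
sentence = ["t3", "t6", "t9", "t5"]
bits = "".join(str(b) for b in vocabulary_vector(sentence, vocabulary))
print(bits)  # 00101100100
```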
Furthermore, the vector product between the word list vector and the sentence feature vector is calculated, and decoding is performed based on the vector product, so that the decoding output is constrained by the characters/words that occur in the input text. These operations, implemented through the word list, may also be referred to as character-level masking operations.
In conclusion, through the character-level masking operation, the text characters generated by the written language rewriting model come mainly from the input text, which largely avoids semantic deviation in the rewriting result of the written language rewriting model.
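The text does not specify the exact form of the vector product. One possible reading, sketched below purely as an assumption, is an element-wise product between the word list vector and a vector of per-entry decoding scores of the same length, so that entries absent from the input text receive a score of zero. The names `mask_scores` and `pick_entry` and the score values are hypothetical.

```python
def mask_scores(scores, vocab_vector):
    """Element-wise product: keep a score where the bit is 1, zero it elsewhere."""
    return [score * bit for score, bit in zip(scores, vocab_vector)]


def pick_entry(scores, vocab_vector):
    """Return the index of the highest-scoring vocabulary entry after masking."""
    masked = mask_scores(scores, vocab_vector)
    return max(range(len(masked)), key=lambda i: masked[i])


# Assumed non-negative scores over a toy 4-entry vocabulary.
scores = [0.5, 0.1, 0.3, 0.1]
vocab_vector = [0, 1, 1, 0]  # only entries 1 and 2 occur in the input text

masked = mask_scores(scores, vocab_vector)
best = pick_entry(scores, vocab_vector)
print(masked, best)
```

Without the mask, entry 0 would be chosen because it has the highest raw score; with the mask, the choice is restricted to entries that appear in the input.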
In specific implementation, the written language rewrite training of the model is specifically realized through the following steps 30802 to 30810:
Step 30802: acquiring written language text.
Written language text refers to text formed using the language that people use when writing and reading an article, with words being the main components. The written language text can be written language text in any field, such as written language text in the medical field, written language text in the chemical field, written language text in the sales field, written language text in the daily life field, written language text in the travel field, and the like.
For example, the obtained written language text is written language text LT from a literature corpus.
Step 30804: performing back-translation processing on the written language text to obtain a back-translated written language text corresponding to the written language text.
Specifically, once the written language text has been obtained, simply converting it to generate the corresponding spoken language text may still expand the sample corpus only to a limited extent. To further expand the sample corpus, the written language text may first be expanded through back-translation processing, and the expanded written language text may then be converted into the corresponding spoken language text.
Back-translation processing refers to the process of translating text in language A into language B and then translating the text in language B back into language A. In practical application, the back-translated written language text may express the content differently from the original written language text, so it can be used to expand the written language text.
Further, the back-translated written language text generated by back-translation processing may differ greatly from the original written language text and may lose the meaning that the original written language text is intended to express. To ensure that the key information in the back-translated written language text and in the written language text remains consistent, words in the text generated by back-translation may be replaced with the key words in the written language text, which is specifically implemented in the following manner in the embodiment of the present application:
translating the written language text into a translated written language text in a preset language;
translating the translated written language text back into the target language to which the written language text belongs to obtain an initial back-translated written language text;
and replacing the target key words corresponding to the key words in the initial back-translated written language text with the key words in the written language text to obtain the back-translated written language text.
The preset language may be any one or more of English, French, Korean, German, etc., without limitation. Accordingly, the target language refers to the language to which the text of the written language text belongs.
In practical application, the written language text is first translated into text in another language, namely the translated written language text. The translated written language text is then translated back into the language to which the written language text belongs to obtain the initial back-translated written language text. Because of the translation process, the initial back-translated written language text may deviate considerably from the meaning expressed by the written language text. To keep the key information of the two texts unchanged, the corresponding words in the initial back-translated written language text (namely the target key words) may be replaced with the key words in the written language text, thereby generating a back-translated written language text whose key information is consistent with the written language text.
The key words may be words selected from the written language text in advance that are considered important to the written language text. In practical application, the key words may be selected according to a preset selection rule, for example according to the part of speech or the entity type of the words. In addition, the key words may be selected through a preset keyword library, in which case the words of the written language text that are contained in the keyword library are used as the key words, and so on.
In specific implementation, before the target key words in the initial back-translated written language text are replaced with the key words in the written language text, the target key word corresponding to each key word needs to be determined. The determination may be made in various ways. For example, it may be made according to the positional relationship between the key word and the target key word in the text sentence; by searching the initial back-translated written language text for a synonym of the key word and taking the synonym as the target key word; or according to the sentence component to which the key word belongs, taking a word that belongs to the same sentence component as the target key word (for example, the subject, predicate, object, attributive, adverbial, or complement of the sentence may be taken as the key word, and the word with the same component is selected in the initial back-translated written language text as the target key word). In practical application, a suitable way may be chosen according to the actual scenario to determine the target key word corresponding to each key word.
After the target key words corresponding to the key words are determined, the corresponding target key words in the initial back-translated written language text are replaced with the key words, and the back-translated written language text is obtained.
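The three operations (translate out, translate back, replace the target key words) can be sketched as below. This is an illustration under stated assumptions: `to_preset` and `to_target` stand in for a real translation system and are replaced here by toy functions, and the mapping from target key word to key word is supplied directly instead of being determined by one of the methods described above. All names are hypothetical.

```python
import re
from typing import Callable, Dict


def replace_whole_word(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only where `old` stands as a whole word."""
    return re.sub(r"\b" + re.escape(old) + r"\b", lambda _match: new, text)


def back_translate(
    written_text: str,
    to_preset: Callable[[str], str],   # target language -> preset language
    to_target: Callable[[str], str],   # preset language -> target language
    target_to_keyword: Dict[str, str],  # target key word -> key word
) -> str:
    translated = to_preset(written_text)  # translated written language text
    initial = to_target(translated)       # initial back-translated text
    for target_word, keyword in target_to_keyword.items():
        initial = replace_whole_word(initial, target_word, keyword)
    return initial


# Toy stand-ins for a translation system (for illustration only).
def toy_to_preset(text: str) -> str:
    return text.replace("hometown", "Heimat")


def toy_to_target(text: str) -> str:
    return text.replace("Heimat", "home")


result = back_translate(
    "my hometown is Shanxi", toy_to_preset, toy_to_target, {"home": "hometown"}
)
print(result)  # prints my hometown is Shanxi
```

The `\b` word boundary is only meaningful for space-delimited languages; for Chinese text a different matching approach would be needed.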
Following the above example, given that the language to which the written language text LT belongs is Chinese and the preset language is German, the Chinese written language text LT is translated into German to obtain the German translated written language text LT1, and LT1 is then translated back into Chinese to obtain the Chinese initial back-translated written language text LT2. Assume that the written language text LT contains the written language sentence S1, "my hometown is Shanxi, and it is beautiful there". The key word in S1 is the geographical location entity "Shanxi". In the case that the written language sentence S11 in the initial back-translated written language text LT2 that corresponds to S1 is "my hometown is in Shanxi, and it is very beautiful there", the target key word in S11 corresponding to the key word is the geographical location entity "Shanxi". The target key word "Shanxi" in S11 is replaced by the key word "Shanxi" from S1 to obtain the back-translated written language text LT3, which contains the written language sentence S12 obtained by the replacement in S11: "my hometown is in Shanxi, and it is very beautiful there".
In conclusion, in the back-translation process, the corresponding target key words in the back-translated text are replaced with the key words in the written language text, so that the key information in the back-translated written language text stays consistent with that in the written language text while the corpus of written language text is expanded, and the accuracy of back-translating the written language text is improved.
In specific implementation, accurately determining the target key words corresponding to the key words is important for keeping the text meaning of the back-translated written language text consistent with that of the written language text. To avoid replacing wrongly determined target key words with the key words, position marks may be added to the key words in the written language text so that the target key words corresponding to the key words can be accurately located and replaced. In the embodiment of the present application, before translating the written language text into the translated written language text in the preset language, the method further includes:
identifying the key words whose part of speech is a preset part of speech in the written language text by performing part-of-speech analysis on the written language text;
marking the positions of the key words in the written language text;
correspondingly, replacing the target key words corresponding to the key words in the initial back-translated written language text with the key words in the written language text to obtain the back-translated written language text includes:
replacing the corresponding target key words in the initial back-translated written language text with the key words based on the position marks to obtain the back-translated written language text.
Specifically, part-of-speech analysis of the written language text may determine the part of speech of each word in the written language text by part-of-speech tagging. Part-of-speech tagging may use a rule-based method, a method based on a statistical model, or a method combining statistical and rule-based approaches. Accordingly, the part of speech is the category assigned to a word according to its characteristics, and may be noun, verb, adjective, numeral, and so on. In practical applications, written language texts may belong to different fields, and the parts of speech considered important (that is, those of the key words) may differ between fields. For example, in the chemical field, words whose part of speech is numeral are regarded as key words, whereas in the daily life field, words whose part of speech is noun are regarded as key words.
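Selecting key words by a field-dependent preset part of speech can be sketched as follows. The tagged words are hand-assigned here for illustration; in practice they would come from a part-of-speech tagger of the kinds listed above. The table `PRESET_POS_BY_FIELD`, the tag names, and the function name are assumptions, not part of the original text.

```python
# Hypothetical mapping from field to the parts of speech treated as key words.
PRESET_POS_BY_FIELD = {
    "chemical": {"numeral"},
    "daily life": {"noun"},
}


def identify_keywords(tagged_words, field):
    """Return the words whose part of speech is a preset one for the field."""
    preset_pos = PRESET_POS_BY_FIELD[field]
    return [word for word, pos in tagged_words if pos in preset_pos]


# Hand-tagged example sentence (the tags are assumptions, not tagger output).
tagged = [
    ("my", "pronoun"),
    ("hometown", "noun"),
    ("is", "verb"),
    ("Shanxi", "noun"),
]

print(identify_keywords(tagged, "daily life"))  # ['hometown', 'Shanxi']
print(identify_keywords(tagged, "chemical"))    # []
```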
In specific implementation, the positions of the key words may be marked with symbols such as braces { } or asterisks. In practical applications, the position marks may be added before and after the key word. For example, if the key word is "mobile phone" and its position is marked with braces { }, the marked key word is {mobile phone}.
Before the written language text is translated, the positions of the key words in the written language text are marked. The position marks can still be retained in the initial back-translated written language text obtained after back-translation, and the words enclosed by the position marks in the initial back-translated written language text are the target key words. That is, the target key words corresponding to the key words can be accurately located by means of the position marks and therefore accurately replaced. When a written language sentence contains multiple key words, the target key word corresponding to each key word may be determined according to the similarity between the marked word and the key word, or according to the sentence component (for example, subject, predicate, or object) of the marked word, with the marked word that has the same sentence component as the key word taken as its target key word.
In addition, to facilitate subsequent text processing of the written language text and of the text obtained after replacement, the position marks in both texts may be deleted; the back-translated written language text is obtained by deleting the position marks in the text obtained after replacement.
Taking the written language sentence S1 in the written language text LT as an example, part-of-speech analysis is performed on S1 and, with the preset part of speech being the noun, the key words identified in S1 are "hometown" and "Shanxi". The nouns "hometown" and "Shanxi" are marked with the position mark { } to obtain the marked written language sentence S1: "my {hometown} is {Shanxi}, and it is beautiful there". Suppose the written language sentence S11 in the initial back-translated written language text LT2 that corresponds to the marked S1 is "my {home} is {Shanxi}, and it is very beautiful there". The target key word "home" enclosed by the position mark { } in S11 is replaced by "hometown", and the target key word "Shanxi" enclosed by the position mark { } in S11 is replaced by "Shanxi", so that S11 after replacement is "my {hometown} is {Shanxi}, and it is very beautiful there". The position marks in S11 after replacement are then deleted, and the resulting written language sentence S12 is "my hometown is Shanxi, and it is very beautiful there".
In conclusion, the positions of the key words in the written language text are marked before translation, and the target key words are then determined and replaced through the position marks, which improves the accuracy and efficiency of the replacement.
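The marking, replacement, and mark-deletion steps of this example can be sketched as follows. The sketch pairs the i-th marked word with the i-th key word, which is a simplifying assumption; the text above also allows pairing by similarity or by sentence component. The function names are hypothetical, and the back-translated sentence is typed in by hand rather than produced by a translation system.

```python
import re

MARK = re.compile(r"\{([^{}]*)\}")  # a word enclosed by the position mark { }


def mark_keywords(sentence, keywords):
    """Enclose the first occurrence of each key word in braces."""
    for keyword in keywords:
        sentence = sentence.replace(keyword, "{" + keyword + "}", 1)
    return sentence


def replace_marked(marked_back_translation, keywords):
    """Replace the i-th marked word with the i-th key word (order-based pairing)."""
    remaining = iter(keywords)

    def substitute(match):
        keyword = next(remaining, None)
        # Keep the marked word unchanged if there are more marks than key words.
        return "{" + (keyword if keyword is not None else match.group(1)) + "}"

    return MARK.sub(substitute, marked_back_translation)


def strip_marks(sentence):
    """Delete the position marks but keep the words they enclose."""
    return MARK.sub(r"\1", sentence)


keywords = ["hometown", "Shanxi"]
s1_marked = mark_keywords("my hometown is Shanxi, and it is beautiful there", keywords)

# Assumed output of back-translation with the marks preserved.
s11 = "my {home} is {Shanxi}, and it is very beautiful there"

s11_replaced = replace_marked(s11, keywords)
s12 = strip_marks(s11_replaced)
print(s12)  # prints my hometown is Shanxi, and it is very beautiful there
```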
Step 30806: performing sentence component unit conversion processing on the written language text and the back-translated written language text respectively to obtain the spoken language text.
Specifically, after the back-translated written language text corresponding to the written language text has been obtained through back-translation processing, the written language text and the back-translated written language text may each be converted to obtain the corresponding spoken language texts, so as to further expand the spoken language text.
During specific implementation, the spoken language text obtained after conversion can be further screened to obtain the spoken language text with relatively accurate semantic expression, so that the accuracy of spoken language conversion of the written language text is further improved.
Optionally, the sentence component unit comprises at least one of: clause unit, word unit, character unit and symbol unit.
In practical applications, written language text is usually composed of written language sentences, and a written language sentence is usually composed of a plurality of sentence component units, which include clause units (clauses), word units (words), character units (characters), symbol units (punctuation marks), and the like. Each sentence component unit may differ between written expression and spoken expression. Therefore, the written language sentence can be converted for each sentence component unit, so that it takes on more characteristics of spoken expression in its clause units, word units, character units, symbol units, and the like.
For example, in the case that the written language sentence is "today's weather is sunny, there is not a cloud for miles, and it is suitable for going out to play", the written language sentence includes 3 clauses: clause 1 is "today's weather is sunny", clause 2 is "there is not a cloud for miles", and clause 3 is "it is suitable for going out to play". The 3 clauses are separated by commas in the written language sentence. Correspondingly, a word unit refers to a word in a written language sentence. A character unit refers to a character in a written language sentence, where a character can be understood as a word in English or a single character in Chinese, without limitation here. A symbol unit refers to a punctuation mark in a written language sentence, such as a comma, quotation mark, or dash, without limitation here. In a specific embodiment, the written language sentences in the written language text may be adjusted, rewritten, or otherwise processed at the clause level, and/or at the word level, and/or at the character level, and/or at the symbol level.
In specific implementation, the written language text and the back-translated written language text serve as different written language corpora for constructing the sample corpus, so sentence component unit conversion processing needs to be performed on each of them to obtain the corresponding spoken language texts, which is specifically implemented in the embodiment of the present application in the following manner:
performing sentence composition unit conversion processing on the written language text to obtain a first spoken language text corresponding to the written language text;
performing sentence component unit conversion processing on the back-translated written language text to obtain a second spoken language text corresponding to the back-translated written language text;
and taking the first spoken text and the second spoken text as the spoken text.
The first spoken language text is the spoken language text obtained by converting the written language text. The second spoken language text is the spoken language text obtained by converting the back-translated written language text.
Following the above example, sentence component unit conversion processing is performed on the written language text LT to obtain the first spoken language text ST1 corresponding to the written language text LT, and on the back-translated written language text LT3 to obtain the second spoken language text ST2 corresponding to the back-translated written language text LT3. The first spoken language text ST1 and the second spoken language text ST2 are taken as the spoken language text.
In conclusion, by performing sentence component unit conversion processing on the written language text and on the back-translated written language text respectively, two corresponding spoken language texts are obtained and both are used as the spoken language text, which expands the spoken language text.
In practical application, although there may be many differences between spoken expression and written expression, these differences do not appear in every sentence but occur with a certain probability depending on the speaker's habits of expression. To make the converted text more consistent with the characteristics of spoken language, a corresponding conversion processing probability may be set for each conversion processing strategy, and whether to execute the strategy is determined according to that probability, which is specifically implemented in the following manner:
determining conversion processing probability corresponding to a conversion processing strategy of the written language text to be processed;
determining a target conversion processing strategy to be executed in the conversion processing strategies based on the conversion processing probability;
and performing sentence component unit conversion processing on the written language text by executing a target conversion processing strategy to obtain a spoken language text corresponding to the written language text to be processed.
The conversion processing strategy refers to a preset method (strategy) for performing conversion processing on the written language text to be processed. Specifically, the conversion processing strategy may include at least one of the following: a clause conversion processing strategy (a strategy of performing clause unit conversion processing on written language sentences), a word conversion processing strategy (a strategy of performing word unit conversion processing on written language sentences), a character conversion processing strategy (a strategy of performing character unit conversion processing on written language sentences), and a symbol conversion processing strategy (a strategy of performing symbol unit conversion processing on written language sentences).
The clause conversion processing strategy can be copy processing (namely, a copy clause conversion processing strategy) of the clauses, out-of-order processing, flip processing and/or the like. The word conversion processing strategy can be adding processing, repeated processing, out-of-order processing and/or the like of words. The character conversion processing policy may be character out-of-order processing or the like. The symbol conversion processing strategy may be deleting symbol processing, adding symbol processing, and/or modifying symbol processing, etc.
Specifically, the conversion processing probability corresponding to a conversion processing strategy refers to the probability of executing that strategy. In practical applications, each conversion processing strategy may have its own conversion processing probability, and the target conversion processing strategy to be executed is determined among the conversion processing strategies based on these probabilities. Take conversion processing strategy A, whose conversion processing probability is 10%, as an example. A value range may be set, such as 1-100 (or 1-10, etc.), and within it a value interval whose share of the value range equals the conversion processing probability of strategy A, such as 1-10 (or 91-100). A value in the range 1-100 is then generated at random. If the generated value is 9, it falls within 1-10 and therefore satisfies the 10% probability, that is, the conversion processing probability corresponding to strategy A, so strategy A is determined to be executed and is taken as a target conversion processing strategy. If the generated value is 50, it falls within 11-100 and does not satisfy the 10% probability, so it is determined that strategy A is not executed. Other conversion processing strategies may be handled in the same manner.
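The value-range check for strategy A can be sketched as follows, assuming the value range 1-100 and the interval 1-10 from the example. The function names are hypothetical, and the seeded random generator is used only to make the illustration repeatable.

```python
import random


def in_interval(value, probability_percent):
    """True if the value falls in the interval 1..probability_percent."""
    return 1 <= value <= probability_percent


def should_execute(probability_percent, rng):
    """Draw a value in 1..100 and test it against the strategy's interval."""
    value = rng.randint(1, 100)  # both endpoints are included
    return in_interval(value, probability_percent)


print(in_interval(9, 10))   # True: strategy A is executed
print(in_interval(50, 10))  # False: strategy A is not executed

# Over many draws the strategy is executed in roughly 10% of the cases.
rng = random.Random(0)
executed = sum(should_execute(10, rng) for _ in range(10_000))
print(executed / 10_000)
```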
Further, the determined target conversion processing strategy can be one or more. In the case where the target conversion processing strategies are plural, the target conversion processing strategies may be sequentially executed in a preset execution order to convert the written language text to be processed.
It should be noted that, because each conversion processing strategy has a corresponding conversion processing probability, and each conversion processing strategy also has a certain randomness, multiple conversion processes are performed on the same written language text to be processed, and the finally generated spoken language text is likely to be different. Therefore, in order to further expand the corpus, the conversion processing of the sentence forming unit can be performed on at least one written language text to be processed for multiple times, so as to obtain multiple spoken language texts corresponding to the written language text to be processed.
In addition, setting a higher conversion processing probability for a conversion processing strategy may increase the complexity of the sample corpus. The more complex the sample corpus, the more complex the written language rewriting performed by the rewriting model trained on it. When there are multiple rewriting models for different text types, complex rewriting is not suitable for a target spoken language text of a fuzzy text type. Therefore, when constructing the sample corpus of the rewriting model corresponding to target spoken language texts of the fuzzy text type, a lower conversion processing probability may be set for the conversion processing strategies.
Following the above example, assume that there are 4 conversion processing strategies for the written language text, and that the conversion processing probabilities determined for them for the written language text LT are 2%, 6%, 0.8%, and 8%. For each conversion processing strategy, a value range is set together with a value interval corresponding to its conversion processing probability, and a number is generated at random. If the number falls within the value interval, the conversion processing strategy corresponding to that probability is determined as a target conversion processing strategy, and the target conversion processing strategy is executed to perform sentence component unit conversion processing on the written language text LT, obtaining the spoken language text corresponding to the written language text LT.
In summary, a corresponding conversion processing probability is set for each conversion processing strategy, that is, each strategy is executed with a certain probability rather than being applied to every text, which ensures the naturalness and reasonableness of the written language conversion.
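Selecting the target strategies for the four probabilities can be sketched as below. Because 0.8% is not a whole number of values in a 1-100 range, the sketch assumes a value range of 1-1000. Which strategy carries which probability is not stated in the text, so the pairing in `STRATEGIES` and all names are assumptions for illustration.

```python
import random

SCALE = 1000  # value range 1..1000, so that 0.8% maps to the interval 1..8

# Assumed pairing of strategies with the probabilities 2%, 6%, 0.8%, 8%.
STRATEGIES = [("clause", 2.0), ("word", 6.0), ("character", 0.8), ("symbol", 8.0)]


def interval_upper_bound(probability_percent, scale=SCALE):
    """Upper end of the value interval 1..k whose share of the range is the probability."""
    return round(probability_percent * scale / 100)


def select_targets(strategies, rng, scale=SCALE):
    """Return the strategies whose random draw falls inside their value interval."""
    targets = []
    for name, probability_percent in strategies:
        if rng.randint(1, scale) <= interval_upper_bound(probability_percent, scale):
            targets.append(name)
    return targets


bounds = [interval_upper_bound(p) for _, p in STRATEGIES]
print(bounds)  # prints [20, 60, 8, 80]

# Each call may return a different, usually empty or very short, list.
print(select_targets(STRATEGIES, random.Random()))
```

The returned list keeps the order of `STRATEGIES`, which can serve as the preset execution order when more than one strategy is selected.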
In specific implementation, since the back-translated written language text is generated in order to expand the written language text, spoken language conversion needs to be performed on both the written language text and the back-translated written language text. Therefore, either of the two may be taken as the written language text to be processed and subjected to sentence component unit conversion processing. In the case that the sentence component unit is a clause unit, this is specifically implemented through the following steps 30806-2 to 30806-6:
Step 30806-2: performing sentence recognition on the written language text to be processed to obtain the written language sentences in the written language text to be processed.
Performing sentence recognition on the written language text to be processed can be understood as performing sentence segmentation on it. In practical application, the text is segmented (recognized) according to sentence-delimiting symbols to obtain at least one written language sentence contained in the written language text to be processed.
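Sentence segmentation by sentence-delimiting symbols can be sketched as follows. The set of delimiters in `SENTENCE_END` is an assumption; the text does not list the symbols used.

```python
import re

# Assumed sentence-delimiting symbols (Chinese and Western full stops, etc.).
SENTENCE_END = "。！？.!?"

# One sentence = a run of non-delimiters followed by any trailing delimiters.
_SENTENCE = re.compile("[^{0}]+[{0}]*".format(re.escape(SENTENCE_END)))


def split_sentences(text):
    """Split a text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


text = "My hometown is Shanxi. It is beautiful there! Is it far?"
sentences = split_sentences(text)
print(sentences)
# prints ['My hometown is Shanxi.', 'It is beautiful there!', 'Is it far?']
```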
Step 30806-4: performing clause unit conversion on the written language sentence to obtain a converted written language sentence.
Furthermore, clause unit conversion is performed on each recognized written language sentence to obtain the converted written language sentence corresponding to each written language sentence.
Specifically, because there are various ways of performing clause unit conversion on the written language sentences contained in the written language text to be processed, in the embodiment of the present application the written language sentences may be converted by either of the following two methods or a combination of them:
Method one: performing clause sampling on the written language sentence according to a preset clause sampling rule to obtain a target clause in the written language sentence; and converting the target clause in the written language sentence to obtain the converted written language sentence.
In practical applications, since a written sentence may include a plurality of clauses, and the clauses do not necessarily have expression differences between written languages and spoken languages, clauses that need to be converted in clause units may be selected from the written sentences, and then the selected clauses may be converted.
The preset clause sampling rule refers to a preset rule for sampling clauses in written language sentences. The preset clause sampling rule may be random sampling; sampling by position, for example sampling the clause arranged first in the written language sentence; or sampling by number of characters, for example sampling the clauses with fewer than 5 characters; and so on, without limitation here. Correspondingly, the target clause refers to the clause obtained by sampling the written language sentence according to the preset clause sampling rule.
Once the target clause has been obtained, it can be converted within the written language sentence. In specific implementation, because there are also various ways of converting the selected target clause, in order to increase the naturalness and richness of the converted written language sentence, the target clause may be converted by any one of the following three conversion modes or any combination of them:
Mode A: copying the target clause to obtain a copied target clause, and inserting the copied target clause into the written language sentence at a preset clause insertion position to obtain the converted written language sentence.
In practical applications, spoken expression may contain expressions that do not appear in written language sentences, for example "right, right", "good, good", and the like. In order to make the converted written language text more consistent with the characteristics of spoken language, some colloquial clauses may be added to the written language sentences.
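The three sampling rules can be sketched as follows. Clauses are assumed to be separated by commas, the function names are hypothetical, and the character threshold is a parameter because the English example clauses are longer than 5 characters.

```python
import random
import re


def split_clauses(sentence):
    """Split a written language sentence into clauses at (Chinese or Western) commas."""
    body = sentence.rstrip(".。!?！？ ")
    return [c.strip() for c in re.split(r"[,，]", body) if c.strip()]


def sample_random(clauses, rng):
    """Random sampling: any clause may become the target clause."""
    return rng.choice(clauses)


def sample_by_position(clauses, position=0):
    """Sampling by position: by default the clause arranged first."""
    return clauses[position]


def sample_by_length(clauses, max_chars):
    """Sampling by number of characters: clauses shorter than max_chars."""
    return [c for c in clauses if len(c) < max_chars]


sentence = "today's weather is sunny, not a cloud for miles, suitable for going out to play."
clauses = split_clauses(sentence)

print(sample_by_position(clauses))    # today's weather is sunny
print(sample_by_length(clauses, 22))  # ['not a cloud for miles']
print(sample_random(clauses, random.Random()))  # varies from run to run
```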
Specifically, the preset clause inserting position refers to a preset position for inserting the target clause into the written sentence, and the preset position may be set according to the actual spoken language characteristics, for example, the preset clause inserting position may be a beginning or an end of the written sentence, or may be before or after the position of the target clause in the written sentence.
Following the above example, assume that the written language text LT is taken as the written language text to be processed. Sentence recognition is performed on the written language text LT to obtain the n written language sentences it contains, namely written language sentences S1, S2, …, Sn. Taking the written language sentence S1 "my hometown is Shanxi, and it is beautiful there" as an example, clause sampling is performed on it at random, and the target clause obtained in S1 is "my hometown is Shanxi". The target clause is copied to obtain the copied target clause "my hometown is Shanxi". When the preset clause insertion position is before the position of the target clause, the copied target clause is inserted into S1 at that position, and the converted written language sentence S13 is obtained: "my hometown is Shanxi, my hometown is Shanxi, and it is beautiful there".
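Mode A can be sketched on a list of clauses as follows. The function name and the representation of a sentence as a list of clauses joined by commas are assumptions made for illustration.

```python
def copy_clause(clauses, target_index, insert_index):
    """Copy the target clause and insert the copy at the preset insertion position."""
    converted = list(clauses)  # leave the original sentence untouched
    converted.insert(insert_index, clauses[target_index])
    return converted


clauses = ["my hometown is Shanxi", "it is beautiful there"]

# Preset insertion position: directly before the target clause (index 0).
converted = copy_clause(clauses, target_index=0, insert_index=0)
print(", ".join(converted))
# prints my hometown is Shanxi, my hometown is Shanxi, it is beautiful there
```

Passing `insert_index=len(clauses)` would instead place the copy at the end of the sentence.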
Mode B: deleting the target clause in the written language sentence; and inserting the target clause into the deleted written language sentence according to a preset clause insertion rule to obtain the converted written language sentence.
In practical applications, speakers may not pay close attention to the order of clauses, so the order of clauses in a spoken sentence may differ from that in the written language sentence. To make the converted written language sentence better match the characteristics of spoken language, the positions of some clauses in the written language sentence can be adjusted.
Specifically, the preset clause insertion rule is a preset rule for inserting the target clause, and may be set according to practical experience. For example, the rule may be random insertion (that is, the target clause is randomly inserted before or after any clause in the written language sentence), insertion after the first clause, insertion at the end of the sentence, and the like.
It should be noted that the conversion processing of Mode A and that of Mode B can be executed selectively: Mode B may be applied to the converted written language sentence obtained by Mode A, Mode B may be applied directly to the original written language sentence, or Mode A may be applied to the converted written language sentence obtained by Mode B. The other conversion processing can likewise be executed selectively and/or sequentially.
Following the above example, and still taking written language sentence S1 as an example, clause sampling is performed on S1 "my hometown is Shanxi, where it is beautiful", and the target clause obtained is "my hometown is Shanxi". The target clause is deleted from S1. When the preset clause insertion rule is random insertion, the target clause "my hometown is Shanxi" is randomly inserted before or after any clause of the deleted written language sentence S1, and the converted written language sentence S13 is, for example: "where it is beautiful, my hometown is Shanxi".
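A minimal sketch of Mode B (delete the target clause, then reinsert it according to a preset rule) is given below. The function name `move_clause`, the rule names ("random", "after_first", "end"), and the ", " clause separator are assumptions introduced for illustration only.

```python
import random


def move_clause(sentence, target_index, rule="random", rng=None, sep=", "):
    """Delete one clause and reinsert it according to a preset rule (Mode B)."""
    rng = rng or random.Random()
    clauses = sentence.split(sep)
    target = clauses.pop(target_index)            # delete the target clause
    if rule == "random":
        insert_at = rng.randint(0, len(clauses))  # any gap, including both ends
    elif rule == "after_first":
        insert_at = min(1, len(clauses))
    elif rule == "end":
        insert_at = len(clauses)
    else:
        raise ValueError(f"unknown insertion rule: {rule}")
    clauses.insert(insert_at, target)
    return sep.join(clauses)


s1 = "my hometown is Shanxi, where it is beautiful"
print(move_clause(s1, 0, rule="end"))
# where it is beautiful, my hometown is Shanxi
```

With the "random" rule the clause can land back in its original slot, in which case the sentence is unchanged; a caller that always wants a visible change could resample.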
Mode C: carrying out syntactic analysis on the target clause to obtain a syntactic structure corresponding to the target clause; and converting the target clause according to the target syntactic structure corresponding to the syntactic structure to obtain the converted written language sentence.
In practical applications, clauses with different grammatical (syntactic) structures can express the same meaning, so the word order of a clause in a spoken sentence may differ from the syntactic structure used in the written language sentence. Therefore, to make the converted written language sentence better match the characteristics of spoken language, the syntactic structure of some clauses of the written language sentence can be changed, for example by inversion.
Specifically, syntactic analysis may be performed on the sampled target clause using a rule-based or a statistics-based syntactic analysis method to obtain the syntactic structure corresponding to the target clause. The syntactic structure may be a subject-predicate-object structure, an object-predicate-subject structure, and the like, which is not limited herein. Accordingly, the target syntactic structure is a preset syntactic structure corresponding to the syntactic structure of the target clause; the syntactic structure and the target syntactic structure can be converted into each other. For example, the syntactic structure may be an active subject-predicate structure, and the target syntactic structure may be a passive subject-predicate structure.
Following the above example, and still taking as an example the target clause "my hometown is Shanxi" obtained by random clause sampling from written language sentence S1 "my hometown is Shanxi, where it is beautiful": the syntactic structure of the target clause is a subject-predicate-object structure, and the corresponding target syntactic structure is an object-predicate-subject structure. The target clause is converted into the object-predicate-subject structure, and the converted target clause becomes "Shanxi is my hometown". Accordingly, the converted written language sentence S13 is: "Shanxi is my hometown, where it is beautiful".
The second method: determining the clause position probability distribution corresponding to the preset clauses contained in a preset clause set; determining, among the preset clauses, a target preset clause and the clause adding position corresponding to the target preset clause based on the clause position probability distribution; and adding the target preset clause to the written language sentence at the clause adding position to obtain the converted written language sentence.
The preset clause set is a preset set containing at least one colloquial clause. Correspondingly, a preset clause is a clause contained in the preset clause set. The clause position probability distribution is the position probability distribution of each preset clause, obtained in advance by counting the positions (such as the beginning, end, or middle of a sentence) at which the preset clauses occur in a spoken language corpus. In practical applications, the frequency with which each preset clause occurs at each position can be counted, and the position probability distribution is then calculated from the counted frequencies.
Assume that the preset clause set contains 3 preset clauses: preset clause 1, preset clause 2, and preset clause 3. According to statistics on a spoken language corpus in the sales field, with counted frequencies of 60, 20, and 20 respectively, the probability that preset clause 1 is added to the beginning of a sentence is 60/(60+20+20) = 60%, the probability that preset clause 2 is added to the end of a sentence is 20/(60+20+20) = 20%, and the probability that preset clause 3 is added to the end of a sentence is also 20/(60+20+20) = 20%. These 3 probabilities form the clause position probability distribution corresponding to the preset clauses.
Further, based on the clause position probability distribution, the target preset clause and its clause adding position (the position in the written language sentence at which the target preset clause is added) can be determined among the preset clauses. In a specific implementation, a value range such as 1-100 (or 1-10, etc.) may be preset and divided into value intervals whose sizes match the clause position probability distribution, such as 1-60, 61-80, and 81-100; a value within the range 1-100 is then generated at random. If the generated value is 9, it falls within 1-60, the interval with the 60% probability, that is, the probability of adding preset clause 1 to the beginning of the sentence. The target preset clause is therefore determined to be preset clause 1, and the clause adding position corresponding to it is the beginning of the sentence.
Still further, in the case where the target preset clause is "right" and the target preset clause "right" is added to the beginning of written language sentence S1, the converted written language sentence S13 is: "right, my hometown is Shanxi, where it is beautiful".
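The interval-based selection described for the second method can be sketched as follows. The sketch works directly on the counted frequencies so that the intervals 1-60, 61-80, and 81-100 are exact integers. The helper names (`build_intervals`, `pick`, `add_clause`) and the three clause strings are hypothetical stand-ins, since the source only names the clauses "preset clause 1" to "preset clause 3".

```python
import random


def build_intervals(counts):
    """Turn occurrence counts into consecutive 1-based integer intervals."""
    intervals, upper = [], 0
    for key, n in counts.items():
        intervals.append((upper + 1, upper + n, key))
        upper += n
    return intervals, upper  # upper is the size of the value range


def pick(intervals, draw):
    """Return the (clause, position) pair whose interval contains the draw."""
    for lower, upper, key in intervals:
        if lower <= draw <= upper:
            return key
    raise ValueError("draw is outside the value range")


def add_clause(sentence, clause, position, sep=", "):
    # Only sentence-start and sentence-end positions are handled in this sketch.
    if position == "start":
        return clause + sep + sentence
    return sentence + sep + clause


# Assumed counts: (preset clause, position) -> frequency in a spoken corpus.
counts = {
    ("right", "start"): 60,
    ("okay", "end"): 20,
    ("you know", "end"): 20,
}
intervals, total = build_intervals(counts)
draw = random.Random().randint(1, total)   # random value in 1..100
clause, position = pick(intervals, draw)
print(add_clause("my hometown is Shanxi, where it is beautiful", clause, position))
```

A draw of 9 falls in the first interval, so `pick(intervals, 9)` selects the first clause at the sentence start, matching the worked example in the text.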
Step 30806-6, determining a spoken text based on the converted written language sentence.
When there are a plurality of converted written language sentences, the converted written language sentences may be combined in the order of arrangement of the original written language sentences in the written language text to generate a spoken language text.
Following the above example, at least one conversion process is performed on at least one of the n written language sentences contained in the written language text LT, yielding n converted written language sentences S13, S23, ..., Sn3, which are combined to generate the spoken language text ST1.
In summary, by applying clause-unit conversion such as clause copying, clause reordering, and/or clause addition, the written language sentence is rewritten colloquially, so that the converted written language sentence better matches the characteristics of spoken language.
In addition, in the case that the sentence component unit is a word unit, there are also various ways of performing word-unit conversion processing on the written language text to be processed. The first implementation provided by the embodiment of the present application is as follows:
performing sentence recognition on the written text to be processed to obtain written sentences contained in the written text to be processed;
determining word position probability distribution corresponding to preset words contained in a preset word set;
determining a target preset word and a word adding position corresponding to the target preset word in the preset words according to the word position probability distribution, and inserting the target preset word into the written language sentence according to the word adding position to obtain a converted written language sentence;
determining spoken text based on the converted written language sentence.
In practical applications, colloquial words are often added spontaneously in spoken expression. Such words may include conjunctions, modal particles, or other colloquial words, for example interjections or "actually"; they do not normally appear in written language sentences. To make the converted written language sentence better match the characteristics of spoken language, colloquial words can be added to the written language sentence.
Specifically, the preset term set refers to a preset set including at least one spoken term. Correspondingly, the preset words refer to words contained in the preset word set. The term position probability distribution refers to position probability distribution of each preset term obtained by counting the occurrence positions (such as the beginning, end, or middle position of a sentence) of the preset terms in a certain speech corpus set in advance. In practical application, the occurrence frequency of each preset word at each position can be counted, and the position probability distribution is calculated according to the counted frequency.
Assume that the preset word set contains 2 preset words, preset word 1 and preset word 2. According to statistics on a spoken language corpus in the sales field, preset word 1 appears 80 times at the beginning of a sentence and preset word 2 appears 20 times at the end of a sentence. The probability that preset word 1 is added to the beginning of a sentence is therefore 80/(80+20) = 80%, and the probability that preset word 2 is added to the end of a sentence is 20/(80+20) = 20%. These 2 probabilities form the word position probability distribution corresponding to the preset words.
Specifically, for the implementation of determining, among the preset words, the target preset word and its word adding position (the position in the written language sentence at which the target preset word is added) according to the word position probability distribution, reference may be made to the above implementation of determining the target preset clause and its clause adding position based on the clause position probability distribution, which is not repeated here.
Once the target preset word and its word adding position have been determined, the target preset word can be added at the word adding position in the written language sentence to obtain the converted written language sentence. Further, for the implementation of determining the spoken language text based on the converted written language sentence, reference may be made to the corresponding part of the clause-unit conversion processing above, which is not repeated here.
Following the above example, written language sentence S1 among the n written language sentences contained in the written language text LT is taken as an example. The preset word set contains 2 preset words, preset word 1 and preset word 2, with the following word position probability distribution: the probability of adding preset word 1 to the beginning of a sentence is 80%, and the probability of adding preset word 2 to the end of a sentence is 20%. Assume that, according to this distribution, the target preset word is determined to be preset word 1 and its word adding position is the beginning of the sentence. In the case that preset word 1 is "actually", it is added to the beginning of S1, and the converted written language sentence S13 is: "actually my hometown is Shanxi, where it is beautiful". The n converted written language sentences are then combined to generate the spoken language text ST1.
In summary, word-unit word addition is applied to the written language sentence to rewrite it colloquially, so that the converted written language sentence better matches the characteristics of spoken language.
In a specific implementation, a speaker may habitually repeat a colloquial word after adding it. Therefore, to make the converted written language sentence better match the characteristics of spoken language, the added word in the written language sentence can be copied, which is implemented in the embodiment of the present application as follows:
copying the target preset words added in the converted written language sentences to obtain copied words;
inserting the copied words into the converted written language sentences according to preset word insertion rules to obtain the inserted written language sentences;
determining spoken text based on the inserted written language sentence.
Specifically, the preset word insertion rule is a preset rule for inserting the copied word into the converted written language sentence, and may be set according to actual spoken language characteristics. For example, the rule may be to insert the copied word immediately before or after the target preset word, or at another position in the written language sentence, which is not limited herein.
Following the above example, on the basis of the converted written language sentence S13 "actually my hometown is Shanxi, where it is beautiful", the target preset word "actually" is copied to obtain the copied word "actually". When the preset word insertion rule is insertion before the target preset word, the copied word "actually" is added to S13 to obtain the inserted written language sentence S14: "actually actually my hometown is Shanxi, where it is beautiful". The n inserted written language sentences are then combined to generate the spoken language text ST1.
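The word-repetition step can be illustrated with a short sketch. It assumes the sentence has already been split into words (here simply on whitespace), that the index of the previously added word is known, and that the added word is "actually"; the function name `repeat_added_word` and the rule names are hypothetical.

```python
def repeat_added_word(words, added_index, rule="before"):
    """Copy the previously added word and insert the copy next to it."""
    result = list(words)                  # leave the caller's list untouched
    copied = result[added_index]
    insert_at = added_index if rule == "before" else added_index + 1
    result.insert(insert_at, copied)
    return result


# Whitespace splitting is a stand-in for real word segmentation.
s13 = "actually my hometown is Shanxi, where it is beautiful".split()
s14 = " ".join(repeat_added_word(s13, 0, "before"))
print(s14)
# actually actually my hometown is Shanxi, where it is beautiful
```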
In summary, on top of the word-unit word addition applied to the written language sentence, the added words are further repeated, so that the converted written language sentence better matches the characteristics of spoken language.
In the case that the sentence component unit is a word unit, in addition to the word addition processing described above, the second implementation provided by the embodiment of the present application is as follows:
carrying out sentence recognition on the written language text to be processed to obtain written language sentences contained in the written language text;
carrying out word sampling on words in the written language sentences according to preset word sampling rules to obtain target words in the written language sentences;
deleting the target words in the written language sentences, and inserting the target words into a preset insertion range corresponding to the target words in the deleted written language sentences to obtain converted written language sentences;
determining spoken text based on the converted written language sentence.
In spoken expression, speakers may not pay close attention to word order, so the order of words in a spoken sentence may differ from that in the written language sentence. To make the converted written language sentence better match the characteristics of spoken language, the positions of some words in the written language sentence can be adjusted.
Specifically, the preset word sampling rule is a preset rule for sampling the words to be reordered in the written language sentence. It may be random sampling, or sampling according to a preset number of characters, for example sampling words of 3 characters in the written language sentence. Accordingly, the target word is a word sampled from the written language sentence by the word sampling rule.
The preset insertion range is a preset range within which the insertion is performed. It can be set according to practical experience or spoken expression habits. Specifically, the preset insertion range corresponding to the target word may be the character interval from 3 characters before the target word's position in the written language sentence to 3 characters after it, abbreviated as [-3, 3]; alternatively, it may be the clause to which the target word belongs. The target word is then inserted at a random position within the preset insertion range.
Following the above example, written language sentence S1 among the n written language sentences contained in the written language text LT is taken as an example. A word is randomly sampled from S1, and the target word obtained is "where". The target word is deleted from S1, and the deleted written language sentence S1 is "my hometown is Shanxi, it is beautiful". When the preset insertion range is the clause to which the target word belongs, the target word is inserted within the preset insertion range of S1, and the converted written language sentence S13 is, for example: "my hometown is Shanxi, it is beautiful where". The n converted written language sentences are then combined to generate the spoken language text ST1.
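A minimal sketch of word reordering within the clause range follows. It assumes the sentence is already segmented into clauses and words (the nested-list representation and the name `move_word_within_clause` are illustrative choices, not part of the embodiment).

```python
import random


def move_word_within_clause(clauses, clause_index, word_index, rng=None):
    """Delete a target word and reinsert it at a random slot in its own clause."""
    rng = rng or random.Random()
    result = [list(clause) for clause in clauses]   # copy, do not mutate input
    clause = result[clause_index]
    target = clause.pop(word_index)                 # delete the target word
    clause.insert(rng.randint(0, len(clause)), target)
    return result


# S1 as two pre-segmented clauses (segmentation is assumed to be given).
s1 = [["my", "hometown", "is", "Shanxi"], ["where", "it", "is", "beautiful"]]
converted = move_word_within_clause(s1, clause_index=1, word_index=0)
print(", ".join(" ".join(clause) for clause in converted))
```

Restricting the insertion to the word's own clause keeps the other clauses intact, which is the property the clause-range option in the text relies on.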
In summary, word-unit reordering is applied to the written language sentence, so that the converted written language sentence better matches the characteristics of spoken language.
In the case where sentence component units are character units, it is considered that sometimes the expression order of characters is not strictly followed in the spoken language expression process, and therefore a case may occur in which the expression order of characters does not coincide with the expression order of characters in a written language sentence. In order to make the converted written language more conform to the characteristics of the spoken language, some characters of the written language sentence can be subjected to position adjustment processing, and the embodiment of the application is specifically realized in the following way:
performing sentence recognition on the written text to be processed to obtain written sentences contained in the written text to be processed;
carrying out character sampling on characters in the written language sentence according to a preset character sampling rule to obtain target characters in the written language sentence;
deleting target characters in the written language sentences, and inserting the target characters into preset character insertion ranges corresponding to the target characters in the deleted written language sentences to obtain converted written language sentences;
determining spoken text based on the converted written language sentence.
Specifically, the preset character sampling rule is a preset rule for sampling the characters to be reordered in the written language sentence. It may be random sampling, or sampling according to a preset character position, for example sampling the character at the 5th position in the written language sentence, which is not limited herein.
Accordingly, the preset character insertion range corresponding to the target character may be the character interval from 3 characters before the target character's position in the written language sentence to 3 characters after it, abbreviated as [-3, 3]; it may also be the clause in which the target character is located, and the like, which is not limited herein.
Following the above example, written language sentence S1 among the n written language sentences contained in the written language text LT is taken as an example. A character is randomly sampled from S1, and the target character obtained is "beautiful" (a single character in the original-language sentence). The target character is deleted from S1, and the deleted written language sentence S1 is "my hometown is Shanxi, where it is very". When the preset character insertion range corresponding to the target character is the clause to which the target character belongs, the target character "beautiful" is inserted within the preset character insertion range of S1, and the converted written language sentence S13 is, for example: "my hometown is Shanxi, where beautiful is very". The n converted written language sentences are then combined to generate the spoken language text ST1.
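The [-3, 3] character interval can be sketched as follows. The function name `move_char_in_window` and the `radius` parameter are assumptions for illustration; the sketch clamps the window at the sentence boundaries, which the source does not discuss.

```python
import random


def move_char_in_window(sentence, char_index, radius=3, rng=None):
    """Delete one character and reinsert it within [-radius, +radius] of its position."""
    rng = rng or random.Random()
    chars = list(sentence)
    target = chars.pop(char_index)              # delete the target character
    low = max(0, char_index - radius)           # clamp at the sentence start
    high = min(len(chars), char_index + radius) # clamp at the sentence end
    chars.insert(rng.randint(low, high), target)
    return "".join(chars)


source = "abcdefghij"
print(move_char_in_window(source, 5))  # "f" ends up at index 2..8
```

The new index of the moved character differs from the old one by at most `radius`, and every other character keeps its relative order.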
In summary, character-unit reordering is applied to the written language sentence, so that the converted written language sentence better matches the characteristics of spoken language.
In the case that the sentence component unit is a symbol unit: in spoken expression, the breaks and connections between sentences may not be clearly marked, or may be marked arbitrarily, so the symbols appearing in a spoken sentence may not match those in the written language sentence. To make the converted written language sentence better match the characteristics of spoken language, symbol-unit conversion processing can be performed on the written language text to be processed by either of the following two methods or a combination of both:
Conversion method one: performing sentence recognition on the written language text to be processed to obtain the written language sentences contained in it; performing symbol sampling on the written language sentence according to a preset symbol sampling rule to obtain a target punctuation mark in the written language sentence, and deleting the target punctuation mark from the written language sentence to obtain the converted written language sentence; determining the spoken language text based on the converted written language sentence.
Specifically, the preset symbol sampling rule refers to a preset sampling rule for sampling a symbol to be deleted in a written sentence. The preset symbol sampling rule may be random sampling, or may be sampling according to a preset position, for example, a punctuation mark after a first clause in a written sentence is sampled, which is not limited herein. Correspondingly, the target punctuation mark refers to punctuation marks sampled from written language sentences according to a preset symbol sampling rule.
Following the above example, written language sentence S1 among the n written language sentences contained in the written language text LT is taken as an example. Symbol sampling is performed on S1 at random, and the target punctuation mark obtained is the comma after the clause "my hometown is Shanxi". The target punctuation mark is deleted from S1 to obtain the converted written language sentence S13: "my hometown is Shanxi where it is beautiful". The n converted written language sentences are then combined to generate the spoken language text ST1.
Conversion method two: performing sentence recognition on the written language text to be processed to obtain the written language sentences contained in it; performing symbol clause sampling on the written language sentence according to a preset symbol clause sampling rule to obtain a target symbol clause in the written language sentence, and inserting a preset punctuation mark into the target symbol clause to obtain the converted written language sentence; determining the spoken language text based on the converted written language sentence.
Specifically, the preset symbol clause sampling rule is a preset rule for sampling the clause of the written language sentence into which a symbol is to be added. It may be random sampling, or sampling according to the number of characters in a clause, for example sampling the clause with the most characters in the written language sentence, which is not limited herein. Correspondingly, the preset punctuation mark is a punctuation mark preset for insertion; in practical applications, it can be inserted into the target symbol clause at a random position or at a preset position, which is not limited herein. The target symbol clause is the clause sampled from the written language sentence according to the preset symbol clause sampling rule.
Following the above example, written language sentence S1 among the n written language sentences contained in the written language text LT is taken as an example. When the preset symbol clause sampling rule is to sample the clause with the most characters, symbol clause sampling is performed on S1, and the target symbol clause obtained is "my hometown is Shanxi". In the case that the preset punctuation mark is "!", the preset punctuation mark is inserted into the target symbol clause of S1, and the converted written language sentence S13 is: "my hometown is Shanxi! where it is beautiful". The n converted written language sentences are then combined to generate the spoken language text ST1.
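The two symbol-unit conversion methods can be sketched together. Several choices here are assumptions made only for the example: the punctuation set, the ", " clause separator, appending the preset mark to the end of the target clause while keeping the existing separator, and the function names `delete_punctuation` and `insert_punctuation`.

```python
import random

PUNCTUATION = ",.!?;:"  # assumed set of marks eligible for deletion


def delete_punctuation(sentence, rng=None):
    """Conversion method one: delete one randomly sampled punctuation mark."""
    rng = rng or random.Random()
    positions = [i for i, ch in enumerate(sentence) if ch in PUNCTUATION]
    if not positions:
        return sentence                      # nothing to delete
    i = rng.choice(positions)
    return sentence[:i] + sentence[i + 1:]


def insert_punctuation(sentence, mark="!", sep=", "):
    """Conversion method two: append a preset mark to the longest clause."""
    clauses = sentence.split(sep)
    # max() keeps the earliest clause when several share the largest length.
    longest = max(range(len(clauses)), key=lambda i: len(clauses[i]))
    clauses[longest] += mark
    return sep.join(clauses)


s1 = "my hometown is Shanxi, where it is beautiful"
print(delete_punctuation(s1))   # my hometown is Shanxi where it is beautiful
print(insert_punctuation(s1))   # my hometown is Shanxi!, where it is beautiful
```

Because S1 contains a single punctuation mark, the deletion result is deterministic here; with several marks, the sampled one varies between runs.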
In conclusion, the written sentence is subjected to conversion processing of deleting symbols and adding symbols of symbol units, so that the written sentence is subjected to spoken language rewriting, and the converted written sentence is more in line with the characteristics of spoken language.
Step 30808: constructing a sample corpus based on the correspondence between the spoken language text and the written language text or the back-translated written language text.
Specifically, the spoken language text is obtained by converting the written language text or the back-translated written language text, so a correspondence exists between the spoken language text and the written language text or the back-translated written language text. Based on this correspondence, an aligned written language-spoken language sample corpus can be generated.
The sample corpus refers to training sample pairs used for model training. In practice, the generated written language text-spoken language text training sample pairs can be used to train a written language rewriting model that converts spoken language text into written language text. When training the written language rewriting model, the spoken language text in the sample corpus is used as the training sample, and the written language text in the sample corpus is used as the sample label corresponding to the spoken language text.
In practical applications, the spoken language text obtained through conversion processing may contain abnormal data, which seriously affects the quality of the spoken language text. To guarantee the quality of the generated spoken language text, data cleaning can be performed on the abnormal data in the spoken language text, which is implemented in the embodiment of the present application as follows:
identifying abnormal information in the spoken language text;
cleaning the spoken language text according to the abnormal information to obtain a cleaned spoken language text;
and constructing a sample corpus based on the correspondence between the cleaned spoken language text and the written language text or the back-translated written language text.
The abnormal information may be wrongly written characters, repeated punctuation marks, mixed Chinese and English punctuation marks, special symbols, stop words, and other abnormal information. In addition, the abnormal information may be semantically ambiguous or semantically unreasonable information, which is not limited herein. In practical applications, the abnormal information in the spoken language text can be identified by preset abnormality identification rules, or by a pre-trained text cleaning model. In a specific implementation, the text cleaning model can be a deep contextual model for grammatical error correction, used to perform grammar detection.
Furthermore, after the abnormal information in the spoken language text is identified, in the case that there are a plurality of spoken language texts, the spoken language texts containing abnormal information can be deleted directly, leaving the spoken language texts without abnormal information (i.e., the cleaned spoken language texts). Alternatively, the abnormal information in a spoken language text may be deleted or corrected to obtain the cleaned spoken language text, which is not limited herein. If any spoken language text is deleted, the corresponding written language text or back-translated written language text also needs to be deleted.
In a specific implementation, considering that the written language text used may also contain abnormal information, data cleaning can be performed on both the written language text and the spoken language text.
Following the above example, the written language text LT is converted, as the written language text to be processed, to obtain the spoken language text ST1, and the back-translated written language text LT3 is converted, as the text to be processed, to obtain the spoken language text ST2. The abnormal information identified in ST1 consists of the symbols "" and !, while no abnormal information is found in ST2. ST1 is therefore cleaned based on the abnormal information to obtain the cleaned spoken language text ST1, and ST2 is used directly as the cleaned spoken language text ST2. Based on the correspondence between the written language text LT and the cleaned spoken language text ST1, sample corpus pair 1 is constructed from LT and the cleaned ST1. Based on the correspondence between the back-translated written language text LT3 and the cleaned spoken language text ST2, sample corpus pair 2 is constructed from LT3 and the cleaned ST2. Sample corpus pair 1 and sample corpus pair 2 together serve as the sample corpus.
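The cleaning and pairing steps can be sketched as below. This is a deliberately narrow illustration: it treats only repeated punctuation marks as abnormal information (the source lists many more kinds and also allows a trained text cleaning model), and the names `has_anomaly`, `clean`, `build_corpus`, and the `drop` flag are hypothetical.

```python
import re

# Assumed anomaly rule: the same punctuation mark repeated two or more times.
REPEATED_PUNCT = re.compile(r"([,.!?;:])\1+")


def has_anomaly(text):
    return bool(REPEATED_PUNCT.search(text))


def clean(text):
    """Collapse each run of a repeated mark into a single mark."""
    return REPEATED_PUNCT.sub(r"\1", text)


def build_corpus(pairs, drop=False):
    """pairs: (written_text, spoken_text). Spoken text is the input, written text the label."""
    corpus = []
    for written, spoken in pairs:
        if has_anomaly(spoken):
            if drop:
                continue          # deleting the spoken text also drops its written text
            spoken = clean(spoken)
        corpus.append({"input": spoken, "label": written})
    return corpus


pairs = [("LT", "ST1 with noise!!"), ("LT3", "ST2")]
print(build_corpus(pairs))
print(build_corpus(pairs, drop=True))
```

The `drop` flag mirrors the two options in the text: either delete an abnormal spoken language text together with its written counterpart, or correct the abnormal information and keep the pair.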
In conclusion, data cleaning is performed on the converted spoken language text, and the sample corpus is constructed from the cleaned spoken language text, which guarantees the quality of the sample corpus and further improves the accuracy of model training.
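The rule-based variant of the data cleaning described above can be sketched as follows. This is a minimal illustration, not the method of the application: the two rules (repeated punctuation, Chinese punctuation mixed with English punctuation) are taken from the list of abnormal information, while the regular expressions and the names `has_anomaly` and `clean_pairs` are assumptions introduced here.

```python
import re

# Illustrative rule set (an assumption): only two of the listed anomaly kinds.
REPEATED_PUNCT = re.compile(r"([!?,.;:！？，。；：])\1+")
CHINESE_PUNCT = re.compile(r"[！？，。；：]")
ENGLISH_PUNCT = re.compile(r"[!?,.;:]")


def has_anomaly(text: str) -> bool:
    """Return True if the text trips one of the illustrative rules."""
    if REPEATED_PUNCT.search(text):
        return True
    # Chinese punctuation mixed with English punctuation in the same text.
    return bool(CHINESE_PUNCT.search(text) and ENGLISH_PUNCT.search(text))


def clean_pairs(pairs):
    """Drop every (written, spoken) pair whose spoken side is anomalous.

    Dropping the whole pair keeps the written language text and its
    spoken counterpart in sync, as the description requires.
    """
    return [(written, spoken) for written, spoken in pairs if not has_anomaly(spoken)]
```

Deleting the pair is only one of the two options in the description; correcting the abnormal information in place would need a separate correction step that is not sketched here.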
Referring to fig. 4, fig. 4 is a schematic diagram of constructing a sample corpus in the text processing method according to an embodiment of the present application. After the written language text is obtained, it can be back-translated in order to expand it. During back-translation, the sentences in the written language text are analyzed lexically and grammatically, and the key entity words in each sentence are restored (replaced back) according to the analysis result, so that the back-translated corpus remains consistent with the key information in the written language text. The back-translated corpus and the written language text are then jointly input, as a data source, into a spoken language data generation module for spoken language conversion. The spoken language data generation module performs clause-level, word-level, character-level and symbol-level conversion processing on the written language sentences in the data source.
The clause-level conversion processing comprises conversion processing such as clause repetition, clause generation and clause disordering performed on written language sentences at the clause level; the word-level conversion processing comprises conversion processing such as adding words, repeating words and disordering words performed on written language sentences at the word level; the character-level conversion processing comprises conversion processing such as character disordering performed on written language sentences at the character level; and the symbol-level conversion processing comprises conversion processing such as symbol deletion and symbol insertion performed on written language sentences at the symbol level.
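As a rough illustration of the word-level, character-level and symbol-level conversions listed above, the following sketch implements one operation per level. The function names, the punctuation set and the choice of a single random edit per call are assumptions made for this example; the application does not specify them.

```python
import random

# Punctuation treated as "symbols" in this sketch (an assumption).
PUNCT = set(",.!?;:，。！？；：")


def repeat_word(words, rng):
    """Word level: duplicate one randomly chosen word in place."""
    if not words:
        return list(words)
    i = rng.randrange(len(words))
    return list(words[: i + 1]) + [words[i]] + list(words[i + 1 :])


def swap_adjacent_chars(text, rng):
    """Character level: swap one random pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_symbol(text, rng):
    """Symbol level: delete one randomly chosen punctuation mark, if any."""
    positions = [i for i, ch in enumerate(text) if ch in PUNCT]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + text[i + 1 :]
```

Passing a seeded `random.Random` instance as `rng` keeps the generated spoken language text reproducible between runs.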
After the data source is subjected to spoken language conversion by the spoken language data generation module, an initial spoken language text is obtained. Data cleaning is then performed on the initial spoken language text to remove the abnormal information (namely erroneous information, erroneous data or erroneous punctuation) it contains, and the spoken language text corresponding to the data source is output.
In the present application, the text structure and the syntactic and grammatical characteristics of spoken language text are summarized through research and analysis. Standard written language text is back-translated to expand it, and the expanded written language text is then converted to generate the spoken expression of the corresponding written language text.
Step 30810: Train the initial written language rewriting model with the sample corpus until a written language rewriting model meeting a second training stop condition is obtained.
Specifically, once the sample corpus has been constructed, the initial written language rewriting model can be trained with it; after training is completed, a written language rewriting model capable of written language rewriting is obtained.
The initial written language rewriting model may be a to-be-trained written language rewriting model constructed based on a Seq2Seq model, in which both the encoder and the decoder may be constructed using a Transformer model. Accordingly, the second training stop condition is the condition for stopping the training of the initial written language rewriting model based on the sample corpus. The second training stop condition may be that the loss value between the sample written language text and the predicted written language text, which the model generates by performing written language rewriting on a spoken language text in the sample corpus, is smaller than a preset loss value, or that the number of training iterations reaches a preset number, for example, 5 or 6, which is not limited herein. Accordingly, the written language rewriting model may be understood as a trained model for performing written language rewriting on spoken language text.
In practical application, the loss function for calculating the model loss value can be a 0-1 loss function, an absolute value loss function, a square loss function, a cross entropy loss function and the like. Taking the absolute value loss function as an example, refer to the following formula 2:
$L(Y, f(x)) = |Y - f(x)|$    (Formula 2)
Where L represents the loss value, f(x) represents the predicted written language text, and Y represents the sample written language text. The present application does not limit the choice of loss function, which depends on the actual application.
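For reference, the four loss functions named above can be written out for numeric inputs as in the sketch below. It assumes that Y and f(x) are available as numbers (or, for cross entropy, as a target index and a predicted probability distribution); how text is mapped to such values is not described here, and the function names are illustrative.

```python
import math


def zero_one_loss(y, pred):
    """0-1 loss: 0 when the prediction matches the label, 1 otherwise."""
    return 0.0 if y == pred else 1.0


def absolute_loss(y, pred):
    """Absolute value loss, L(Y, f(x)) = |Y - f(x)| (formula 2)."""
    return abs(y - pred)


def squared_loss(y, pred):
    """Square loss, (Y - f(x))^2."""
    return (y - pred) ** 2


def cross_entropy_loss(target_index, probs):
    """Cross entropy for one sample: -log of the probability of the target."""
    return -math.log(probs[target_index])
```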
After the model loss value is calculated, the model parameters of the initial written language rewriting model can be adjusted through back-propagation according to the model loss value, and the initial written language rewriting model continues to be trained on the next batch of sampled sample corpora until the training stop condition is reached, whereupon the trained written language rewriting model is obtained.
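The interplay of the two stop conditions (loss below a preset value, or a preset number of iterations reached) can be illustrated with a toy training loop. The sketch below is an assumption-laden stand-in: a single scalar weight replaces the Seq2Seq model, the absolute value loss of formula 2 is minimized by subgradient descent, and the names `train`, `loss_threshold` and `max_iterations` are hypothetical.

```python
def train(samples, lr=0.1, loss_threshold=0.05, max_iterations=100):
    """Fit y ~ w * x and report which stop condition ended training."""
    w = 0.0
    for iteration in range(1, max_iterations + 1):
        residuals = [y - w * x for x, y in samples]
        # Mean absolute value loss over the batch.
        loss = sum(abs(r) for r in residuals) / len(samples)
        if loss < loss_threshold:
            return w, iteration, "loss"  # stop: loss below the preset value
        # Subgradient of |y - w*x| with respect to w is -sign(y - w*x) * x.
        grad = sum(
            -((r > 0) - (r < 0)) * x for r, (x, _) in zip(residuals, samples)
        ) / len(samples)
        w -= lr * grad
    return w, max_iterations, "iterations"  # stop: iteration budget used up
```

With `samples = [(1, 2), (2, 4), (3, 6)]` the loop stops on the loss condition once `w` is close to 2; lowering `max_iterations` to 5 makes the iteration condition fire first.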
In a specific implementation, the written language rewriting model is trained on a rich sample corpus and can therefore handle more complicated sentence rewriting, so it can perform relatively complex rewriting on a target spoken language text of the standard text type.
In addition, the text type obtained for the target spoken language text may also be a fuzzy text type. In this case, in order to ensure the reasonableness and accuracy of written language rewriting, the embodiment of the application is specifically realized by the following steps:
in the case where the text type is the fuzzy text type, selecting a corresponding written language conversion model according to the fuzzy text type;
inputting the target spoken language text into the written language conversion model for processing to obtain a converted written language text corresponding to the target spoken language text;
wherein the written language conversion model is trained on the written language text and on the basic spoken language text obtained by converting the written language text.
Since the target spoken language text is of the fuzzy text type, its semantic expression is fuzzy. A target spoken language text of the fuzzy text type needs to be input into a written language conversion model that lightly rewrites spoken language text, so as to obtain the written language text (the converted written language text) corresponding to the target spoken language text. This is because, if a semantically fuzzy target spoken language text were rewritten in a complex way, the semantics of the rewritten text could become fuzzier or deviate from the original. Therefore, for a target spoken language text of the fuzzy text type, it is sufficient for the written language conversion model to rewrite simple colloquial words, modal particles and the like.
Assuming that the text type output by the text classification model for the target spoken text TST is the fuzzy text type, the target spoken text TST is input into the written language conversion model, and the target written language text TLT2 output by the written language conversion model is obtained.
In conclusion, the target spoken language text of the fuzzy text type is slightly rewritten through the written language conversion model, so that the reasonable rewriting of the spoken language texts of different types is realized, and the quality of written language rewriting is guaranteed.
In specific implementation, the training of the written language conversion model is realized by the following steps:
acquiring a written language text;
carrying out conversion processing of sentence composition units on the written language text to obtain a basic spoken language text;
constructing a basic sample corpus based on the corresponding relation between the written language text and the basic spoken language text;
and training the initial written language conversion model through the basic sample corpus until the written language conversion model meeting the first training stop condition is obtained.
The basic spoken language text is a spoken language text generated by converting the acquired written language text. In practical application, for the specific implementation of acquiring the written language text and performing conversion processing of sentence composition units on it to obtain the basic spoken language text, reference may be made to the conversion processing described above, which is not repeated here.
Correspondingly, the basic sample corpus refers to a sample corpus constructed by taking the basic spoken language text as a training sample and the written language text corresponding to it as a sample label. The first training stop condition is the condition for stopping the training of the initial written language conversion model based on the basic sample corpus. Similarly, the first training stop condition may be that the loss value between the sample written language text and the predicted written language text, which the model generates by performing written language rewriting on a spoken language text in the basic sample corpus, is smaller than a preset loss value, or that the number of training iterations reaches a preset number, such as 5 or 6, which is not limited herein. Accordingly, the written language conversion model can be understood as a model, trained on the basic sample corpus, for performing written language rewriting on spoken language text.
In a specific implementation, training the initial written language conversion model with the basic sample corpus until a written language conversion model meeting the first training stop condition is obtained is similar to training the initial written language rewriting model with the sample corpus until a written language rewriting model meeting the second training stop condition is obtained; reference may be made to the implementation described above, which is not repeated here.
It should be noted that the written language text in the basic sample corpus is not expanded through back-translation, so the basic sample corpus is simpler than the sample corpus constructed above. Therefore, the written language rewriting performed by the written language conversion model, which is trained on the basic sample corpus, is simpler than that performed by the written language rewriting model.
In conclusion, the written language conversion model is trained with the basic sample corpus, so that it can lightly rewrite a target spoken language text of the fuzzy text type, making the written language rewriting more reasonable.
Further, the text type may also be an invalid text type, in which case the target spoken language text is deleted. An invalid text type indicates that the target spoken language text contains no semantic information, so performing written language rewriting on it would produce a result that likewise has no semantic information. Thus, a target spoken language text of the invalid text type can be deleted directly, without written language rewriting.
In conclusion, directly deleting a target spoken language text of the invalid text type avoids wasting computing resources on invalid spoken language text, thereby saving computational cost.
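The three-way handling by text type described above (standard, fuzzy, invalid) amounts to a simple dispatch, sketched below. The string labels, the `Model` callable type and the function name `rewrite_by_type` are assumptions for illustration; the two models are passed in as plain callables standing in for the trained written language rewriting model and written language conversion model.

```python
from typing import Callable, Optional

# A model is represented here as any callable from text to text (an assumption).
Model = Callable[[str], str]


def rewrite_by_type(
    text: str,
    text_type: str,
    rewrite_model: Model,
    conversion_model: Model,
) -> Optional[str]:
    """Dispatch a target spoken language text according to its text type."""
    if text_type == "standard":
        return rewrite_model(text)  # relatively complex rewriting
    if text_type == "fuzzy":
        return conversion_model(text)  # light rewriting only
    if text_type == "invalid":
        return None  # deleted: no rewriting, no model call
    raise ValueError(f"unknown text type: {text_type!r}")
```

Returning `None` for the invalid type means neither model is invoked, which is where the saving in computing resources comes from.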
According to the text processing method provided by the embodiment of the application, the target spoken language text is obtained and classified to obtain its text type. In the case where the text type is the standard text type, the corresponding written language rewriting model is selected according to the standard text type, so that a written language rewriting model suited to the target spoken language text is chosen. The target spoken language text is then input into the written language rewriting model for processing to obtain the corresponding target written language text, which makes the written language rewriting more targeted and improves its accuracy. The written language rewriting model is trained on the written language text and on the spoken language text obtained by performing back-translation and conversion processing on the written language text. Preprocessing the written language text through back-translation and conversion provides a large sample corpus of paired spoken language texts and written language texts for model training, which reduces the difficulty of training the model, avoids the laborious and time-consuming manual collection and processing of large amounts of text data, and saves time and labor costs.
The following describes the text processing method with reference to fig. 5 by taking an application of the text processing method provided in the present application in an actual scene as an example. Fig. 5 shows a processing flow chart of a text processing method applied to an actual scene according to an embodiment of the present application, which specifically includes the following steps:
Step 502: Acquire the written language text.
Specifically, the written language text may be written language text of any field, such as the medical field, the chemical field, the sales field, the daily life field, the travel field, and the like, without limitation. The number of written language texts may be one or more, which is not limited herein.
Taking the sales field as an example, the written language text T of the sales field is obtained.
Step 504: Identify, through part-of-speech analysis of the written language text, the keywords whose part of speech is a preset part of speech.
Part-of-speech analysis is performed on each word contained in the written language text T to obtain the part of speech of each word. In the case where the preset part of speech is noun, the nouns in the written language text T are identified as keywords.
Step 506: Mark the positions of the keywords in the written language text.
Based on this, assume that the keywords identified in the written language text T are "computer" and "speed", and that the written language sentence SS to which these keywords belong is: "I use the computer, the speed is very fast, and it is very convenient". The positions of the keywords are marked in the written language text T with the asterisk "*", and the marked written language text T is obtained once marking is complete. In the marked written language text T, the written language sentence SS becomes: "I use computer, speed is fast and very convenient", with the keywords "computer" and "speed" marked by asterisks.
Step 508: Translate the marked written language text into a translated written language text in a preset language.
Specifically, the preset language may be any one or more languages such as English, French, Korean, and the like, which is not limited herein.
Based on this, when the preset language is English, the marked written language text T is translated into English to obtain the English translated written language text T1 corresponding to the marked written language text T.
Step 510: Translate the translated written language text into the target language to which the written language text belongs to obtain an initial back-translated written language text.
Specifically, since the target language of the text content in the written language text T is Chinese, the English translated written language text T1 is translated into Chinese to obtain the initial back-translated written language text T2 corresponding to T1. In the initial back-translated written language text T2, the written language sentence SS2 corresponding to the written language sentence SS reads: "I adopt computers, efficiency is fast and very convenient".
Step 512: Replace the target keywords corresponding to the position marks in the initial back-translated written language text with the keywords to obtain the back-translated written language text.
Specifically, a target keyword corresponding to a position mark is a word marked by that position mark in the initial back-translated written language text, and each target keyword corresponds to one of the keywords. In practical application, by combining part-of-speech analysis of the written language text, the positions of words with a specific part of speech in the written language sentences are marked and the words are replaced back during back-translation, so that the key information in the written language text remains unchanged as far as possible after back-translation.
Based on this, the target keywords corresponding to the marked positions in the initial back-translated written language text T2 are "computer" and "efficiency". Replacing "computer" in T2 with the keyword "computer" and "efficiency" in T2 with the keyword "speed" yields the back-translated written language text T3, in which the written language sentence SS3 corresponding to the written language sentence SS reads: "I adopt a computer to carry out operations, the speed is high, and it is very convenient".
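Steps 506 to 512 can be sketched as a mark, translate, restore pipeline. In the sketch below the two translation functions are supplied by the caller and are purely hypothetical stand-ins for a translation system; the sketch also assumes that the asterisk marks survive translation and that the marked words keep their relative order, which the application does not state explicitly.

```python
import re

# A marked span looks like *word* (assumed marking format).
MARK = re.compile(r"\*(.+?)\*")


def mark_keywords(text, keywords):
    """Step 506: wrap every keyword in asterisks."""
    for kw in keywords:
        text = text.replace(kw, f"*{kw}*")
    return text


def restore_keywords(back_translated, keywords):
    """Step 512: replace the i-th marked span with the i-th original keyword.

    If there are more marked spans than keywords, the extra spans keep
    their translated wording and only lose the marks.
    """
    remaining = iter(keywords)
    return MARK.sub(lambda m: next(remaining, m.group(1)), back_translated)


def back_translate(text, keywords, to_pivot, from_pivot):
    """Steps 506-512 with caller-supplied (hypothetical) translators."""
    marked = mark_keywords(text, keywords)
    initial = from_pivot(to_pivot(marked))  # steps 508 and 510
    return restore_keywords(initial, keywords)
```

The `keywords` list is expected in order of appearance in the text, so that positional matching in `restore_keywords` puts "speed" back where the translator produced "efficiency".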
Step 514: Take each of the written language text and the back-translated written language text as a written language text to be processed, and perform sentence recognition on each written language text to be processed to obtain the written language sentences it contains.
Specifically, the written language text T and the back-translated written language text T3 are each taken as a text to be processed, and sentence recognition is performed on each text to be processed in turn to obtain the written language sentences it contains. Further, the following steps 516 to 522 are performed for each written language sentence in each text to be processed.
Based on this, assume that the written language text T is taken as the text to be processed. Sentence recognition is performed on it to obtain the n written language sentences it contains, namely written language sentence 1, written language sentence 2, ..., and written language sentence n, and the following steps 516 to 522 are performed on each of the n written language sentences.
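The application does not say how sentence recognition is performed. One simple possibility, shown here only as an assumption, is to split on sentence-ending punctuation; the pattern and the name `split_sentences` are illustrative.

```python
import re

# A sentence is a run of non-terminators followed by its terminator(s).
SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]*")


def split_sentences(text):
    """Return the sentences of a text, keeping their end punctuation."""
    return [m.strip() for m in SENTENCE.findall(text) if m.strip()]
```

This rule is deliberately naive: a period inside a number or an abbreviation would also be treated as a sentence boundary.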
Step 516: the written sentence is converted into a clause unit, and a converted A4 written sentence is obtained.
Specifically, the clause unit conversion processing is performed on any written language sentence, and is specifically realized by executing the following steps 516-1 to 516-18:
step 516-1: and determining the copy clause conversion processing probability corresponding to the copy clause conversion processing strategy in each clause conversion processing strategy of the written language sentence.
The phrase conversion processing policy refers to a processing policy for copying phrases in written sentences, and the phrase conversion processing probability refers to a preset probability for executing the phrase conversion processing policy. The probability of the sentence transfer processing for duplication may be preset according to actual experience or spoken language expression habits, for example, the probability of the sentence transfer processing for duplication may be 10%, 20%, 30%, or the like, and is not limited herein. In the case where the probability of the duplicate clause conversion processing is 10%, it indicates that the duplicate clause conversion processing policy is executed with a probability of 10% for the written language sentence.
Based on this, assuming that the written sentence 1 is the written sentence SS "i use a computer, which is fast and convenient", and the probability of the duplicate clause conversion processing corresponding to the duplicate clause conversion processing policy is 10%, the probability of the duplicate clause conversion processing for executing the duplicate clause conversion processing policy with respect to the written sentence 1 is 10%.
Step 516-2: whether to execute a copy clause conversion processing policy for the written language sentence is determined based on the copy clause conversion processing probability.
Specifically, if it is determined to execute the duplicate clause conversion processing policy based on the duplicate clause conversion processing probability, the following step 516-3 is executed; if it is determined that the copy clause line feed processing policy is not to be executed, the written sentence is directly used as the A1 th written sentence, and the following step 516-5 is executed.
Step 516-3: under the condition that a strategy for executing the conversion processing of the duplicate clauses is determined, clause sampling is carried out on the written language sentences according to a first preset sampling rule, and a first target clause in the written language sentences is obtained.
Specifically, the first preset sampling rule is a preset sampling rule for sampling a clause to be copied in a written language sentence. The first preset sampling rule may be random sampling, or sampling according to a position, for example, a clause whose sampling position is arranged at a first position in a written sentence, or sampling according to a number of characters, for example, a clause whose number of characters in a sampling clause is less than 5. The first preset sampling rule may be the same as the preset clause sampling rule in the method embodiment, and may also be understood as one of the preset clause sampling rules in the method embodiment. Accordingly, the first target clause refers to a clause sampled in the written language sentence according to the first preset sampling rule, and may also be understood as a target clause in the above method embodiment.
Based on this, in the case where it is determined to execute the strategy of duplicate clause conversion processing, clause sampling is randomly performed on the written sentence 1, and the first target clause in the written sentence 1 is obtained as "fast speed".
Step 516-4: and copying the first target clause to obtain a copy target clause, inserting the copy target clause into the written language sentence according to a preset clause inserting position, and obtaining a converted A1 written language sentence.
Specifically, the preset clause inserting position refers to a preset position for inserting the target first clause into the written sentence, and the preset position may be set according to the actual spoken language characteristics, for example, the preset clause inserting position may be a beginning or an end of the written sentence, or may be before or after the position of the first target clause in the written sentence, which is not limited herein.
Based on this, the first target clause is copied at a high speed, the obtained copied target clause is also copied at a high speed, and when the preset clause insertion position is before the position of the first target clause, the copied target clause is inserted before the position of the first target clause in the written sentence 1, and the converted written sentence a1 is obtained as follows: "I use the computer, it is fast, very fast, and very convenient".
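Steps 516-1 to 516-4 can be sketched as a single function: a random draw decides whether the policy runs, a clause is sampled at random, and its copy is inserted immediately before it. Splitting clauses on ", " and the name `copy_clause` are assumptions for this example; random sampling and "insert before the target clause" are just one of the options the description allows.

```python
import random


def copy_clause(sentence, probability, rng, sep=", "):
    """Copy one randomly sampled clause with the given probability."""
    # Steps 516-1/516-2: execute the policy only with the preset probability.
    if rng.random() >= probability:
        return sentence  # policy not executed; sentence passes through
    clauses = sentence.split(sep)
    # Step 516-3: random sampling of the first target clause.
    i = rng.randrange(len(clauses))
    # Step 516-4: insert the copy before the position of the target clause.
    clauses.insert(i, clauses[i])
    return sep.join(clauses)
```

For example, `copy_clause("I use the computer, it is fast, it is convenient", 0.1, random.Random(0))` leaves the sentence unchanged in roughly nine calls out of ten across different seeds.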
Step 516-5: Determine the add clause conversion processing probability corresponding to the add clause conversion processing policy among the clause conversion processing policies of the A1-th written language sentence.
The add clause conversion processing policy refers to a processing policy for adding clauses to written language sentences. Accordingly, the add clause conversion processing probability refers to a preset probability of executing the add clause conversion processing policy. It may likewise be preset according to practical experience or spoken language expression habits; for example, it may be 15%, 20%, or the like, which is not limited herein. An add clause conversion processing probability of 15% indicates that the add clause conversion processing policy is executed on the written language sentence with a probability of 15%.
Based on this, the add clause conversion processing probability corresponding to the add clause conversion processing policy of the A1-th written language sentence is determined to be 15%.
Step 516-6: Determine, based on the add clause conversion processing probability, whether to execute the add clause conversion processing policy on the A1-th written language sentence.
Specifically, determining whether to execute the add clause conversion processing policy on the A1-th written language sentence based on the add clause conversion processing probability is similar to determining whether to execute the copy clause conversion processing policy on the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, which is not repeated here.
In a specific implementation, if it is determined to execute the add clause conversion processing policy, the following step 516-7 is executed; if it is determined not to execute it, the A1-th written language sentence is directly used as the A2-th written language sentence, and the following step 516-9 is executed.
Based on this, assume that it is determined, based on the add clause conversion processing probability of 15%, to execute the add clause conversion processing policy on the A1-th written language sentence.
Step 516-7: In the case where it is determined to execute the add clause conversion processing policy, determine the clause position probability distribution corresponding to the preset clauses contained in the preset clause set.
Specifically, assume that the preset clause set contains 3 preset clauses, namely preset clause 1, preset clause 2, and preset clause 3. The clause position probability distribution corresponding to the 3 preset clauses is as follows: the probability of adding preset clause 1 to the beginning of the sentence is 60%, the probability of adding preset clause 2 to the end of the sentence is 20%, and the probability of adding preset clause 3 to the beginning of the sentence is 20%.
Step 516-8: Based on the clause position probability distribution, determine a target preset clause among the preset clauses and the clause adding position corresponding to it, and add the target preset clause to the A1-th written language sentence at the clause adding position to obtain the converted A2-th written language sentence.
Specifically, based on the clause position probability distribution, it is determined that the target preset clause is preset clause 1 and that the corresponding clause adding position is the beginning of the sentence, so preset clause 1 is added to the beginning of the A1-th written language sentence. In the case where preset clause 1 is "pair", the converted A2-th written language sentence is: "pair, I use the computer, it is fast, very fast, and very convenient".
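Steps 516-7 and 516-8 can be sketched as a weighted draw over (preset clause, position) entries. The probabilities below mirror the 60% / 20% / 20% example; the clause texts are placeholders, and the tuple layout, the "head"/"tail" labels and the name `add_clause` are assumptions for this example.

```python
import random

# (preset clause, adding position, probability); clause texts are placeholders.
DISTRIBUTION = [
    ("preset clause 1", "head", 0.6),
    ("preset clause 2", "tail", 0.2),
    ("preset clause 3", "head", 0.2),
]


def add_clause(sentence, distribution, rng, sep=", "):
    """Draw one preset clause by its probability and add it at its position."""
    weights = [probability for _, _, probability in distribution]
    clause, position, _ = rng.choices(distribution, weights=weights, k=1)[0]
    if position == "head":
        return clause + sep + sentence
    if position == "tail":
        return sentence + sep + clause
    raise ValueError(f"unknown position: {position!r}")
```

Calling `add_clause(sentence, DISTRIBUTION, random.Random(0))` returns the sentence with exactly one preset clause attached at the beginning or the end.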
Step 516-9: Determine the out-of-order clause conversion processing probability corresponding to the out-of-order clause conversion processing policy among the clause conversion processing policies of the A2-th written language sentence.
The out-of-order clause conversion processing policy refers to a processing policy for shuffling the order of clauses in written language sentences. Accordingly, the out-of-order clause conversion processing probability refers to a preset probability of executing the out-of-order clause conversion processing policy. It may likewise be preset according to practical experience or spoken language expression habits; for example, it may be 5%, 10%, or the like, which is not limited herein. An out-of-order clause conversion processing probability of 5% indicates that the out-of-order clause conversion processing policy is executed on the written language sentence with a probability of 5%.
Based on this, the out-of-order clause conversion processing probability corresponding to the out-of-order clause conversion processing policy of the A2-th written language sentence is determined to be 5%.
Step 516-10: Determine, based on the out-of-order clause conversion processing probability, whether to execute the out-of-order clause conversion processing policy on the A2-th written language sentence.
Specifically, determining whether to execute the out-of-order clause conversion processing policy on the A2-th written language sentence based on the out-of-order clause conversion processing probability is similar to determining whether to execute the copy clause conversion processing policy on the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, which is not repeated here.
In a specific implementation, if it is determined to execute the out-of-order clause conversion processing policy, the following step 516-11 is executed; if it is determined not to execute it, the A2-th written language sentence is directly used as the A3-th written language sentence, and the following step 516-13 is executed.
Based on this, assume that it is determined, based on the out-of-order clause conversion processing probability of 5%, to execute the out-of-order clause conversion processing policy on the A2-th written language sentence.
Step 516-11: in the case where it is determined to execute the out-of-order clause conversion processing strategy, perform clause sampling on the A2-th written language sentence according to a second preset sampling rule to obtain a second target clause in the A2-th written language sentence.
Specifically, the second preset sampling rule refers to a preset rule for sampling the clause to be reordered in a written language sentence. The second preset sampling rule may be random sampling; sampling by position, for example sampling the clause located last in the written language sentence; or sampling by number of characters, for example sampling a clause containing fewer than 5 characters; and so on, which is not limited herein. In practical applications, the second preset sampling rule may be the same as or different from the first preset sampling rule, which is not limited herein. The second preset sampling rule may also be the same as the preset clause sampling rule in the above method embodiment, or may be understood as one of the preset clause sampling rules in the above method embodiment. Accordingly, the second target clause refers to the clause sampled from the written language sentence according to the second preset sampling rule, and may also be understood as the target clause in the above method embodiment.
Based on this, in the case where it is determined to execute the out-of-order clause conversion processing strategy, clause sampling is performed randomly on the A2-th written language sentence, and the second target clause in the A2-th written language sentence is obtained as "right, right, right".
Step 516-12: delete the second target clause from the A2-th written language sentence, and insert the second target clause into the A2-th written language sentence after deletion according to a preset clause insertion rule, to obtain the converted A3-th written language sentence.
Specifically, the second target clause "right, right, right" is deleted from the A2-th written language sentence, and the A2-th written language sentence after deletion is obtained: "I use the computer, the speed is very fast, the speed is very fast, and it is very convenient". The second target clause "right, right, right" is then randomly inserted into the A2-th written language sentence after deletion, and the converted A3-th written language sentence is obtained: "I use the computer, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right".
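Steps 516-11 and 516-12 can be sketched as follows. This is a minimal sketch that assumes clauses are delimited by a comma separator and that both the sampling rule and the insertion rule are random; the function name `move_clause` and the `sep` parameter are hypothetical.

```python
import random


def move_clause(sentence: str, rng: random.Random, sep: str = ", ") -> str:
    """Delete one sampled clause and reinsert it at a random position."""
    clauses = sentence.split(sep)
    if len(clauses) < 2:
        return sentence  # nothing to reorder
    # Random sampling stands in for the second preset sampling rule.
    target = clauses.pop(rng.randrange(len(clauses)))
    # Random insertion stands in for the preset clause insertion rule;
    # the clause may land back in its original position.
    clauses.insert(rng.randrange(len(clauses) + 1), target)
    return sep.join(clauses)


a2 = "I use the computer, right, the speed is very fast, and it is very convenient"
a3 = move_clause(a2, random.Random(7))
```

The output contains exactly the same clauses as the input, only in a possibly different order.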
Step 516-13: determine the inverted clause conversion processing probability corresponding to the inverted clause conversion processing strategy among the clause conversion processing strategies of the A3-th written language sentence.
The inverted clause conversion processing strategy refers to a processing strategy for inverting the word order of a clause in a written language sentence (for example, inverting a subject-predicate-object structure into an object-first structure). In practice, a clause can express the same meaning under different word orders, so the word order of a clause in a spoken language sentence may differ from that in the corresponding written language sentence. To make the converted written language text better match the characteristics of spoken language, the word order of some clauses of the written language sentence can be inverted. Accordingly, the inverted clause conversion processing probability refers to a preset probability of executing the inverted clause conversion processing strategy. It may likewise be preset according to practical experience or spoken language expression habits, for example 3% or 5%, and is not limited herein. When the inverted clause conversion processing probability is 3%, the inverted clause conversion processing strategy is executed for the written language sentence with a probability of 3%.
Based on this, the inverted clause conversion processing probability corresponding to the inverted clause conversion processing strategy of the A3-th written language sentence is determined to be 3%.
Step 516-14: based on the inverted clause conversion processing probability, determine whether to execute the inverted clause conversion processing strategy for the A3-th written language sentence.
Specifically, the manner of determining whether to execute the inverted clause conversion processing strategy for the A3-th written language sentence based on the inverted clause conversion processing probability is similar to the manner of determining whether to execute the copy clause conversion processing strategy for the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, and details are not repeated here.
In a specific implementation, if it is determined to execute the inverted clause conversion processing strategy, the following step 516-15 is executed; if it is determined not to execute the inverted clause conversion processing strategy, the A3-th written language sentence is directly used as the A4-th written language sentence, and the following step 518 is executed.
Based on this, it is assumed that, based on the inverted clause conversion processing probability of 3%, it is determined to execute the inverted clause conversion processing strategy for the A3-th written language sentence.
Step 516-15: in the case where it is determined to execute the inverted clause conversion processing strategy, perform clause sampling on the A3-th written language sentence according to a third preset sampling rule to obtain a third target clause in the A3-th written language sentence.
Specifically, the third preset sampling rule refers to a preset rule for sampling the clause to be inverted in a written language sentence. The third preset sampling rule may be random sampling; sampling by position, for example sampling the clause located at the beginning of the written language sentence; or sampling by number of characters, for example sampling a clause containing more than 5 characters; and so on, which is not limited herein. In practical applications, the third preset sampling rule may be the same as or different from the first preset sampling rule or the second preset sampling rule, which is not limited herein. The third preset sampling rule may also be the same as the preset clause sampling rule in the above method embodiment, or may be understood as one of the preset clause sampling rules in the above method embodiment. Accordingly, the third target clause refers to the clause sampled from the written language sentence according to the third preset sampling rule, and may also be understood as the target clause in the above method embodiment.
Based on this, in the case where it is determined to execute the inverted clause conversion processing strategy, clause sampling is performed randomly on the A3-th written language sentence, and the third target clause in the A3-th written language sentence is obtained as "I use the computer".
Step 516-17: perform syntactic analysis on the third target clause to obtain the syntactic structure corresponding to the third target clause.
Specifically, syntactic analysis is performed on the third target clause, and the syntactic structure corresponding to the third target clause is obtained as a subject-predicate-object structure.
Step 516-18: convert the third target clause according to the target syntactic structure corresponding to the syntactic structure, to obtain the converted A4-th written language sentence.
Specifically, when the target syntactic structure corresponding to the subject-predicate-object structure is an inverted (object-first) structure, the third target clause is converted according to that target syntactic structure, and the converted A4-th written language sentence is obtained: "The computer is used by me, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right".
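Steps 516-17 and 516-18 can be sketched under the assumption that a syntactic parser has already labeled the clause constituents. The parser itself is out of scope here, and the mapping `TARGET_ORDER`, the structure label `"SVO"`, and the object-subject-predicate target order are hypothetical choices for illustration; the text does not fix them.

```python
from typing import Dict, List

# Hypothetical mapping from a parsed structure to its inverted target structure.
TARGET_ORDER: Dict[str, List[str]] = {
    "SVO": ["object", "subject", "predicate"],
}


def invert_clause(parts: Dict[str, str], structure: str) -> str:
    """Reorder labeled constituents according to the target structure."""
    order = TARGET_ORDER.get(structure)
    if order is None:
        # Unknown structure: keep the constituents in their original order.
        return " ".join(parts.values())
    return " ".join(parts[role] for role in order)


clause = {"subject": "I", "predicate": "use", "object": "the computer"}
print(invert_clause(clause, "SVO"))  # the computer I use
```

The sketch only permutes constituents, which mirrors word-order inversion in the source language; producing a grammatical English passive such as "is used by me" would require additional rewriting that is not modeled here.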
Step 518: perform word unit conversion processing on the A4-th written language sentence to obtain a B3-th written language sentence.
Specifically, after the clause unit conversion processing has been performed on the written language sentence to obtain the A4-th written language sentence, the word unit conversion processing is performed on the A4-th written language sentence through the following steps 518-1 to 518-12:
Step 518-1: determine the added word conversion processing probability corresponding to the added word conversion processing strategy among the word conversion processing strategies of the A4-th written language sentence.
Specifically, the added word conversion processing strategy refers to a processing strategy of adding words to a written language sentence. Accordingly, the added word conversion processing probability refers to a preset probability of executing the added word conversion processing strategy. It may be preset according to practical experience or spoken language expression habits, for example 10% or 13%, and is not limited herein. When the added word conversion processing probability is 10%, the added word conversion processing strategy is executed for the written language sentence with a probability of 10%.
Based on this, for the A4-th written language sentence "The computer is used by me, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right", the added word conversion processing probability corresponding to the added word conversion processing strategy among the word conversion processing strategies is determined to be 10%; that is, the probability of executing the added word conversion processing strategy for the A4-th written language sentence is 10%.
Step 518-2: based on the added word conversion processing probability, determine whether to execute the added word conversion processing strategy for the A4-th written language sentence.
Specifically, the manner of determining whether to execute the added word conversion processing strategy for the A4-th written language sentence based on the added word conversion processing probability is similar to the manner of determining whether to execute the copy clause conversion processing strategy for the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, and details are not repeated here.
In a specific implementation, if it is determined to execute the added word conversion processing strategy, the following step 518-3 is executed; if it is determined not to execute the added word conversion processing strategy, the A4-th written language sentence is directly used as the B1-th written language sentence, and the following step 518-5 is executed.
Based on this, it is assumed that, based on the added word conversion processing probability of 10%, it is determined to execute the added word conversion processing strategy for the A4-th written language sentence.
Step 518-3: in the case where it is determined to execute the added word conversion processing strategy, determine the word position probability distribution corresponding to the preset words contained in the preset word set.
Specifically, the preset word set contains 2 preset words, namely preset word 1 and preset word 2. According to statistics on a spoken language corpus in the sales field, the word position probability distribution corresponding to the two preset words is as follows: the probability of adding preset word 1 at the beginning of the sentence is 80%, and the probability of adding preset word 2 at the end of the sentence is 20%.
Step 518-4: according to the word position probability distribution, determine a target preset word among the preset words and the word adding position corresponding to the target preset word, and add the target preset word to the A4-th written language sentence at the word adding position to obtain the converted B1-th written language sentence.
Specifically, based on the word position probability distribution, the target preset word is determined to be preset word 1 and the word adding position corresponding to preset word 1 is determined to be the sentence head. Preset word 1 is then added at the sentence head of the A4-th written language sentence. In the case where preset word 1 is "Java", the converted B1-th written language sentence is obtained: "The computer is used by me, Java, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right".
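Steps 518-3 and 518-4 amount to a weighted draw over (preset word, adding position) pairs. The sketch below assumes the distribution is stored as a list of triples and that only the positions "start" and "end" occur; the names `WORD_POSITION_DIST` and `add_preset_word` and the placeholder words are hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical distribution: (preset word, adding position, probability).
WORD_POSITION_DIST: List[Tuple[str, str, float]] = [
    ("preset word 1", "start", 0.8),
    ("preset word 2", "end", 0.2),
]


def add_preset_word(
    sentence: str,
    dist: List[Tuple[str, str, float]],
    rng: random.Random,
    sep: str = ", ",
) -> str:
    """Draw one (word, position) pair by its probability and add the word."""
    weights = [prob for _, _, prob in dist]
    word, position, _ = rng.choices(dist, weights=weights, k=1)[0]
    if position == "start":
        return word + sep + sentence
    return sentence + sep + word  # any other position is treated as "end"


b1 = add_preset_word("the speed is very fast", WORD_POSITION_DIST, random.Random(1))
```

In practice the probabilities would come from corpus statistics, as the text describes for the sales-field corpus.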
Step 518-5: determine the duplicate word conversion processing probability corresponding to the duplicate word conversion processing strategy among the word conversion processing strategies of the B1-th written language sentence.
The duplicate word conversion processing strategy refers to a processing strategy for duplicating, in the written language sentence, the target preset word added in step 518-4. Accordingly, the duplicate word conversion processing probability refers to a preset probability of executing the duplicate word conversion processing strategy. It may likewise be preset according to practical experience or spoken language expression habits, for example 8% or 12%, and is not limited herein. When the duplicate word conversion processing probability is 8%, the duplicate word conversion processing strategy is executed for the written language sentence with a probability of 8%.
Based on this, the duplicate word conversion processing probability corresponding to the duplicate word conversion processing strategy of the B1-th written language sentence is determined to be 8%.
Step 518-6: based on the duplicate word conversion processing probability, determine whether to execute the duplicate word conversion processing strategy for the B1-th written language sentence.
Specifically, the manner of determining whether to execute the duplicate word conversion processing strategy for the B1-th written language sentence based on the duplicate word conversion processing probability is similar to the manner of determining whether to execute the copy clause conversion processing strategy for the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, and details are not repeated here.
In a specific implementation, if it is determined to execute the duplicate word conversion processing strategy, the following step 518-7 is executed; if it is determined not to execute the duplicate word conversion processing strategy, the B1-th written language sentence is directly used as the B2-th written language sentence, and the following step 518-8 is executed.
Based on this, it is assumed that, based on the duplicate word conversion processing probability of 8%, it is determined to execute the duplicate word conversion processing strategy for the B1-th written language sentence.
Step 518-7: in the case where it is determined to execute the duplicate word conversion processing strategy, duplicate the target preset word added in the B1-th written language sentence to obtain a duplicate word, and insert the duplicate word into the B1-th written language sentence according to a preset word insertion rule to obtain the B2-th written language sentence after insertion.
Based on this, in the case where it is determined to execute the duplicate word conversion processing strategy, the target preset word "Java" in the B1-th written language sentence is duplicated, and the duplicate word obtained is likewise "Java". In the case where the preset word insertion rule is to insert before the position of the target preset word, the duplicate word is inserted before the position of the target preset word in the B1-th written language sentence, and the B2-th written language sentence after insertion is obtained: "The computer is used by me, Java, Java, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right".
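Step 518-7 can be sketched on a tokenized sentence. The sketch assumes the sentence is already split into word tokens and that the index of the previously added target preset word is known; `duplicate_word` and its `rule` parameter are hypothetical names, and the "after" branch is included only to show how another insertion rule could be plugged in.

```python
from typing import List


def duplicate_word(tokens: List[str], target_index: int, rule: str = "before") -> List[str]:
    """Copy the token at target_index and insert the copy next to it."""
    copied = tokens[target_index]
    # "before" inserts at the target's position; "after" inserts just past it.
    insert_at = target_index if rule == "before" else target_index + 1
    return tokens[:insert_at] + [copied] + tokens[insert_at:]


b1_tokens = ["the computer is used by me", "Java", "the speed is very fast"]
b2_tokens = duplicate_word(b1_tokens, target_index=1)
# b2_tokens == ["the computer is used by me", "Java", "Java", "the speed is very fast"]
```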
Step 518-8: determine the out-of-order word conversion processing probability corresponding to the out-of-order word conversion processing strategy among the word conversion processing strategies of the B2-th written language sentence.
The out-of-order word conversion processing strategy refers to a processing strategy for reordering words in a written language sentence. Accordingly, the out-of-order word conversion processing probability refers to a preset probability of executing the out-of-order word conversion processing strategy. It may likewise be preset according to practical experience or spoken language expression habits, for example 6% or 9%, and is not limited herein. When the out-of-order word conversion processing probability is 6%, the out-of-order word conversion processing strategy is executed for the written language sentence with a probability of 6%.
Based on this, the out-of-order word conversion processing probability corresponding to the out-of-order word conversion processing strategy of the B2-th written language sentence is determined to be 6%.
Step 518-9: based on the out-of-order word conversion processing probability, determine whether to execute the out-of-order word conversion processing strategy for the B2-th written language sentence.
Specifically, the manner of determining whether to execute the out-of-order word conversion processing strategy for the B2-th written language sentence based on the out-of-order word conversion processing probability is similar to the manner of determining whether to execute the copy clause conversion processing strategy for the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, and details are not repeated here.
In a specific implementation, if it is determined to execute the out-of-order word conversion processing strategy, the following step 518-10 is executed; if it is determined not to execute the out-of-order word conversion processing strategy, the B2-th written language sentence is directly used as the B3-th written language sentence, and the following step 520 is executed.
Based on this, it is assumed that, based on the out-of-order word conversion processing probability of 6%, it is determined to execute the out-of-order word conversion processing strategy for the B2-th written language sentence.
Step 518-10: in the case where it is determined to execute the out-of-order word conversion processing strategy, perform word sampling on the words in the B2-th written language sentence according to a preset word sampling rule to obtain a target word in the B2-th written language sentence. Based on this, in the case where it is determined to execute the out-of-order word conversion processing strategy, a word of 2 characters is randomly sampled from the B2-th written language sentence, and the target word in the B2-th written language sentence is obtained as "use".
Step 518-11: delete the target word from the B2-th written language sentence, and insert the target word into the preset insertion range corresponding to the target word in the B2-th written language sentence after deletion, to obtain the converted B3-th written language sentence.
Specifically, the target word "use" is deleted from the B2-th written language sentence, and the B2-th written language sentence after deletion is obtained: "The computer is my, Java, Java, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right". The target word "use" is then randomly inserted within the character interval [-3, 3] around its original position in the B2-th written language sentence after deletion, and the converted B3-th written language sentence is obtained: "Computer use is my, Java, Java, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right".
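Steps 518-10 and 518-11 can be sketched as a delete-and-reinsert operation restricted to a window around the original position. The text expresses the window [-3, 3] in characters; for readability this sketch measures it in token positions instead, which is an assumption, as are the names `move_word` and `window`.

```python
import random
from typing import List


def move_word(tokens: List[str], index: int, rng: random.Random, window: int = 3) -> List[str]:
    """Delete tokens[index] and reinsert it within +/- window positions."""
    target = tokens[index]
    remaining = tokens[:index] + tokens[index + 1:]
    # Clamp the insertion range to the bounds of the shortened sentence.
    low = max(0, index - window)
    high = min(len(remaining), index + window)
    new_index = rng.randint(low, high)  # inclusive on both ends
    return remaining[:new_index] + [target] + remaining[new_index:]


tokens = ["the", "computer", "I", "use"]
shuffled = move_word(tokens, index=3, rng=random.Random(5))
```

Limiting the window keeps the displaced word close to its original context, so the sentence stays largely intelligible while still exhibiting spoken-style disorder.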
Step 520: perform character unit conversion processing on the B3-th written language sentence to obtain a C-th written language sentence.
Specifically, after the word unit conversion processing has been performed on the written language sentence to obtain the B3-th written language sentence, the character unit conversion processing is performed on the B3-th written language sentence through the following steps 520-1 to 520-4:
Step 520-1: determine the out-of-order character conversion processing probability corresponding to the out-of-order character conversion processing strategy of the B3-th written language sentence.
Specifically, the out-of-order character conversion processing strategy refers to a processing strategy for reordering characters in a written language sentence. Accordingly, the out-of-order character conversion processing probability refers to a preset probability of executing the out-of-order character conversion processing strategy. It may likewise be preset according to practical experience or spoken language expression habits, for example 5% or 9%, and is not limited herein. When the out-of-order character conversion processing probability is 5%, the out-of-order character conversion processing strategy is executed for the written language sentence with a probability of 5%.
Based on this, the out-of-order character conversion processing probability corresponding to the out-of-order character conversion processing strategy of the B3-th written language sentence is determined to be 5%.
Step 520-2: based on the out-of-order character conversion processing probability, determine whether to execute the out-of-order character conversion processing strategy for the B3-th written language sentence.
Specifically, the manner of determining whether to execute the out-of-order character conversion processing strategy for the B3-th written language sentence based on the out-of-order character conversion processing probability is similar to the manner of determining whether to execute the copy clause conversion processing strategy for the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, and details are not repeated here.
In a specific implementation, if it is determined to execute the out-of-order character conversion processing strategy, the following step 520-3 is executed; if it is determined not to execute the out-of-order character conversion processing strategy, the B3-th written language sentence is directly used as the C-th written language sentence, and the following step 522 is executed.
Based on this, it is assumed that, based on the out-of-order character conversion processing probability of 5%, it is determined to execute the out-of-order character conversion processing strategy for the B3-th written language sentence.
Step 520-3: in the case where it is determined to execute the out-of-order character conversion processing strategy, perform character sampling on the characters in the B3-th written language sentence according to a preset character sampling rule to obtain a target character in the B3-th written language sentence. Based on this, in the case where it is determined to execute the out-of-order character conversion processing strategy, character sampling is performed randomly on the B3-th written language sentence, and the target character in the B3-th written language sentence is obtained as "make".
Step 520-4: delete the target character from the B3-th written language sentence, and insert the target character into the preset character insertion range corresponding to the target character in the B3-th written language sentence after deletion, to obtain the converted C-th written language sentence. Specifically, the target character "make" is deleted from the B3-th written language sentence, and the B3-th written language sentence after deletion is obtained: "The computer is used by me, Java, Java, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right". The target character is then randomly inserted within the range of the clause to which it belongs in the B3-th written language sentence after deletion, and the converted C-th written language sentence is obtained: "The computer is used by me, Java, Java, the speed is very fast, the speed is very fast, and it is very convenient, right, right, right".
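Steps 520-3 and 520-4 can be sketched at the string level. The sketch assumes clauses are delimited by a single separator character and that the insertion range is the whole clause containing the target character, as in the example above; `move_char_within_clause` is a hypothetical name. Character-level reordering is most natural for scripts where a character is a meaningful unit, so the Latin-letter example is only a stand-in.

```python
import random


def move_char_within_clause(sentence: str, char_index: int, rng: random.Random, sep: str = ",") -> str:
    """Delete one character and reinsert it somewhere inside its own clause."""
    if sentence[char_index] == sep:
        return sentence  # separators are not moved in this sketch
    # Locate the clause boundaries around the target character.
    start = sentence.rfind(sep, 0, char_index) + 1  # 0 when no separator precedes
    end = sentence.find(sep, char_index)
    if end == -1:
        end = len(sentence)
    clause = sentence[start:end]
    local = char_index - start
    rest = clause[:local] + clause[local + 1:]
    new_local = rng.randint(0, len(rest))  # inclusive; may equal the old position
    new_clause = rest[:new_local] + clause[local] + rest[new_local:]
    return sentence[:start] + new_clause + sentence[end:]


c_sentence = move_char_within_clause("abc,def,ghi", 5, random.Random(2))
```

Only the clause containing the target character changes; all other clauses and all separators stay in place.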
Step 522: perform symbol unit conversion processing on the C-th written language sentence to obtain a D2-th written language sentence.
Specifically, after the character unit conversion processing has been performed on the written language sentence to obtain the C-th written language sentence, the symbol unit conversion processing is performed on the C-th written language sentence through the following steps 522-1 to 522-8:
Step 522-1: determine the delete symbol conversion processing probability corresponding to the delete symbol conversion processing strategy among the symbol conversion processing strategies of the C-th written language sentence.
Specifically, the delete symbol conversion processing strategy refers to a processing strategy for deleting symbols from a written language sentence. Accordingly, the delete symbol conversion processing probability refers to a preset probability of executing the delete symbol conversion processing strategy. It may likewise be preset according to practical experience or spoken language expression habits, for example 8% or 12%, and is not limited herein. When the delete symbol conversion processing probability is 8%, the delete symbol conversion processing strategy is executed for the written language sentence with a probability of 8%.
Based on this, the delete symbol conversion processing probability corresponding to the delete symbol conversion processing strategy among the symbol conversion processing strategies of the C-th written language sentence is determined to be 8%; that is, the probability of executing the delete symbol conversion processing strategy for the C-th written language sentence is 8%.
Step 522-2: based on the delete symbol conversion processing probability, determine whether to execute the delete symbol conversion processing strategy for the C-th written language sentence.
Specifically, the manner of determining whether to execute the delete symbol conversion processing strategy for the C-th written language sentence based on the delete symbol conversion processing probability is similar to the manner of determining whether to execute the copy clause conversion processing strategy for the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, and details are not repeated here.
In a specific implementation, if it is determined to execute the delete symbol conversion processing strategy, the following step 522-3 is executed; if it is determined not to execute the delete symbol conversion processing strategy, the C-th written language sentence is directly used as the D1-th written language sentence, and the following step 522-5 is executed.
Based on this, it is assumed that, based on the delete symbol conversion processing probability of 8%, it is determined to execute the delete symbol conversion processing strategy for the C-th written language sentence.
Step 522-3: in the case where it is determined to execute the delete symbol conversion processing strategy, perform symbol sampling on the C-th written language sentence according to a preset symbol sampling rule to obtain a target punctuation mark in the C-th written language sentence. Based on this, in the case where it is determined to execute the delete symbol conversion processing strategy, symbol sampling is performed randomly on the C-th written language sentence, and the target punctuation mark in the C-th written language sentence is obtained as the comma after the first "the speed is very fast" clause.
Step 522-4: delete the target punctuation mark from the C-th written language sentence to obtain the converted D1-th written language sentence.
Specifically, the comma after the first "the speed is very fast" clause is deleted from the C-th written language sentence, and the D1-th written language sentence after deletion is obtained: "The computer is used by me, Java, Java, the speed is very fast the speed is very fast, and it is very convenient, right, right, right".
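Steps 522-3 and 522-4 can be sketched as sampling one punctuation position and removing it. The punctuation inventory `PUNCTUATION` and the function name `delete_random_punctuation` are assumptions of this sketch, and random sampling stands in for the preset symbol sampling rule.

```python
import random

# Assumed punctuation inventory covering common ASCII and full-width marks.
PUNCTUATION = set(",.;:!?，。；：！？")


def delete_random_punctuation(sentence: str, rng: random.Random) -> str:
    """Remove one randomly sampled punctuation mark, if any exists."""
    positions = [i for i, ch in enumerate(sentence) if ch in PUNCTUATION]
    if not positions:
        return sentence  # no punctuation to delete
    i = rng.choice(positions)
    return sentence[:i] + sentence[i + 1:]


d1 = delete_random_punctuation("the speed is very fast, and it is very convenient", random.Random(0))
```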
Step 522-5: determine the add symbol conversion processing probability corresponding to the add symbol conversion processing strategy among the symbol conversion processing strategies of the D1-th written language sentence.
The add symbol conversion processing strategy refers to a processing strategy for adding symbols to a written language sentence. Accordingly, the add symbol conversion processing probability refers to a preset probability of executing the add symbol conversion processing strategy. It may likewise be preset according to practical experience or spoken language expression habits, for example 2% or 5%, and is not limited herein. When the add symbol conversion processing probability is 2%, the add symbol conversion processing strategy is executed for the written language sentence with a probability of 2%.
Based on this, the add symbol conversion processing probability corresponding to the add symbol conversion processing strategy of the D1-th written language sentence is determined to be 2%.
Step 522-6: based on the add symbol conversion processing probability, determine whether to execute the add symbol conversion processing strategy for the D1-th written language sentence.
Specifically, the manner of determining whether to execute the add symbol conversion processing strategy for the D1-th written language sentence based on the add symbol conversion processing probability is similar to the manner of determining whether to execute the copy clause conversion processing strategy for the written language sentence based on the copy clause conversion processing probability; reference may be made to that implementation, and details are not repeated here.
In a specific implementation, if it is determined to execute the add symbol conversion processing strategy, the following step 522-7 is executed; if it is determined not to execute the add symbol conversion processing strategy, the D1-th written language sentence is directly used as the D2-th written language sentence, and the following step 524 is executed.
Based on this, it is assumed that, based on the add symbol conversion processing probability of 2%, it is determined to execute the add symbol conversion processing strategy for the D1-th written language sentence.
Step 522-7: When it is determined that the add symbol conversion processing strategy is to be executed, symbol clause sampling is performed on the D1-th written language sentence according to a preset symbol clause sampling rule to obtain a target symbol clause in the D1-th written language sentence.
Based on this, the clauses with the largest number of characters sampled from the D1-th written language sentence are "fast speed" and "very convenient"; one of these two clauses is then selected at random as the target symbol clause, which is "fast speed".
Step 522-8: Randomly insert a preset punctuation mark into the target symbol clause to obtain a converted D2-th written language sentence, and use the D2-th written language sentence as the target spoken sentence.
Specifically, the preset punctuation mark is "!". When the preset punctuation mark "!" is inserted into the target symbol clause "fast speed", the converted D2-th written language sentence is obtained as: 'the computer is used by me, the Java plug and the Java plug are fast and fast! And is very convenient, to pair'. The D2-th written language sentence is used as the target spoken sentence.
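As an illustrative sketch only, steps 522-7 and 522-8 may be expressed as follows. The delimiter set `DELIMITERS`, the function name `insert_punctuation`, and the interpretation of the "preset symbol clause sampling rule" as "pick at random among the clauses with the most characters" are assumptions drawn from the example above and are not a definitive implementation.

```python
import random
import re

# Assumed clause delimiters; the capturing group makes re.split keep them.
DELIMITERS = r"([,.;:!?])"


def insert_punctuation(sentence: str, mark: str, rng: random.Random) -> str:
    """Insert `mark` at a random position inside one of the longest clauses."""
    # After splitting, clauses sit at even indices and delimiters at odd ones.
    parts = re.split(DELIMITERS, sentence)
    clause_ids = [i for i in range(0, len(parts), 2) if parts[i].strip()]
    if not clause_ids:
        return sentence
    longest = max(len(parts[i].strip()) for i in clause_ids)
    candidates = [i for i in clause_ids if len(parts[i].strip()) == longest]
    target = rng.choice(candidates)  # the target symbol clause
    clause = parts[target]
    position = rng.randint(0, len(clause))  # random insertion offset
    parts[target] = clause[:position] + mark + clause[position:]
    return "".join(parts)


print(insert_punctuation("it is fast, it is easy.", "!", random.Random(0)))
```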
Step 524: Combine the D2-th written language sentences corresponding to the written language sentences in each text to be processed to obtain the spoken text corresponding to each text to be processed.
Specifically, after steps 516 to 522 are executed on each of the n written language sentences contained in the text T to be processed, the target spoken sentence corresponding to each written language sentence in the text T is obtained. The n target spoken sentences are then combined in the order of the n written language sentences in the text T to obtain the spoken text corresponding to the text T. The spoken text includes the target spoken sentence corresponding to written language sentence 1: 'the computer is used by me, the Java plug and the Java plug are fast and fast! And is very convenient, to pair'.
Step 526: abnormal information in the spoken text is identified.
Specifically, the abnormal information in the target spoken sentence corresponding to written language sentence 1 in the spoken text is identified as "!".
Step 528: Perform data cleaning on the spoken text according to the abnormal information to obtain the cleaned spoken text.
In practical applications, performing data cleaning on the spoken text according to the abnormal information may consist of filtering out or adjusting the abnormal information identified in the spoken text.
Specifically, based on the abnormal information "!", data cleaning is performed on the target spoken sentence corresponding to written language sentence 1 in the spoken text, and the cleaned target spoken sentence becomes: 'the computer is used by me, the Java plug and the Java plug are fast! And is very convenient, to pair'.
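As an illustrative sketch only, steps 526 and 528 may be expressed as follows. The embodiment does not define what counts as abnormal information; the rule used here (a run of the same punctuation mark repeated two or more times, collapsed to a single mark) is purely an assumption for illustration, as are the names `REPEATED`, `identify_abnormal`, and `clean_spoken_text`.

```python
import re
from typing import List

# Assumed rule: a punctuation mark immediately repeated is abnormal.
REPEATED = re.compile(r"([,.;:!?])\1+")


def identify_abnormal(text: str) -> List[str]:
    """Return every run of repeated punctuation found in the text."""
    return [match.group(0) for match in REPEATED.finditer(text)]


def clean_spoken_text(text: str) -> str:
    """Adjust each abnormal run by collapsing it to a single mark."""
    return REPEATED.sub(r"\1", text)


spoken = "so fast!! and easy.."
print(identify_abnormal(spoken))
print(clean_spoken_text(spoken))
```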
Step 530: Construct a sample corpus based on the correspondence between the written language text and the back-translated written language text, on the one hand, and the cleaned spoken text, on the other.
In practical applications, a corresponding cleaned spoken text can be obtained for each written language text or back-translated written language text. Therefore, a large number of written language texts can be obtained, and steps 502 to 528 are performed on each written language text to obtain the back-translated written language text corresponding to each written language text and the cleaned spoken texts corresponding to these texts. Each written language text and the cleaned spoken text corresponding to it are combined into a sample corpus pair, and the sample corpus pairs are combined into the sample corpus.
Further, since each conversion processing strategy has its own conversion processing probability, whether a given conversion processing strategy is executed is not fixed from one execution of steps 516 to 522 to the next, so the spoken text generated by each execution is likely to differ. Based on this, steps 516 to 522 can be executed multiple times for each text to be processed to generate a plurality of spoken texts for that text, thereby further expanding the sample corpus.
Specifically, m written language texts in the sales field are obtained (including the written language text T). The back-translation processing of steps 504 to 512 is performed on each written language text to obtain m back-translated written language texts. The m written language texts and the m back-translated written language texts are used as texts to be processed, and steps 514 to 528 are performed to obtain m cleaned spoken texts corresponding to the m written language texts and m cleaned spoken texts corresponding to the m back-translated written language texts. The sample corpus is then constructed by using the m written language texts as training samples with their m cleaned spoken texts as sample labels, and the m back-translated written language texts as training samples with their m cleaned spoken texts as sample labels.
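As an illustrative sketch only, the construction of the sample corpus in step 530 may be expressed as follows. The callables `back_translate`, `to_spoken`, and `clean` are hypothetical stand-ins for steps 504 to 512, 514 to 524, and 526 to 528 respectively, and the `runs` parameter reflects the option of executing the conversion several times per text; none of these names come from the embodiment.

```python
from typing import Callable, List, Tuple


def build_sample_corpus(
    written_texts: List[str],
    back_translate: Callable[[str], str],  # hypothetical: steps 504 to 512
    to_spoken: Callable[[str], str],       # hypothetical: steps 514 to 524
    clean: Callable[[str], str],           # hypothetical: steps 526 to 528
    runs: int = 1,
) -> List[Tuple[str, str]]:
    """Pair each text to be processed with its cleaned spoken text(s)."""
    corpus = []
    for written in written_texts:
        # Both the original and the back-translated text are processed.
        for source in (written, back_translate(written)):
            for _ in range(runs):
                corpus.append((source, clean(to_spoken(source))))
    return corpus
```

With m written language texts and `runs` executions per text, the sketch yields 2 × m × `runs` sample corpus pairs; since the conversion is probabilistic in the embodiment, repeated runs would generally produce differing spoken texts.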
Step 532: Train the initial written language rewriting model on the sample corpus until a written language rewriting model satisfying a second training stop condition is obtained.
Specifically, the initial written language rewriting model is trained on the sample corpus, and training is stopped when the preset i iterations have been completed, yielding the written language rewriting model M1.
In addition, with lower values set for the conversion processing probabilities in steps 516 to 522, steps 516 to 528 are performed again on the written language sentences contained in the m written language texts to obtain m cleaned spoken texts SST corresponding to the m written language texts. A basic sample corpus is then constructed with the m written language texts as training samples and the corresponding m cleaned spoken texts SST as sample labels. The initial written language rewriting model is trained on the basic sample corpus until a written language conversion model M2 satisfying a first training stop condition is obtained.
In a specific implementation, the first training stop condition is the stop condition for training the initial written language rewriting model on the basic sample corpus. The first training stop condition may be the same as or different from the second training stop condition, which is not limited herein. The written language conversion model can be understood as a trained model that performs light written-language rewriting of spoken text. Because the written language conversion model is trained on simpler sample corpora than the written language rewriting model, it can be used to handle light sentence rewriting.
Step 534: and acquiring a target spoken language text.
Specifically, a target spoken text T4 is obtained. The target spoken text T4 may be any spoken text in the sales field.
Step 536: Perform text classification on the target spoken text through a text classification model to obtain the text type of the target spoken text.
The text classification model refers to a pre-trained model for classifying spoken texts. In practical applications, the text classification model may be a CNN (convolutional neural network), RNN (recurrent neural network), LSTM (long short-term memory network), FastText, TextCNN, or HAN model, or the like, which is not limited herein.
In practical application, a large number of spoken texts can be obtained, and the spoken texts are labeled to obtain a text label corresponding to each spoken text, where the text label includes: invalid text type, fuzzy text type, standard text type, etc., without limitation. And then constructing a training sample through the spoken text and the text label corresponding to the spoken text, and training an initial text classification model through the training sample to obtain the trained text classification model.
Specifically, the target spoken language text T4 is subjected to text classification by a text classification model trained in advance, and a text type corresponding to the target spoken language text T4 output by the text classification model is obtained.
Step 538: and deleting the target spoken text in the case that the text type is an invalid text type.
Specifically, if the text type is an invalid text type, the target spoken text T4 may be deleted.
Step 540: and when the text is classified into a standard text type, inputting the target spoken text into the written language rewriting model to rewrite the written language, and obtaining the target written language text output by the written language rewriting model.
Specifically, assuming that the text type is a standard text type, the target spoken text T4 is input to the written language rewrite model M1 and written language rewrite is performed, and the target written language text T5 output by the written language rewrite model M1 is obtained.
Step 542: and under the condition that the text type is the fuzzy text type, inputting the target spoken text into the written language conversion model to rewrite the written language, and obtaining the converted written language text output by the written language conversion model.
Specifically, assuming that the text type is a fuzzy text type, the target spoken text T4 is input to the written-language conversion model M2 and written-language rewriting is performed, and the converted written-language text T6 output by the written-language conversion model M2 is obtained.
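As an illustrative sketch only, the routing in steps 536 to 542 may be expressed as follows. The callables `classify`, `rewrite_model`, and `conversion_model` are hypothetical stand-ins for the text classification model, the written language rewriting model M1, and the written language conversion model M2; the string labels for the text types are likewise assumptions.

```python
from typing import Callable, Optional


def process_spoken_text(
    text: str,
    classify: Callable[[str], str],          # hypothetical text classifier
    rewrite_model: Callable[[str], str],     # hypothetical stand-in for M1
    conversion_model: Callable[[str], str],  # hypothetical stand-in for M2
) -> Optional[str]:
    """Route the target spoken text according to its text type."""
    text_type = classify(text)
    if text_type == "invalid":
        return None  # step 538: the target spoken text is deleted
    if text_type == "standard":
        return rewrite_model(text)  # step 540
    if text_type == "fuzzy":
        return conversion_model(text)  # step 542
    raise ValueError(f"unknown text type: {text_type}")
```

Returning `None` for an invalid text is one way to represent deletion; a caller could simply skip such results.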
In summary, the text processing method provided in the embodiment of the present application obtains back-translated written language text by performing back-translation processing on the written language text, so that the original written language text is expanded with the back-translated written language text. On this basis, conversion processing at the clause level, word level, character level, and symbol level is performed on the written language text according to preset conversion processing probabilities, so that the spoken text corresponding to the written language text is further expanded. The expanded written language text and the spoken text are then combined to generate the sample corpus. In this way, the sample corpus of the written language rewriting model is generated automatically and enriched, the efficiency of generating the sample corpus is improved, and the enriched sample corpus indirectly improves the rewriting accuracy of the written language rewriting model.
Corresponding to the above method embodiment, the present application further provides a text processing apparatus embodiment, and fig. 6 shows a schematic structural diagram of the text processing apparatus provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:
an obtaining module 602 configured to obtain a target spoken language text;
a classification module 604, configured to classify the target spoken language text to obtain a text type corresponding to the target spoken language text;
a selecting module 606 configured to select a corresponding written language rewriting model according to the standard text type when the text type is the standard text type;
a processing module 608 configured to input the target spoken language text into the written language rewriting model for processing, and obtain a target written language text corresponding to the target spoken language text; the written language rewriting model is trained on written language text and on spoken language text obtained by performing back-translation and conversion processing on the written language text.
Optionally, the text processing apparatus further includes:
a model selection module configured to select a corresponding written language conversion model according to the fuzzy text type when the text type is the fuzzy text type;
and to input the target spoken language text into the written language conversion model for processing to obtain a converted written language text corresponding to the target spoken language text; the written language conversion model is trained on written language text and on basic spoken language text obtained by performing conversion processing on the written language text.
Optionally, the training of the written language conversion model is implemented by operating the following modules:
a first acquisition module configured to acquire written language text;
the first conversion module is configured to perform conversion processing of sentence composition units on the written language text to obtain a basic spoken language text;
the first construction module is configured to construct a basic sample corpus based on the corresponding relation between the written language text and the basic spoken language text;
a first training module configured to train an initial written language conversion model through the basic sample corpus until the written language conversion model satisfying a first training stop condition is obtained.
Optionally, the classification module 604 is further configured to:
inputting the target spoken language text into a text classification model for classification processing to obtain a text type corresponding to the target spoken language text; the training of the text classification model is realized by operating the following modules:
a sample acquisition module configured to acquire a sample spoken language text and the semantic definition tag corresponding to the sample spoken language text;
a construct sample pair module configured to construct a training sample pair based on the sample spoken text and the semantic definition tags;
and the model training module is configured to perform model training on the initial text classification model through the training sample pair until the text classification model meeting the classification training stopping condition is obtained.
Optionally, the processing module 608 is further configured to:
carrying out sentence splitting processing on the target spoken language text to obtain a sentence sequence contained in the target spoken language text;
sequentially inputting the oral sentence units in the sentence sequence into the coding layer of the written language rewriting model for coding processing to obtain sentence characteristic vectors and vocabulary vectors corresponding to the oral sentence units, wherein the vocabulary vectors are obtained by mapping the oral sentence units and the vocabulary;
and calculating a vector product between the statement feature vector and the word list vector, and inputting the vector product into a decoding layer of the written language rewriting model for decoding to obtain a target written language text corresponding to the target spoken language text.
Optionally, the written language rewrite model is trained by operating the following modules:
a second acquisition module configured to acquire written language text;
a back-translation module configured to obtain a back-translated written language text corresponding to the written language text by performing back-translation processing on the written language text;
a second conversion module configured to perform conversion processing of sentence component units on the written language text and the back-translated written language text respectively to obtain a spoken language text;
a second construction module configured to construct a sample corpus based on the correspondence between the written language text and the back-translated written language text, on the one hand, and the spoken language text, on the other;
and the second training module is configured to train the initial written language rewriting model through the sample corpus until a written language rewriting model meeting a second training stop condition is obtained.
Optionally, the back-translation module includes:
a translation sub-module configured to translate the written language text into a translated written language text in a preset language;
a back-translation sub-module configured to translate the translated written language text into the target language to which the written language text belongs, and obtain an initial back-translated written language text;
and a replacement sub-module configured to replace, with the key words in the written language text, the target key words corresponding to those key words in the initial back-translated written language text, to obtain the back-translated written language text.
Optionally, the back-translation module further includes:
a part-of-speech analysis sub-module configured to identify, by performing part-of-speech analysis on the written language text, key words in the written language text whose part of speech is a preset part of speech;
a marking sub-module configured to apply position marks to the positions of the key words in the written language text;
accordingly, the replacement sub-module is further configured to:
replace, with the key words, the target key words corresponding to the position marks in the initial back-translated written language text, to obtain the back-translated written language text.
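As an illustrative sketch only, the back-translation with key word replacement described above may be expressed as follows. The embodiment does not specify how position marks are carried through translation; this sketch assumes, for illustration, that key words are wrapped in inline tags of the form `<k0>…</k0>` which the translators preserve. The callables `to_pivot` and `to_source` are hypothetical translators, and the key words are passed in directly rather than derived by part-of-speech analysis.

```python
import re
from typing import Callable, List


def mark_keywords(text: str, keywords: List[str]) -> str:
    """Wrap the first occurrence of each key word in a numbered tag."""
    marked = text
    for index, keyword in enumerate(keywords):
        marked = marked.replace(keyword, f"<k{index}>{keyword}</k{index}>", 1)
    return marked


def restore_keywords(round_trip: str, keywords: List[str]) -> str:
    """Replace whatever ended up inside each tag with the original key word."""
    return re.sub(
        r"<k(\d+)>.*?</k\1>",
        lambda match: keywords[int(match.group(1))],
        round_trip,
    )


def back_translate(
    text: str,
    keywords: List[str],
    to_pivot: Callable[[str], str],   # hypothetical: source -> preset language
    to_source: Callable[[str], str],  # hypothetical: preset language -> source
) -> str:
    marked = mark_keywords(text, keywords)
    return restore_keywords(to_source(to_pivot(marked)), keywords)
```

The intent matches the replacement sub-module: wording around the key words may change during the round trip, while the key words themselves are restored from the original written language text.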
Optionally, the sentence component unit comprises at least one of: clause unit, word unit, character unit and symbol unit.
Optionally, in a case that the sentence forming unit is a clause unit, the second converting module includes:
the first identification submodule is configured to perform sentence identification on the written language text to be processed to obtain written language sentences contained in the written language text to be processed;
a clause conversion submodule configured to perform clause unit conversion processing on the written language sentence to obtain a converted written language sentence;
a first determination module configured to determine a spoken text based on the converted written language sentence.
Optionally, the clause conversion sub-module is further configured to:
performing clause sampling on the written sentence according to a preset clause sampling rule to obtain a target clause in the written sentence; converting the target clause in the written sentence to obtain a converted written sentence; and/or
Determining clause position probability distribution corresponding to preset clauses contained in a preset clause set; determining a target preset clause and a clause adding position corresponding to the target preset clause in the preset clauses based on the clause position probability distribution; and adding the target preset clause into the written language sentence according to the clause adding position to obtain the converted written language sentence.
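As an illustrative sketch only, adding a preset clause according to a clause position probability distribution may be expressed as follows. The representation of the distribution as a mapping from (preset clause, position label) pairs to probabilities, the position labels "start", "middle", and "end", and the function name `add_preset_clause` are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple


def add_preset_clause(
    clauses: List[str],
    distribution: Dict[Tuple[str, str], float],
    rng: random.Random,
) -> List[str]:
    """Sample a (preset clause, position) pair and add the clause there."""
    options = list(distribution)
    weights = [distribution[option] for option in options]
    clause, position = rng.choices(options, weights=weights, k=1)[0]
    if position == "start":
        index = 0
    elif position == "end":
        index = len(clauses)
    else:  # "middle" and any other label fall back to the midpoint
        index = len(clauses) // 2
    return clauses[:index] + [clause] + clauses[index:]


# Hypothetical distribution over filler clauses and their positions.
distribution = {("you know", "start"): 0.5, ("that's right", "end"): 0.5}
print(add_preset_clause(["it is fast", "it is easy"], distribution, random.Random(0)))
```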
Optionally, the clause conversion sub-module is further configured to:
copying the target clause to obtain a copied target clause, and inserting the copied target clause into the written language sentence according to a preset clause insertion position to obtain a converted written language sentence; and/or
deleting the target clause in the written language sentence; inserting the target clause into the deleted written language sentence according to a preset clause insertion rule to obtain a converted written language sentence; and/or
Carrying out syntactic analysis on the target clause to obtain a syntactic structure corresponding to the target clause; and converting the target clause according to the target syntax structure corresponding to the syntax structure to obtain the converted written language sentence.
Optionally, in a case that the sentence component unit is a word unit, the second conversion module includes:
the second identification submodule is configured to perform sentence identification on the written language text to be processed to obtain written language sentences contained in the written language text to be processed;
the distribution determining submodule is configured to determine word position probability distribution corresponding to preset words contained in the preset word set;
the word adding sub-module is configured to determine a target preset word and a word adding position corresponding to the target preset word in the preset words according to the word position probability distribution, insert and add the target preset word into the written language sentence according to the word adding position, and obtain the converted written language sentence;
a second determination module configured to determine a spoken text based on the converted written language sentence.
Optionally, in a case that the sentence component unit is a word unit, the second conversion module is further configured to:
performing sentence recognition on the written language text to be processed to obtain written language sentences contained in the written language text;
performing word sampling on words in the written language sentences according to preset word sampling rules to obtain target words in the written language sentences;
deleting the target words in the written sentence, and inserting the target words into a preset insertion range corresponding to the target words in the deleted written sentence to obtain a converted written sentence;
determining spoken text based on the converted written language sentence.
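As an illustrative sketch only, deleting a sampled target word and reinserting it within a preset insertion range may be expressed as follows. Uniform sampling of the target word, the `max_shift` parameter that bounds the insertion range around the original position, and the function name `move_word` are assumptions made for illustration.

```python
import random
from typing import List


def move_word(words: List[str], rng: random.Random, max_shift: int = 2) -> List[str]:
    """Delete one sampled word and reinsert it near its original position."""
    if len(words) < 2:
        return list(words)
    index = rng.randrange(len(words))  # sample the target word
    word = words[index]
    rest = words[:index] + words[index + 1:]  # sentence after deletion
    # Assumed insertion range: at most `max_shift` slots from the old position.
    low = max(0, index - max_shift)
    high = min(len(rest), index + max_shift)
    new_index = rng.randint(low, high)
    return rest[:new_index] + [word] + rest[new_index:]


print(move_word("the speed is very fast".split(), random.Random(0)))
```

The sampled insertion slot may coincide with the original position, in which case the sentence is returned unchanged.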
Optionally, the second conversion module further includes:
a word copying sub-module configured to copy the target preset words added in the converted written language sentences to obtain copied words;
a word inserting sub-module configured to insert the copied words into the converted written language sentences according to preset word insertion rules to obtain the inserted written language sentences;
a third determination module configured to determine the spoken text based on the inserted written language sentence.
Optionally, in a case that the sentence component unit is a character unit, the second conversion module is further configured to:
performing sentence recognition on the written language text to be processed to obtain written language sentences contained in the written language text to be processed;
carrying out character sampling on characters in the written language sentence according to a preset character sampling rule to obtain target characters in the written language sentence;
deleting the target characters in the written language sentence, and inserting the target characters into a preset character insertion range corresponding to the target characters in the deleted written language sentence to obtain a converted written language sentence;
determining spoken text based on the converted written language sentence.
Optionally, in a case that the sentence component unit is a symbol unit, the second conversion module is further configured to:
performing sentence recognition on the written language text to be processed to obtain written language sentences contained in the written language text to be processed; symbol sampling is carried out on the written sentence according to a preset symbol sampling rule, a target punctuation mark in the written sentence is obtained, the target punctuation mark is deleted from the written sentence, and the converted written sentence is obtained; determining a spoken text based on the converted written language sentence; and/or
Performing sentence recognition on the written language text to be processed to obtain written language sentences contained in the written language text to be processed; carrying out symbol clause sampling on the written language sentence according to a preset symbol clause sampling rule to obtain a target symbol clause in the written language sentence, and inserting a preset punctuation symbol into the target symbol clause to obtain a converted written language sentence; determining spoken text based on the converted written language sentence.
Optionally, the second conversion module is further configured to:
determining conversion processing probability corresponding to the conversion processing strategy of the written language text to be processed;
determining a target conversion processing strategy to be executed in the conversion processing strategies based on the conversion processing probability;
and performing sentence component unit conversion processing on the written language text by executing the target conversion processing strategy to obtain the spoken language text corresponding to the written language text to be processed.
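As an illustrative sketch only, selecting and executing conversion processing strategies according to their conversion processing probabilities may be expressed as follows. The strategy names, the independent draw per strategy, and the function names `select_strategies` and `apply_strategies` are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List


def select_strategies(probabilities: Dict[str, float], rng: random.Random) -> List[str]:
    """Select each strategy independently with its own probability."""
    return [name for name, p in probabilities.items() if rng.random() < p]


def apply_strategies(
    sentence: str,
    strategies: Dict[str, Callable[[str], str]],
    probabilities: Dict[str, float],
    rng: random.Random,
) -> str:
    """Execute the selected target strategies in order on the sentence."""
    for name in select_strategies(probabilities, rng):
        sentence = strategies[name](sentence)
    return sentence


# Hypothetical strategies with hypothetical probabilities.
strategies = {
    "delete_comma": lambda s: s.replace(",", "", 1),
    "add_filler": lambda s: "well, " + s,
}
probabilities = {"delete_comma": 0.05, "add_filler": 0.02}
print(apply_strategies("it is fast, and easy", strategies, probabilities, random.Random(0)))
```

Because each draw is independent, repeated calls on the same sentence can yield different spoken variants, which is what allows the sample corpus to be expanded by repeated execution.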
Optionally, the text processing apparatus further includes:
an identification information module configured to identify abnormal information in the spoken text;
the cleaning module is configured to perform data cleaning on the spoken language text according to the abnormal information to obtain a cleaned spoken language text;
and a sample corpus building module configured to build a sample corpus based on the correspondence between the written language text and the back-translated written language text, on the one hand, and the cleaned spoken language text, on the other.
Optionally, the second conversion module is further configured to:
performing sentence component unit conversion processing on the written language text to obtain a first spoken language text corresponding to the written language text;
performing sentence component unit conversion processing on the back-translated written language text to obtain a second spoken language text corresponding to the back-translated written language text;
and taking the first spoken text and the second spoken text as the spoken text.
Optionally, the text processing apparatus further includes:
a deletion module configured to delete the target spoken text if the text type is an invalid text type.
The text processing device provided by the embodiment of the application acquires the target spoken language text; classifies the target spoken language text to obtain a text type corresponding to the target spoken language text; and, when the text type is a standard text type, selects a corresponding written language rewriting model according to the standard text type, so that a written language rewriting model suited to the target spoken language text is selected according to its text type. The target spoken language text is then input into the written language rewriting model for processing to obtain a target written language text corresponding to the target spoken language text, so that written language rewriting is more targeted and its accuracy is improved. The written language rewriting model is trained on written language text and on spoken language text obtained by performing back-translation and conversion processing on the written language text. Because the written language text is preprocessed through back-translation and conversion processing, a large number of spoken text to written language text sample corpora are provided for model training, which reduces the difficulty of training the model, avoids the time and labor consumed by manually collecting and processing large amounts of text data, and saves time and labor costs.
The above is a schematic scheme of a text processing apparatus of the present embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the text processing apparatus can be referred to the description of the technical solution of the text processing method.
It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.
An embodiment of the present application further provides a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor, and when the processor executes the computer instructions, the steps of the text processing method are implemented.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text processing method belong to the same concept; for details not described in the technical solution of the computing device, reference may be made to the description of the technical solution of the text processing method.
An embodiment of the present application further provides a computer-readable storage medium, which stores computer instructions that, when executed by a processor, implement the steps of the text processing method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the text processing method belong to the same concept; for details not described in the technical solution of the storage medium, reference may be made to the description of the technical solution of the text processing method.
An embodiment of the present application further provides a chip, which stores computer instructions that, when executed by a processor, implement the steps of the text processing method.
The above is an illustrative scheme of a chip of the present embodiment. It should be noted that the technical solution of the chip and the technical solution of the text processing method belong to the same concept; for details not described in the technical solution of the chip, reference may be made to the description of the technical solution of the text processing method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, under legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, since according to the present application some steps may be performed in other orders or simultaneously. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules referred to are not necessarily all required by the present application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (15)

1. A method of text processing, comprising:
acquiring a target spoken language text;
inputting the target spoken language text into a text classification model for classification processing to obtain a predicted text type corresponding to the target spoken language text;
determining a rewriting model corresponding to the target spoken language text according to the predicted text type;
and inputting the target spoken language text into the rewriting model for written language rewriting to obtain a written language text corresponding to the target spoken language text.
2. The method of claim 1, wherein the determining a rewriting model corresponding to the target spoken language text according to the predicted text type comprises:
in a case where the predicted text type is a standard text type, selecting a written language rewriting model according to the standard text type;
wherein the inputting the target spoken language text into the rewriting model for written language rewriting to obtain the written language text corresponding to the target spoken language text comprises:
inputting the target spoken language text into the written language rewriting model for processing to obtain a target written language text corresponding to the target spoken language text, wherein the written language rewriting model is trained on written language text and on spoken language text obtained by performing retranslation and conversion processing on the written language text.
3. The method of claim 1, wherein the determining a rewriting model corresponding to the target spoken language text according to the predicted text type comprises:
in a case where the predicted text type is a fuzzy text type, selecting a written language conversion model according to the fuzzy text type;
wherein the inputting the target spoken language text into the rewriting model for written language rewriting to obtain the written language text corresponding to the target spoken language text comprises:
inputting the target spoken language text into the written language conversion model for processing to obtain a converted written language text corresponding to the target spoken language text, wherein the written language conversion model is trained on written language text and on basic spoken language text obtained by performing conversion processing on the written language text.
4. The method of claim 3, wherein the training of the written language conversion model comprises:
performing sentence component unit conversion processing on the written language text to obtain a basic spoken language text;
constructing a basic sample corpus based on the correspondence between the written language text and the basic spoken language text;
and training an initial written language conversion model on the basic sample corpus until a written language conversion model meeting a first training stop condition is obtained.
5. The method of claim 1, wherein the training of the text classification model comprises:
acquiring a sample spoken language text and a semantic definition label corresponding to the sample spoken language text;
constructing a training sample pair based on the sample spoken language text and the semantic definition label;
and performing model training on an initial text classification model using the training sample pair until a text classification model meeting a classification training stop condition is obtained.
6. The method of claim 2, wherein the inputting the target spoken language text into the written language rewriting model for processing to obtain a target written language text corresponding to the target spoken language text comprises:
performing sentence splitting processing on the target spoken language text to obtain a sentence sequence contained in the target spoken language text;
sequentially inputting the spoken sentence units in the sentence sequence into an encoding layer of the written language rewriting model for encoding processing to obtain sentence feature vectors and vocabulary vectors corresponding to the spoken sentence units, wherein the vocabulary vectors are obtained by mapping the spoken sentence units to a vocabulary;
and calculating a vector product between the sentence feature vector and the vocabulary vector, and inputting the vector product into a decoding layer of the written language rewriting model for decoding to obtain the target written language text corresponding to the target spoken language text.
7. The method of claim 2, wherein the training of the written language rewriting model on written language text and on spoken language text obtained by performing retranslation and conversion processing on the written language text comprises:
acquiring a written language text;
performing retranslation processing on the written language text to obtain a retranslated written language text corresponding to the written language text;
performing sentence component unit conversion processing on the written language text and on the retranslated written language text, respectively, to obtain a spoken language text;
constructing a sample corpus based on the correspondence between the written language text, the retranslated written language text, and the spoken language text;
and training an initial written language rewriting model on the sample corpus until a written language rewriting model meeting a second training stop condition is obtained.
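The corpus construction steps of claim 7 can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `build_sample_corpus`, `back_translate` and `to_spoken` are stand-ins for the retranslation and conversion processing, and the choice to pair each spoken variant with the written text it was derived from is an assumption, since the claim only requires a correspondence among the three texts. Model training itself is not shown.

```python
from typing import Callable, Iterable, List, Tuple

def build_sample_corpus(
    written_texts: Iterable[str],
    back_translate: Callable[[str], str],   # hypothetical retranslation step
    to_spoken: Callable[[str], str],        # hypothetical conversion step
) -> List[Tuple[str, str]]:
    """Return (spoken, written) training pairs."""
    corpus: List[Tuple[str, str]] = []
    for written in written_texts:
        retranslated = back_translate(written)
        # dict.fromkeys drops the duplicate when retranslation changes nothing,
        # while keeping the original order.
        for source in dict.fromkeys([written, retranslated]):
            corpus.append((to_spoken(source), source))
    return corpus

# Toy stand-ins, for illustration only.
def toy_back_translate(text: str) -> str:
    return text.replace("purchase", "buy")

def toy_to_spoken(text: str) -> str:
    return "um, " + text.rstrip(".").lower()

corpus = build_sample_corpus(
    ["We will purchase the parts."], toy_back_translate, toy_to_spoken
)
```

In this sketch one written sentence yields two training pairs, one from the original and one from its retranslated paraphrase, which is the data augmentation effect the retranslation step is there to provide.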
8. The method according to claim 7, wherein the performing retranslation processing on the written language text to obtain a retranslated written language text corresponding to the written language text comprises:
translating the written language text into a translated written language text in a preset language;
translating the translated written language text back into the target language to which the written language text belongs to obtain an initial retranslated written language text;
and replacing, with the key words in the written language text, the target key words that correspond to those key words in the initial retranslated written language text, to obtain the retranslated written language text.
9. The method of claim 8, wherein before the translating the written language text into a translated written language text in a preset language, the method further comprises:
identifying, by analyzing the parts of speech of the written language text, key words in the written language text whose part of speech is a preset part of speech;
marking the positions of the key words in the written language text;
wherein the replacing, with the key words in the written language text, the target key words that correspond to those key words in the initial retranslated written language text, to obtain the retranslated written language text comprises:
replacing, based on the position marks, the corresponding target key words in the initial retranslated written language text with the key words to obtain the retranslated written language text.
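As a rough illustration of claims 8 and 9, the following Python sketch marks key words by token position before the round trip and writes them back afterwards. All names (`mark_keywords`, `back_translate_with_keywords`, `round_trip`, `is_keyword`) are hypothetical. `is_keyword` stands in for the part-of-speech analysis, and the sketch assumes that the round trip preserves token positions, which a real translation system would generally not guarantee; a real system would need some form of word alignment.

```python
from typing import Callable, Dict, List

def mark_keywords(tokens: List[str], is_keyword: Callable[[str], bool]) -> Dict[int, str]:
    """Record the position of every key word (position mark -> key word)."""
    return {i: tok for i, tok in enumerate(tokens) if is_keyword(tok)}

def back_translate_with_keywords(
    text: str,
    round_trip: Callable[[str], str],     # hypothetical: target -> preset language -> target
    is_keyword: Callable[[str], bool],    # hypothetical stand-in for part-of-speech analysis
) -> str:
    tokens = text.split()
    marks = mark_keywords(tokens, is_keyword)
    initial = round_trip(text).split()    # initial retranslated text
    if len(initial) != len(tokens):
        # Assumption violated: positions no longer line up, so leave the text as is.
        return " ".join(initial)
    for position, keyword in marks.items():
        initial[position] = keyword       # restore the original key word
    return " ".join(initial)

# Toy stand-ins, for illustration only.
SWAPS = {"Alice": "Alicia", "released": "published"}

def toy_round_trip(text: str) -> str:
    return " ".join(SWAPS.get(word, word) for word in text.split())

def toy_is_keyword(token: str) -> bool:
    return token[:1].isupper()            # treat capitalized tokens as key words

result = back_translate_with_keywords(
    "Alice released Orion today", toy_round_trip, toy_is_keyword
)
# result == "Alice published Orion today"
```

The round trip is free to paraphrase ordinary words ("released" becomes "published"), while the marked key word that it distorted ("Alice" to "Alicia") is restored from the original text.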
10. The text processing method according to claim 4 or 7, wherein the sentence component unit includes at least one of: clause unit, word unit, character unit and symbol unit.
11. The method according to claim 7, wherein either of the written language text and the retranslated written language text is taken as a written language text to be processed, and performing the sentence component unit conversion processing on the written language text to be processed comprises:
determining conversion processing probabilities corresponding to conversion processing strategies for the written language text to be processed;
determining, based on the conversion processing probabilities, a target conversion processing strategy to be executed among the conversion processing strategies;
and performing sentence component unit conversion processing on the written language text to be processed by executing the target conversion processing strategy to obtain the spoken language text corresponding to the written language text to be processed.
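The probability-driven strategy selection of claim 11 might be sketched as follows. The three strategies (`insert_filler`, `repeat_word`, `drop_punctuation`), their probabilities, and the function name `to_spoken` are assumptions made for illustration; the claims do not enumerate concrete strategies. Each toy strategy acts on one kind of sentence component unit from claim 10 (word or symbol).

```python
import random
from typing import Callable, Dict, Tuple

Strategy = Callable[[str, random.Random], str]

def insert_filler(text: str, rng: random.Random) -> str:
    """Word unit: insert a filler word at a random position."""
    words = text.split()
    words.insert(rng.randrange(len(words) + 1), "um")
    return " ".join(words)

def repeat_word(text: str, rng: random.Random) -> str:
    """Word unit: repeat one randomly chosen word."""
    words = text.split()
    if not words:
        return text
    i = rng.randrange(len(words))
    return " ".join(words[: i + 1] + words[i:])

def drop_punctuation(text: str, rng: random.Random) -> str:
    """Symbol unit: remove punctuation marks."""
    return "".join(ch for ch in text if ch not in ",.;:!?")

def to_spoken(
    text: str,
    strategies: Dict[str, Tuple[Strategy, float]],  # name -> (strategy, probability)
    rng: random.Random,
) -> str:
    names = list(strategies)
    weights = [strategies[name][1] for name in names]
    # Pick the target strategy according to the conversion probabilities.
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return strategies[chosen][0](text, rng)

rng = random.Random(0)  # seeded only to make the example repeatable
strategies = {
    "filler": (insert_filler, 0.5),
    "repeat": (repeat_word, 0.3),
    "no_punct": (drop_punctuation, 0.2),
}
spoken = to_spoken("Please send the report, then call me.", strategies, rng)
```

Calling `to_spoken` repeatedly on the same written sentence yields different spoken variants, so a single written text can contribute several distinct training pairs.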
12. The method of claim 1, wherein after obtaining the predicted text type corresponding to the target spoken text, the method further comprises:
deleting the target spoken language text in a case where the predicted text type is an invalid text type.
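Taken together, claims 1 to 3 and 12 describe a classify-then-route flow, which the following minimal Python sketch illustrates. The enum `TextType`, the function `process_spoken_text`, and the toy classifier and rewriters are all hypothetical stand-ins; in the claimed method both the classifier and the rewriting models are trained models, not rules.

```python
from enum import Enum
from typing import Callable, Dict, Optional

class TextType(Enum):
    STANDARD = "standard"   # routed to the written language rewriting model
    FUZZY = "fuzzy"         # routed to the written language conversion model
    INVALID = "invalid"     # deleted

Rewriter = Callable[[str], str]

def process_spoken_text(
    text: str,
    classify: Callable[[str], TextType],
    rewriters: Dict[TextType, Rewriter],
) -> Optional[str]:
    predicted = classify(text)
    if predicted is TextType.INVALID:
        return None                      # claim 12: drop invalid text
    return rewriters[predicted](text)    # claims 2 and 3: pick model by type

# Toy stand-ins, for illustration only.
FILLERS = {"um", "uh"}

def toy_classifier(text: str) -> TextType:
    if not text.strip():
        return TextType.INVALID
    has_filler = any(word in FILLERS for word in text.split())
    return TextType.FUZZY if has_filler else TextType.STANDARD

def toy_rewrite(text: str) -> str:
    return text.strip().capitalize() + "."

def toy_convert(text: str) -> str:
    kept = [word for word in text.split() if word not in FILLERS]
    return " ".join(kept).capitalize() + "."

REWRITERS = {TextType.STANDARD: toy_rewrite, TextType.FUZZY: toy_convert}

print(process_spoken_text("we ship today", toy_classifier, REWRITERS))
print(process_spoken_text("um so we um ship today", toy_classifier, REWRITERS))
print(process_spoken_text("   ", toy_classifier, REWRITERS))
```

Keeping the rewriters in a dictionary keyed by predicted type mirrors the claim language, in which the rewriting model is determined according to the predicted text type.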
13. A text processing apparatus, comprising:
an acquisition module configured to acquire a target spoken language text;
a classification module configured to input the target spoken language text into a text classification model for classification processing to obtain a predicted text type corresponding to the target spoken language text;
a selection module configured to determine a rewriting model corresponding to the target spoken language text according to the predicted text type;
and a processing module configured to input the target spoken language text into the rewriting model for written language rewriting to obtain a written language text corresponding to the target spoken language text.
14. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the method of any one of claims 1 to 12.
15. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 12.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210590875.6A CN114880436A (en) 2022-03-16 2022-03-16 Text processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210257335.6A CN114357122A (en) 2022-03-16 2022-03-16 Text processing method and device
CN202210590875.6A CN114880436A (en) 2022-03-16 2022-03-16 Text processing method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202210257335.6A Division CN114357122A (en) 2022-03-16 2022-03-16 Text processing method and device

Publications (1)

Publication Number Publication Date
CN114880436A true CN114880436A (en) 2022-08-09

Family

ID=81094603

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210257335.6A Pending CN114357122A (en) 2022-03-16 2022-03-16 Text processing method and device
CN202210590875.6A Pending CN114880436A (en) 2022-03-16 2022-03-16 Text processing method and device

Country Status (1)

Country Link
CN (2) CN114357122A (en)

Also Published As

Publication number Publication date
CN114357122A (en) 2022-04-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination