CN116796724A - Method, apparatus and medium for natural language processing - Google Patents


Info

Publication number
CN116796724A
Authority
CN
China
Prior art keywords
training
word segmentation
data
model
transformation model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310309764.8A
Other languages
Chinese (zh)
Inventor
高德政
张璐
陶明
顾宝宝
尹顺顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Renyimen Technology Co ltd
Original Assignee
Shanghai Renyimen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Renyimen Technology Co ltd filed Critical Shanghai Renyimen Technology Co ltd
Priority to CN202310309764.8A priority Critical patent/CN116796724A/en
Publication of CN116796724A publication Critical patent/CN116796724A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure relate to a method for natural language processing, comprising: performing preprocessing on an acquired target corpus, thereby acquiring a preprocessed corpus; performing word segmentation processing on the preprocessed corpus to acquire word segmentation samples for a generative pre-trained transformer (GPT) model; distributing the acquired word segmentation samples of the same batch to different computing devices to perform training of the GPT model, thereby acquiring a gradient calculation result corresponding to each process; and, based on the acquired gradient calculation results, assigning different transformer layers of the GPT model, with respect to the same word segmentation sample, to different computing devices to perform training of the GPT model, thereby acquiring a trained multi-layer GPT model for generating a target natural language sequence based at least on the trained multi-layer GPT model.

Description

Method, apparatus and medium for natural language processing
Technical Field
Embodiments of the present disclosure relate generally to the field of natural language processing, and more particularly, to a method, apparatus, and medium for natural language processing.
Background
Natural language processing (NLP) is an important branch of artificial intelligence (AI) that aims to allow computers to understand and generate natural language, i.e., the language used by humans. Natural language processing has wide application in the field of online chat, such as intelligent replies, emotion analysis, dialog generation, and content auditing.
Online chat refers to real-time or non-real-time text communication, or other types of communication, over the internet or another communication network. Online chat may include, but is not limited to, social media, instant messaging, online forums, email, and the like. Online chat has the following characteristics:
Strong interactivity: the computer must be able to respond to user input in a timely manner and give appropriate output;
High diversity: online chat involves a variety of topics, scenes, and functions, requiring the computer to accommodate different needs and goals;
High randomness: text input by a user in online chat may contain grammar errors, spelling errors, punctuation errors, and the like, requiring the computer to tolerate and correct them;
Context sensitivity: there are logical, semantic, and emotional connections between user inputs and outputs in online chat, requiring the computer to maintain contextual consistency.
Although existing natural language processing technology has advanced in the field of online chat, the following problems and shortcomings remain:
Low training efficiency: existing natural language processing models usually require a large amount of data for training, and the training process takes a long time and consumes substantial resources;
Poor generation quality: existing natural language processing models may produce repetitive, redundant, incoherent, or unreasonable text sequences or other types of sequences;
Weak adaptability: when faced with new fields or tasks, existing natural language processing models may not support effective transfer learning or fine-tuning, and it is difficult to control the generated results.
In summary, conventional natural language processing technology has the following disadvantages: it has difficulty recognizing context, lacks active learning capability, and cannot process multiple languages.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a method for natural language processing. Based on the method and apparatus provided by the disclosure, language error correction can be realized, for example, identifying and correcting spelling and grammar errors and repairing improper punctuation marks and sentence breaks; expression optimization, for example, improving writing quality by adjusting vocabulary and sentence structure to make it more concise, refined, and accurate; and style optimization, for example, identifying and enhancing the author's personal writing style and voice, making it more vivid, attractive, and recognizable.
According to a first aspect of the present disclosure, there is provided a method for natural language processing, comprising: performing preprocessing on an acquired target corpus, thereby acquiring a preprocessed corpus; performing word segmentation processing on the preprocessed corpus to acquire word segmentation samples for a generative pre-trained transformer (GPT) model; distributing the acquired word segmentation samples of the same batch to different computing devices to perform training of the GPT model, thereby acquiring a gradient calculation result corresponding to each process; and, based on the acquired gradient calculation results, assigning different transformer layers of the GPT model, with respect to the same word segmentation sample, to different computing devices to perform training of the GPT model, thereby acquiring a trained multi-layer GPT model for generating a target natural language sequence based at least on the trained multi-layer GPT model.
According to a second aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the present disclosure.
In a third aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.
In one embodiment, generating the target natural language sequence based at least on the trained multi-layer GPT model comprises: inserting a control marker or type symbol into the trained multi-layer GPT model using a predefined word segmentation feature template, thereby directing the trained GPT model to generate the target natural language sequence in a controlled manner.
In one embodiment, performing preprocessing on the acquired target corpus includes: acquiring desensitized dialogue data within a predetermined time period on a target platform; performing data cleaning on emojis, pictures, non-text, and abnormal data; merging dialogue data in which one person sends multiple sentences; and performing preliminary segmentation of the merged dialogue data according to the interval time between messages, thereby obtaining a preprocessed corpus.
In one embodiment, performing word segmentation processing on the preprocessed corpus includes: segmenting the preprocessed corpus according to the maximum input length of the GPT model, thereby constructing a plurality of word segmentation fragments (tokens); tagging the constructed word segmentation fragments with the user's attribute features and session features according to the session content, thereby generating tagged data; inputting the tagged data into an audit model and, based on predetermined filtering conditions, cleaning out the data that meets those conditions, thereby obtaining clean word segmentation data; tagging the clean word segmentation data with topic features based on a topic model; and converting the word segmentation training data tagged with topic features into a memory-mapped file (mmap) data format, thereby reducing memory usage for very-large-scale data training.
In one embodiment, assigning different word segmentation samples of the same acquired batch to different computing devices to perform training of the GPT model comprises: dividing the acquired word segmentation of the same batch into a plurality of sub-batches, each sub-batch containing one or more word segmentation samples; assigning each sub-batch to a different computing device, wherein each computing device comprises one or more processors and a storage unit, the processors being central processing units, graphics processing units, or tensor processing units of different types or specifications that compute cooperatively; performing forward and backward propagation for each sub-batch using the GPT model on each computing device; and aggregating the gradients of each computing device and updating the GPT model parameters, thereby obtaining the gradient calculation result corresponding to each process.
In one embodiment, assigning different transformer layers of the GPT model, with respect to the same word segmentation sample, to different computing devices comprises: splitting the GPT model into a plurality of segments, each segment comprising one or more layers; assigning each segment to a different computing device, wherein each computing device includes one or more processors and storage units; passing different word segmentation samples of the same acquired batch to a first computing device and performing forward propagation on a first segment using those samples; passing the output of the first segment to a second computing device and performing forward propagation on the second segment using that output; repeatedly using the output of the current segment to perform forward propagation on the next segment, until all segments have performed forward propagation on all word segmentation samples; performing backward propagation on each segment using the gradient of the loss function from the following segment; and passing the gradient of each segment to the previous computing device and updating the parameters of the GPT model.
In one embodiment, the method further comprises: acquiring a pre-trained fine-tuned language model for generating natural language text; receiving a training data set generated by the GPT model and updating the parameters of the fine-tuned language model by minimizing a loss function; and performing language-specific fine-tuning based on the generated target natural language sequence, thereby generating text consistent with the topic features of the word segmentation corpus.
In one embodiment, inserting a particular marker or type symbol into the trained multi-layer GPT model further comprises: converting the acquired tagged topic features into feature tokens; concatenating the feature tokens with the word sequence input to the GPT model, inputting the result into the acquired multi-layer GPT model, and training, wherein during training the feature tokens do not participate in the loss function computation of the GPT model; and generating, based on the training result, a natural language sequence consistent with the feature tokens.
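As an illustrative sketch (not the patented implementation), the feature-token concatenation and loss masking described in this embodiment might look as follows; the integer token ids and the `-100` ignore-index convention are assumptions borrowed from common GPT training practice, not from the disclosure:

```python
# Sketch under assumptions: token ids are plain integers, and -100 marks
# positions excluded from the loss (the usual cross-entropy ignore_index).
IGNORE_INDEX = -100

def build_controlled_sample(feature_ids, text_ids):
    """Prepend topic/style feature tokens to the word sequence.

    The feature tokens steer generation but are masked out of the loss,
    so the model is never trained to predict them.
    """
    input_ids = list(feature_ids) + list(text_ids)
    labels = [IGNORE_INDEX] * len(feature_ids) + list(text_ids)
    return input_ids, labels

# Usage: two hypothetical feature tokens (e.g. <topic=travel>, <style=humor>)
inp, lab = build_controlled_sample([50001, 50002], [11, 12, 13])
# inp == [50001, 50002, 11, 12, 13]
# lab == [-100, -100, 11, 12, 13]
```

Because the masked positions contribute nothing to the loss, the generated sequence remains conditioned on the feature tokens without the model ever emitting them.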
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
Fig. 1 shows a schematic diagram of a system 100 for implementing a method for natural language processing in accordance with an embodiment of the present disclosure.
Fig. 2 illustrates a flow chart of a method 200 for natural language processing in accordance with an embodiment of the present disclosure.
FIG. 3 illustrates a schematic diagram of data parallelism in accordance with an embodiment of the present disclosure.
FIG. 4 shows a schematic diagram of model parallelism in accordance with an embodiment of the present disclosure.
Fig. 5 shows a model controlled schematic in accordance with an embodiment of the present disclosure.
Fig. 6 illustrates a flow chart of another method 600 of natural language processing data computation in accordance with an embodiment of the present disclosure.
Fig. 7 shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "comprising" and variations thereof as used herein mean open-ended inclusion, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
The text editing function based on natural language processing technology aims to improve the quality and readability of text by modifying sentence structure, changing the part of speech of words, adjusting wording, and the like.
Currently, there is much text editing software on the market, such as Microsoft Word, Google Docs, and Grammarly, which provide text correction, grammar checking, and natural language processing. In addition, there are text polishing tools such as Hemingway and ProWritingAid, which can help users identify long sentences, repeated words, and overused adverbs, and make suggestions to improve the smoothness and readability of text. However, these solutions have shortcomings. They have difficulty identifying context: in some cases, for long or complex sentences, the technology may not correctly recognize the context, resulting in erroneous modifications that change the meaning of the text. They lack active learning capability: most of the prior art operates on limited rules and models, lacks active learning and adaptation capability, and cannot process new or unknown language structures. They cannot process multiple languages: most current technology targets only a single language and cannot handle multilingual input, which limits its use in globalized applications.
The invention is based on natural language processing technology, which includes text analysis, grammar analysis, part-of-speech tagging, named entity recognition, emotion analysis, and other techniques. Through these techniques, the software can understand the structure of sentences, the meaning of words, the tone of sentences, and so on, and thereby make modification suggestions. Natural language processing technology is an artificial intelligence technology that converts natural language text into a form a computer can process, and then analyzes and processes it. Word vector representation is a technique commonly used in natural language processing that represents each word as a vector; word vectors can capture the semantic and grammatical relations among words, and therefore achieve good results in tasks such as text classification and emotion analysis. Deep learning models are common in natural language processing and include convolutional neural networks, recurrent neural networks, Transformers, and the like. These models can perform representation learning, classification, generation, and other tasks on text, and have made significant progress across natural language processing tasks.
Fig. 1 shows a schematic diagram of a system 100 for implementing a method for natural language processing in accordance with an embodiment of the present disclosure. As shown in fig. 1, system 100 includes a computing device 110, a natural language processing data management device 130, and a network 140. The computing device 110 and the natural language processing data management device 130 may exchange data over the network 140 (e.g., the internet).
The natural language processing data management device 130 may perform functions such as natural language processing computation. The natural language processing data management device 130 may also send the determined natural language processing data to the computing device 110. The natural language processing data management device 130 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, and general-purpose processing units such as CPUs. It may be, for example and without limitation: a desktop computer, laptop computer, netbook computer, tablet computer, web browser, e-book reader, personal digital assistant (PDA), or wearable computer (such as a smartwatch or activity tracker device), capable of reading and modifying Chinese data.
With respect to computing device 110, it is for example for receiving natural language processing data from natural language processing data management device 130 via network 140; natural language processing computations are implemented on a natural language processing data system. Computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, ASICs, and the like, as well as general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device 110. In some embodiments, the computing device 110 and the natural language processing data management device 130 may be integrated together or may be separate from each other. In some embodiments, computing device 110 includes, for example, a preprocessing module 112, a word segmentation module 114, a data parallelism module 116, and a pipeline parallelism module 118.
A preprocessing module 112, where the preprocessing module 112 is configured to obtain a target corpus and perform preprocessing on the corpus, thereby obtaining a preprocessed corpus.
A word segmentation module 114, where the word segmentation module 114 is configured to perform word segmentation processing on the obtained preprocessed corpus to obtain word segmentation for the language processing model.
The data parallel module 116 is configured to distribute different word segmentation samples of the same acquired batch to different devices to perform GPT model training, thereby acquiring a gradient calculation result corresponding to each process.
The pipeline parallel module 118 is configured to distribute different transformer layers of the GPT model, for the same word segmentation sample, to different devices to perform GPT model training based on the obtained gradient calculation results, thereby obtaining a multi-layer GPT model.
The present disclosure realizes the following technical means through deep learning and natural language processing techniques:
generating a pre-training transformation GPT model: the present disclosure uses a GPT language model trained based on billion-level models+billion-level token data to generate text. Such a model not only learns the grammar and structure of the language, but also understands the context, and thus can generate more fluent and natural text.
Generative Adversarial Networks (GANs): GANs are a deep learning technique that trains two neural networks, a generator and a discriminator. The generator tries to generate text that is as realistic as possible, and the discriminator judges whether the text is real. By continually optimizing the competition between the generator and the discriminator, GANs can generate highly realistic text.
Fine tuning of language model: the present disclosure uses a pre-trained language model, but it can also be adapted to specific text and tasks by fine tuning. For example, in a rendering function, text context and language habits may be better understood by fine-tuning to generate more desirable text.
Prompt-based control technique: the Prompt technique is a natural language processing technique that generates natural, fluent text by predicting the next word or phrase from the user's input. In the present disclosure, the Prompt technique may be used to optimize and improve the wording, grammar, and tone of text, thereby implementing functions such as language correction, article restructuring, language style improvement, vocabulary replacement, and text simplification, making text more natural, readable, accurate, and vivid.
In summary, the inventive and key technical means of the present disclosure mainly include the GPT language model, GANs, language model fine-tuning, and Prompt-based control techniques. The combination of these techniques can generate high-quality, natural, and fluent text and provide users with high-quality text polishing services.
Fig. 2 illustrates a flow chart of a method 200 for natural language processing in accordance with an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 shown in fig. 1, or at the electronic device 700 shown in fig. 7. It should be understood that method 200 may also include additional blocks not shown and/or that the blocks shown may be omitted, the scope of the disclosure being not limited in this respect.
The computing device 110 uses a GPT model, for example a GPT-3 model. The GPT model consists of multiple Transformer layers. GPT-3 retains only the Decoder structure of the Transformer, with some modifications to the Transformer Decoder: the original Decoder contains two Multi-Head Attention structures, while GPT retains only the Masked Multi-Head Attention.
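A minimal numpy sketch of the masking idea behind Masked Multi-Head Attention: the causal mask blocks attention to future positions before the softmax. The toy all-zero score matrix is an illustrative assumption; the full multi-head machinery is omitted:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Apply the causal mask before the attention softmax."""
    scores = np.where(mask, scores, -1e9)  # block future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

m = causal_mask(3)
attn = masked_softmax(np.zeros((3, 3)), m)
# Row 0 attends only to itself; row 2 spreads weight over all 3 positions.
```

This is the structural difference between the Decoder used by GPT and a bidirectional encoder: each token's representation depends only on earlier tokens.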
The training objective of the GPT model is as follows. Given a corpus without manual labels, W = {w_1, w_2, ..., w_n}, the language model is trained by maximum likelihood estimation of its parameters:

L(W) = Σ_i log P(w_i | w_{i-k}, ..., w_{i-1}; Θ)

where k is the size of the context window, and the parameters Θ are updated by stochastic gradient descent over sampled batches.

In the forward pass, the word embeddings W_e of the m input words are added to the position embeddings W_p, and the input to the GPT model is denoted h_0, where 0 denotes the input layer:

h_0 = U W_e + W_p

where U is the context matrix of the input tokens. h_0 is then passed sequentially through the layers of the Transformer Decoder:

h_l = transformer_block(h_{l-1}), l = 1, ..., N

where N is the number of Transformer layers. Finally, h_N is used to predict the probability of the next word:

P(u) = softmax(h_N W_e^T)
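The forward computation h_0 = U W_e + W_p → h_N → softmax can be sketched end to end with toy dimensions. The `transformer_block` stand-in below is an assumption that merely preserves shapes (a real decoder block with masked attention is omitted), so only the data flow, not the modeling, is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, seq_len = 50, 16, 4

W_e = rng.normal(size=(vocab, d_model)) * 0.02    # token embedding matrix
W_p = rng.normal(size=(seq_len, d_model)) * 0.02  # position embedding matrix

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(h):
    # Stand-in for a real masked-attention decoder block: any
    # shape-preserving transformation keeps the data flow visible.
    return h + np.tanh(h)

tokens = np.array([3, 7, 1, 9])
U = np.eye(vocab)[tokens]            # one-hot context matrix U
h = U @ W_e + W_p                    # h_0 = U W_e + W_p
for _ in range(3):                   # N = 3 decoder layers
    h = transformer_block(h)         # h_l = transformer_block(h_{l-1})
p = softmax(h @ W_e.T)               # P(u) = softmax(h_N W_e^T)
# p[i] is a distribution over the vocabulary for the next token at position i
```

Note the weight tying: the same embedding matrix W_e maps tokens in at the bottom and logits out at the top.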
At step 202, computing device 110 may perform preprocessing on the acquired target corpus, thereby acquiring a preprocessed corpus.
In one embodiment, the computing device 110 may acquire desensitized dialogue data within a predetermined time period on the target platform. The dialogue data may be, for example, a large amount of online chat data used as training data, such as text sequences or other types of sequences of user input and output collected from social media, instant messaging, online forums, email, and the like.
Computing device 110 may perform data cleaning on emojis, pictures, non-text, and abnormal data; merge dialogue data in which one person sends multiple sentences; and perform preliminary segmentation of the merged dialogue data at the intervals between messages, thereby obtaining a preprocessed corpus.
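A hedged sketch of this preprocessing, assuming dialogue records arrive as `(timestamp, speaker, text)` tuples; the bracket-tag regex and the one-hour session gap are illustrative choices, not values from the disclosure:

```python
import re

def preprocess(messages, gap_seconds=3600):
    """messages: list of (timestamp, speaker, text) tuples, time-ordered.

    1) drop emoji/image tags and other non-text noise,
    2) merge consecutive messages from the same speaker,
    3) cut the stream into sessions at long time gaps.
    """
    cleaned = []
    for ts, who, text in messages:
        text = re.sub(r"\[[^\]]*\]", "", text)   # strip [emoji]/[image] tags
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append((ts, who, text))

    sessions, current = [], []
    for ts, who, text in cleaned:
        if current and ts - current[-1][0] > gap_seconds:
            sessions.append(current)             # split on long silence
            current = []
        if current and current[-1][1] == who:    # merge one person's
            prev_ts, _, prev_text = current[-1]  # consecutive sentences
            current[-1] = (ts, who, prev_text + " " + text)
        else:
            current.append((ts, who, text))
    if current:
        sessions.append(current)
    return sessions

msgs = [(0, "a", "hi"), (10, "a", "there"), (20, "b", "[img]"),
        (30, "b", "yo"), (9999, "a", "new topic")]
sessions = preprocess(msgs)
```

Desensitization (removal of personal identifiers) is assumed to have happened upstream, as the disclosure states the data is already desensitized.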
In step 204, computing device 110 may perform a tokenization process on the obtained preprocessed corpus to obtain tokens for the language processing model.
In one embodiment, the computing device 110 may segment the preprocessed corpus according to the maximum input length of the GPT model, thereby constructing a plurality of word segmentation fragments (tokens). For example, the corpus is tokenized into word segmentation samples at a maximum length of 512/1024 tokens.
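The segmentation by maximum model length can be sketched as a simple chunking step over an already-tokenized id stream (the 1024 limit mirrors the 512/1024 example above; how ids are produced by the tokenizer is outside this sketch):

```python
def chunk_tokens(token_ids, max_len=1024):
    """Split a tokenized corpus into model-sized segments (e.g. 512/1024)."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

chunks = chunk_tokens(list(range(2500)), max_len=1024)
# -> segment lengths [1024, 1024, 452]
```

The final, shorter segment is typically padded or packed with the next document during batching; that choice is left open here.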
Computing device 110 may tag the constructed word segmentation fragments with the user's attribute features and session features according to the session content, thereby generating tagged data. The user's attribute features include basic attributes such as age, gender, and occupation, and may specifically include age, gender, hometown, address, emotional state, and the like. Session features are related to the content of the session and include the emotion (positive, negative) and tone (humorous, angry) of the session, as well as features such as the time at which the session occurred and the topic of the session. The training data is passed through an emotion model, a style model, and other models, and the data is tagged with emotion, style, and other features.
The computing device 110 may input the tagged data into an audit model and, based on predetermined filtering conditions, clean out the data that meets those conditions, thereby obtaining clean word segmentation data. Computing device 110 may pull the desensitized dialogue data of the past year from the platform and clean out non-text/emoji data, abnormal data, and the like; the training data is passed through the audit model with a violation threshold set, and violating data is cleaned out, improving data security.
Computing device 110 may also tag the clean word segmentation data with topic features based on a topic model, and convert the word segmentation training data tagged with topic features into a memory-mapped file (mmap) data format, thereby reducing the memory usage of very-large-scale data training. For example, the data is passed through topic models (emotion and style models) and tagged with topic features such as different emotion and style labels; the proportion of high-quality content data is expanded by means of interest entity model recognition or keyword library matching; and the feature-tagged training data is converted into the mmap data format, which can effectively reduce memory usage when training the model on very-large-scale data.
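A small sketch of the mmap conversion using `numpy.memmap`; the `uint16` dtype (sufficient for vocabularies below 65536) and the temp-file path are illustrative assumptions:

```python
import os
import tempfile

import numpy as np

def write_mmap(token_ids, path):
    """Persist token ids as a flat binary file readable via np.memmap."""
    arr = np.asarray(token_ids, dtype=np.uint16)  # uint16: vocab < 65536
    arr.tofile(path)
    return arr.shape[0]

def open_mmap(path, n):
    # np.memmap pages data in from disk on demand, so a very large training
    # corpus never has to fit in RAM at once.
    return np.memmap(path, dtype=np.uint16, mode="r", shape=(n,))

path = os.path.join(tempfile.gettempdir(), "train_tokens.bin")  # example path
n = write_mmap([1, 2, 3, 40000], path)
data = open_mmap(path, n)
```

During training, random slices of `data` can be read directly without loading the whole file, which is the memory saving the disclosure refers to.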
At step 206, computing device 110 may distribute the same batch of word segmentation samples acquired to different computing devices to perform training of the generated pre-training transformation model, thereby acquiring gradient computation results corresponding to each process.
In one embodiment, computing device 110 may divide the acquired word segmentation of the same batch into a plurality of sub-batches, each sub-batch containing one or more word segmentation samples. The word segmentation samples obtained in step 204 may be input as a training set to the GPT model for training. Specifically, word segmentation samples may be fed to the GPT model in batches, with the word segmentation of the same batch divided into sub-batches, each sub-batch containing one or more word segmentation samples.
The computing device 110 may assign each sub-batch to a different computing device, where each computing device includes one or more processors and storage units, the processors being central processing units, graphics processing units, or tensor processing units of different types or specifications that compute cooperatively. For example, training with the DeepSpeed ZeRO-Offload mechanism utilizes both CPU and GPU memory, allowing a model roughly ten times larger to be trained on a single GPU card. The computing device 110 may also use activation checkpointing techniques to reduce memory utilization during training.
Computing device 110 may perform forward and backward propagation on each sub-batch using the GPT model on each computing device, then aggregate the gradients of each computing device and update the GPT model parameters, to obtain the gradient calculation result corresponding to each process. In particular, the sub-batches of data may be split across different devices for forward and backward propagation, the gradients aggregated by an all-reduce algorithm, and an optimizer step then applied to obtain the updated parameters.
In one embodiment, computing device 110 uses half precision, e.g., fp16, to accelerate training. The computing device may also accelerate computation using sparse attention kernel techniques, which support longer input sequences and faster execution while maintaining accuracy.
In particular, the computing device 110 may enable data parallelism through torch.nn.parallel.DistributedDataParallel. FIG. 3 illustrates a schematic diagram of data parallelism in accordance with an embodiment of the present disclosure. As shown in FIG. 3, computing device 110 runs multiple processes simultaneously: the model is broadcast to the processes, each process maintains its own copy of the model weights and its own data shard for each GPU, each GPU performs forward computation and gradient computation on its own data shard, the GPUs synchronize their respective gradients using an all-reduce algorithm, and each process applies the gradient update to its local model copy.
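The data-parallel loop of FIG. 3 — per-process gradient computation on a local data shard, followed by an all-reduce and identical local updates — can be simulated in pure Python. This is a toy stand-in for the behavior of DistributedDataParallel, not the actual torch implementation; the model and loss are invented for illustration:

```python
# Toy simulation of data parallelism: each "process" holds its own copy
# of the weights, computes a gradient on its own sub-batch, then an
# all-reduce averages the gradients so every replica applies the same
# update and the replicas stay synchronized.
def local_gradient(weight, sub_batch):
    # Stand-in for forward + backward on one device: gradient of
    # 0.5 * (weight * x - x)^2 w.r.t. weight, averaged over the sub-batch.
    return sum((weight * x - x) * x for x in sub_batch) / len(sub_batch)

def all_reduce_mean(values):
    # Stand-in for the all-reduce collective: average across processes.
    return sum(values) / len(values)

weight = 0.0                            # initial weights, broadcast to all
sub_batches = [[1.0, 2.0], [3.0, 4.0]]  # one data shard per process
grads = [local_gradient(weight, sb) for sb in sub_batches]
avg_grad = all_reduce_mean(grads)       # synchronize gradients
weights = [weight - 0.1 * avg_grad for _ in sub_batches]  # identical updates
```

After the all-reduce, every replica holds the same updated weight, which is the invariant real data-parallel training maintains.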
At step 208, the computing device 110 may assign different transform (transformer) layers of the generated pre-training transform model for the same word segmentation sample to different computing devices for performing training of the generated pre-training transform model based on the obtained gradient computation results, thereby obtaining a trained multi-layer generated pre-training transform model for generating a target natural language sequence based at least on the trained multi-layer generated pre-training transform model.
In one embodiment, computing device 110 may split the generated pre-training transformation model into a plurality of segments, each segment containing one or more layers. The computing device 110 divides the GPT model into a plurality of segments. The GPT model uses the Transformer decoder structure with some modifications to the original Transformer decoder: the original decoder contains two multi-head attention sub-layers, while GPT retains only the masked multi-head attention layers, and a segment includes multiple masked multi-head attention layers.
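The masked multi-head attention that GPT retains relies on a causal mask so that each position attends only to itself and earlier positions. A minimal sketch of that mask follows; representing it as a 0/1 list-of-lists is our simplification of what is normally a tensor operation:

```python
# Minimal sketch of the causal ("look-ahead") mask used by GPT's masked
# multi-head attention: position i may attend only to positions j <= i.
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix of 0/1 flags, where
    mask[i][j] == 1 means position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = causal_mask(4)
# Row 0 sees only itself; row 3 sees all four positions.
```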
Computing device 110 may assign each segment to a different computing device, where each computing device includes one or more processors and storage units. The processors are central processing units, graphics processing units, or tensor processing units of different types or specifications that compute cooperatively. For example, the DeepSpeed ZeRO-Offload mechanism may be used for training, thereby utilizing both CPU and GPU memory and allowing a model roughly 10 times larger to be trained on a single GPU card. The computing device 110 may also use activation checkpointing techniques to reduce memory utilization during training.
Computing device 110 may pass different word segmentation samples from the acquired batch to the first computing device and use them to perform forward propagation on the first segment, then pass the output of the first segment to the second computing device and use it to perform forward propagation on the second segment, thereby enabling model-parallel forward propagation.
Computing device 110 may repeatedly use the output of the current segment to perform forward propagation on the next segment until all segments have performed forward propagation on all word segmentation samples; backward propagation is then performed on each segment preceding the next segment using the gradient of the loss function from the next segment, thereby passing the gradient of each segment back to the previous computing device and updating the parameters of the generated pre-training transformation model.
In one embodiment, computing device 110 may split the model across different devices through model/pipeline parallelism, spreading the extra memory overhead required by an oversized model. For example, the GPT model is formed by a plurality of Transformer layers; the model can be divided into different segments, and each segment can be further divided into a plurality of layers placed on different devices, so as to achieve pipeline execution. The computation of each operator can also be divided in parallel across multiple different devices, thereby reducing the memory requirement of each GPU.
FIG. 4 shows a schematic diagram of model parallelism in accordance with an embodiment of the present disclosure. As shown in FIG. 4, the computing device 110 may split the right-hand matrices A and B of the two GEMMs of an MLP in a Transformer layer of the GPT model along the K-axis and N-axis, respectively, and insert an all-reduce operator before the Dropout. That is, different word segmentation samples in the same acquired batch are passed to a first device, which performs forward propagation on the first segment using the word segmentation samples; the output of the first segment is passed to a second device, which performs forward propagation on the second segment using the output of the first segment; the process is repeated until all segments have performed forward propagation on all word segmentation samples; backward propagation is performed on each segment preceding the next segment, inserting all-reduce operators, using the gradient of the loss function from the next segment; and finally, the gradient of each segment is passed to the previous device and the GPT model parameters are updated.
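The segment-by-segment forward pass described above can be sketched as a toy pipeline in pure Python. The segment contents and names are invented for illustration; they stand in for the Transformer layers placed on different devices, not the disclosure's actual model:

```python
# Toy sketch of pipeline (model) parallelism: the model is split into
# segments, each segment is placed on a different "device", and each
# device passes its output forward to the next.
def make_segment(name):
    def forward(activation):
        return activation + [name]  # toy layer: record which segment ran
    return forward

# One segment per device; a real segment would hold Transformer layers.
segments = [make_segment("seg%d" % i) for i in range(3)]

def pipeline_forward(sample):
    activation = sample
    for segment in segments:  # device k passes its output to device k+1
        activation = segment(activation)
    return activation

out = pipeline_forward(["token"])
```

Backward propagation would traverse the same chain in reverse, each device passing the loss gradient back to the previous one, as the surrounding text describes.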
In one embodiment, the method further optionally includes step 210. At step 210, the computing device 110 may insert controlled markers or type symbols in the trained multi-layer generated pre-training transformation model using the predefined word segmentation feature templates, thereby directing the trained generated pre-training transformation model to be controlled to generate the target natural language sequence.
In one embodiment, computing device 110 may convert the acquired marked topic features into feature words. FIG. 5 shows a schematic diagram of controlled generation in accordance with an embodiment of the present disclosure. As shown in FIG. 5, the token features and the dialogue content tokens are spliced together and fed into the trained GPT model. For example, a user feature is one such token feature.
The computing device 110 concatenates the feature word with the word sequence input into the generated pre-training transformation model, for input into the acquired multi-layer generated pre-training transformation model and training, wherein the feature word does not participate in the loss function computation of the generated pre-training transformation model during training. The computing device 110 may splice the session with the token features to perform GPT training. Finally, a natural language sequence consistent with the feature word segmentation is generated based on the training result. For example, based on the user's "happy" feature and the input "I am happy today", a reply such as "Why are you happy?" may be generated.
In one embodiment, the computing device 110 may add controlled features, such as token features, when generation is controlled. In the training stage, the token features do not participate in the model's loss calculation; their influence is eliminated through a loss mask. This feature-addition approach keeps model pre-training and fine-tuning consistent, and subsequent feature extensions can be handled in a relatively uniform manner.
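The loss mask just described — feature tokens spliced into the input but excluded from the loss — can be illustrated with a toy example. The per-token losses, mask layout, and averaging rule here are our assumptions for illustration, not the disclosure's exact implementation:

```python
# Toy illustration of a loss mask: controlled feature tokens are spliced
# in front of the dialogue tokens, but masked out of the loss so they
# steer generation without being learned as ordinary text.
def masked_loss(per_token_losses, loss_mask):
    """Average the loss over positions whose mask is 1 (dialogue tokens),
    ignoring positions whose mask is 0 (controlled feature tokens)."""
    kept = [l for l, m in zip(per_token_losses, loss_mask) if m == 1]
    return sum(kept) / len(kept)

# Two leading feature tokens (mask 0) + three dialogue tokens (mask 1).
losses = [9.0, 9.0, 2.0, 4.0, 6.0]
mask   = [0,   0,   1,   1,   1]
loss = masked_loss(losses, mask)  # -> 4.0: the feature tokens are ignored
```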
Based on the attribute features acquired by the method described above, a feature-controlled language sequence can be generated whose content and length are attributed to the session and user dimensions.
Similarly, emotion, style, and the like can be attributed to msg-dimension feature control within a session; the various feature tokens and text tokens are assembled together and fed into the model to complete fine-tuning training.
Features such as user attributes, style, and emotion can be passed to the GPT model in instruction form, thereby achieving the controlled-generation effect.
FIG. 6 illustrates a flow chart of another method 600 for natural language processing in accordance with an embodiment of the present disclosure. The method 600 may be performed by the computing device 110 shown in FIG. 1 or at the electronic device 700 shown in FIG. 7. It should be understood that method 600 may also include additional blocks not shown and/or that the blocks shown may be omitted; the scope of the disclosure is not limited in this respect.
At step 602, computing device 110 may obtain a pre-trained fine-tuning language model for generating natural language text;
in one embodiment, computing device 110 may use a pre-trained fine-tuning language model for generating natural language text, which may be adapted to specific text and tasks through fine-tuning techniques.
At step 604, computing device 110 may receive the data training set generated by the generated pre-training transformation model and update parameters of the fine-tuning language model by minimizing a loss function;
at step 606, computing device 110 may perform a particular language tuning based on the generated target natural language sequence, thereby generating text consistent with the subject features of the word segmentation corpus.
In one embodiment, the computing device 110 may obtain the natural language sequence output by the above GPT model and, by performing language fine-tuning, better capture text context and language habits, thereby generating more satisfactory text.
In summary, the present disclosure provides a GPT (Generative Pre-trained Transformer) architecture with billions of parameters: this ultra-large GPT model is a deep neural network based on the Transformer architecture that generates high-quality natural language text by pre-training on billions of tokens of data. Prompt technique: in the present disclosure, the Prompt technique is used extensively to optimize and improve the text generation process and to adjust constrained language expression, grammar, intonation, etc., so as to implement functions such as language correction, article reconstruction, language style improvement, vocabulary replacement, and text simplification, making the text more natural, readable, accurate, and vivid. Fine-tuning technique: in addition to pre-training, the present disclosure also uses fine-tuning techniques to apply the pre-trained model to specific tasks, such as style rendering. Iterative learning: the present disclosure employs iterative learning methods to continuously update the model, thereby continuously improving its accuracy and performance.
In general, the biggest innovation of the present disclosure is to fuse multiple advanced NLP techniques together and to use iterative learning methods to continuously improve model performance. This makes the present disclosure a powerful text-polishing tool that can produce high-quality, naturally fluent language text meeting various requirements.
FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. For example, computing device 110 as shown in FIG. 1 may be implemented by electronic device 700. As shown, the electronic device 700 includes a Central Processing Unit (CPU) 701 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 702 or loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the random access memory 703, various programs and data required for the operation of the electronic device 700 may also be stored. The central processing unit 701, the read only memory 702, and the random access memory 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the input/output interface 705, including: an input unit 706 such as a keyboard, mouse, microphone, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The various processes and treatments described above, such as method 200, may be performed by central processing unit 701. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via read only memory 702 and/or communication unit 709. One or more of the acts of the method 200 described above may be performed when a computer program is loaded into random access memory 703 and executed by central processing unit 701.
The present disclosure relates to methods, apparatus, systems, electronic devices, computer readable storage media, and/or computer program products. The computer program product may include computer readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable Compact Disk Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or in-groove protrusion structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge computing devices. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be appreciated by those of ordinary skill in the art that the present disclosure is not limited to the embodiments described above, but may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the illustrated examples and embodiments are to be considered as illustrative and not restrictive, and the disclosure is intended to cover various modifications and substitutions without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (10)

1. A method for natural language processing, comprising:
preprocessing is carried out on the acquired target corpus, so that preprocessed corpus is acquired;
performing word segmentation processing on the preprocessed corpus to obtain word segmentation samples for generating a pre-training transformation (GPT) model;
distributing the acquired word segmentation samples of the same batch to different computing devices to perform training of the generated pre-training transformation model, so as to acquire gradient calculation results corresponding to each process; and
based on the obtained gradient calculation results, different transformation (transformer) layers of the generated pre-training transformation model for the same word segmentation sample are distributed on different computing devices to perform training of the generated pre-training transformation model, thereby obtaining a trained multi-layer generated pre-training transformation model for generating a target natural language sequence based at least on the trained multi-layer generated pre-training transformation model.
2. The method of claim 1, wherein generating a target natural language sequence based at least on the trained multi-layer generation-based pre-training transformation model comprises:
a controlled mark or type symbol is inserted into the trained multi-layer generating pre-training transformation model using a predefined word segmentation feature template, thereby directing the trained generating pre-training transformation model to be controlled to generate a target natural language sequence.
3. The method of claim 1, wherein performing preprocessing on the obtained target corpus comprises:
acquiring the desensitized dialogue data in a preset time period on a target platform;
performing data cleaning on the expression, the picture, the non-text and the abnormal data;
performing data merging on dialogue data of one person and multiple sentences; and
according to the interval time of the dialogue data, preliminary data segmentation is performed on the dialogue data via data merging, thereby obtaining a preprocessed corpus.
4. A method according to claim 3, wherein performing a tokenization process on the preprocessed corpus comprises:
performing segmentation on the preprocessed corpus according to the maximum length of a generated pre-training transformation model, thereby constructing a plurality of word segmentation fragments;
marking the attribute features and conversation features of the user in the constructed word segmentation fragments according to the conversation content, thereby generating marked data;
inputting the marked data into an audit model, and cleaning the data meeting the preset filtering conditions based on the preset filtering conditions, so as to obtain clean word segmentation data;
marking the topic features of the clean word segmentation data based on the topic model; and
and converting the word segmentation training data of the marked theme characteristics into a memory mapping file (mmap) data format, thereby reducing the memory use of the ultra-large-scale data training.
5. The method of claim 1, wherein assigning different word segmentation samples of the same batch of acquired word segmentations to different computing devices to perform training to generate a pre-training transformation model comprises:
dividing the acquired word segmentation of the same batch into a plurality of sub-batches, wherein each sub-batch comprises one or more word segmentation samples;
assigning each sub-lot to a different computing device, wherein each computing device comprises one or more processors and a storage unit, the processors being processors that cooperatively calculate for different types or specifications of central processors, graphics processors, tensor processors;
performing forward and backward propagation for each sub-batch using a generated pre-training transformation model on each computing device; and
the gradients of each computing device are aggregated and the generated pre-trained transformation model parameters are updated to obtain gradient calculations corresponding to each process.
6. The method of claim 1, wherein assigning different transformation layers of the generated pre-training transformation model for the same word segmentation sample to different computing devices to perform training of the generated pre-training transformation model comprises:
splitting the generated pre-training transformation model into a plurality of segments, each segment comprising one or more layers;
assigning each segment to a different computing device, wherein each computing device includes one or more processors and memory units;
transmitting different word segmentation samples in the acquired word segmentation of the same batch to a first computing device, and performing forward propagation on a first segment by using the word segmentation samples;
passing the output of the first segment to a second computing device and performing forward propagation on the second segment using the output of the first segment;
repeatedly using the output of the current segment to perform forward propagation on the next segment until all segments perform forward propagation on all word segmentation samples;
performing a backward propagation on each segment before the next segment using the gradient of the loss function from the next segment; and
the gradient of each segment is passed to the previous computing device and the parameters of the generated pre-trained transformation model are updated.
7. A method according to claim 3, characterized in that the method further comprises:
acquiring a pre-trained fine-tuning language model for generating natural language text;
receiving a data training set generated by the generated pre-training transformation model and updating parameters of the fine tuning language model by minimizing a loss function; and
based on the generated target natural language sequence, specific language fine tuning is performed, so that text consistent with the topic features of the word segmentation corpus is generated.
8. The method of claim 4, wherein inserting a particular marker or type of symbol in the trained multi-layer generated pre-training transformation model further comprises:
converting the acquired marked theme features into feature words;
splicing the characteristic word and the word sequence input into the generated pre-training transformation model together to input into the acquired multi-layer generated pre-training transformation model and train, wherein the characteristic word does not participate in the loss function operation of the generated pre-training transformation model during training; and
based on the training result, generating a natural language sequence consistent with the feature word segmentation.
9. A computing device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202310309764.8A 2023-03-27 2023-03-27 Method, apparatus and medium for natural language processing Pending CN116796724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310309764.8A CN116796724A (en) 2023-03-27 2023-03-27 Method, apparatus and medium for natural language processing

Publications (1)

Publication Number Publication Date
CN116796724A true CN116796724A (en) 2023-09-22

Family

ID=88046934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310309764.8A Pending CN116796724A (en) 2023-03-27 2023-03-27 Method, apparatus and medium for natural language processing

Country Status (1)

Country Link
CN (1) CN116796724A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination