CN115952461A - Pre-training corpus cleaning method, system and storage medium - Google Patents

Pre-training corpus cleaning method, system and storage medium

Info

Publication number
CN115952461A
Authority
CN
China
Prior art keywords
corpus
segmented
corpora
preset
cleaned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310249970.4A
Other languages
Chinese (zh)
Inventor
华菁云
周同
王宇龙
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd filed Critical Beijing Lanzhou Technology Co ltd
Priority to CN202310249970.4A priority Critical patent/CN115952461A/en
Publication of CN115952461A publication Critical patent/CN115952461A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to a pre-training corpus cleaning method, system and storage medium, comprising the following steps: acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora; segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora; cleaning the plurality of segmented corpora to obtain a plurality of cleaned segmented corpora; and judging whether each segmented and cleaned corpus meets a preset condition. If so, the segmented and cleaned corpus that meets the preset condition is retained; otherwise, the segmented and cleaned corpus that does not meet the preset condition is discarded. The segmented and cleaned corpora to be retained from each corpus are then spliced to obtain the cleaned pre-training corpus. Low-quality junk text is filtered out, the noise during corpus training and the total amount of training corpus for the pre-training model are reduced, and model training efficiency is improved, thereby solving the problems of high data noise and low training efficiency.

Description

Pre-training corpus cleaning method, system and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a system for cleaning pre-training corpus and a storage medium.
Background
As pre-training model technology in natural language processing is applied ever more widely in real life, for example in dialogue systems, question-answering systems, marketing copy generation, novel continuation, document generation and translation systems, more and higher-quality corpora are required for pre-training models. Corpus preprocessing, as the first step of the pre-training pipeline, is the foundation of subsequent model pre-training and is therefore of great importance. If the corpus is not properly processed, it contains a large amount of noise, such as text with numerous grammatical errors, incoherent text, advertisements and web page code, which seriously affects the quality and efficiency of model pre-training.
In today's information age, abundant corpus resources are available on the web and on social media. To make better use of these corpus data for subsequent natural-language research, such as text classification, question answering, dialogue, reading comprehension and entity extraction, text data from different sources and in different forms must be preprocessed to obtain a clean, high-quality corpus. In the prior art, corpora are generally cleaned by rule matching based on regular expressions and the like, but this approach has drawbacks: the labor cost is high, because a large number of rules and templates must be written manually; adaptability and flexibility are poor, since cleaning relies entirely on the hand-written rules and thresholds; and semantic information is not considered when the corpus is processed, which is simply truncated, so generalization is poor and a large amount of invalid and harmful information remains uncleaned. As a result, large-scale pre-training corpora contain excessive data noise, which affects pre-training efficiency and quality.
Disclosure of Invention
In order to solve the prior-art problems of high data noise and low efficiency during training, the invention provides a pre-training corpus cleaning method, system and storage medium.
The technical solution adopted by the invention to solve the above technical problem is to provide a pre-training corpus cleaning method, which comprises the following steps:
acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora, and segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora, wherein each corpus can be segmented into one or more segmented corpora;
cleaning the plurality of segmented corpora to obtain a plurality of cleaned segmented corpora, and judging whether each segmented and cleaned corpus meets a preset condition;
if so, retaining the segmented and cleaned corpora that meet the preset condition, and if not, discarding the segmented and cleaned corpora that do not meet the preset condition;
and splicing the segmented and cleaned corpora that need to be retained in each corpus to obtain the cleaned pre-training corpus.
Preferably, the step of acquiring the preset corpus and segmenting the plurality of corpora based on the preset method to obtain the plurality of segmented corpora specifically comprises the following steps:
acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora;
traversing each corpus in a preset corpus, taking any one corpus in the preset corpus as a current corpus, and judging whether the length of the current corpus is smaller than or equal to the preset length;
if so, taking the current corpus as the segmented corpus;
if not, segmenting the current corpus by taking a period as a partition to obtain a plurality of single sentences of the current corpus, splicing the plurality of single sentences of the current corpus in sequence to obtain a plurality of spliced texts, wherein the length of the spliced texts is less than or equal to the preset length, and recording the plurality of spliced texts as a plurality of segmented corpora.
Preferably, the preset length ranges from 64 characters to 512 characters.
Preferably, the corpus is cleaned through a part-of-speech tagging feature module and a text representation module.
Preferably, the step of cleaning the plurality of segmented corpora to obtain the plurality of cleaned segmented corpora and judging whether each segmented and cleaned corpus meets the preset condition specifically comprises the following steps:
respectively inputting a plurality of segmented corpora into a part-of-speech tagging feature module and a text representation module to respectively obtain M-dimensional part-of-speech feature vectors and N-dimensional text representation vectors;
splicing the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each segmented corpus in a preset splicing mode to obtain M + N-dimensional vector representation of each segmented corpus;
and inputting the M+N-dimensional vector representation of each segmented corpus into a tree model classifier, and judging from the output score of the tree model classifier whether the corpus meets the preset condition: if the output score is 1, the preset condition is met; if the output score is 0, the preset condition is not met.
Preferably, the tree model classifier is a trained LightGBM model.
Preferably, the preset splicing mode is a Concatenate splicing mode.
Preferably, the loss function for training the tree model classifier is:
L = -(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]
where N is the number of samples, yi is the actual label of the ith sample, and pi is the prediction probability that the ith sample is predicted to be 1.
The invention also provides a pre-training corpus cleaning system for solving the technical problems, which comprises the following modules:
an acquisition module: used for acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora, and segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora;
a cleaning module: used for cleaning the plurality of segmented corpora and judging whether each segmented and cleaned corpus meets a preset condition;
a processing module: used for retaining the segmented and cleaned corpora that meet the preset condition and discarding the segmented and cleaned corpora that do not meet the preset condition;
a splicing module: used for splicing the segmented and cleaned corpora that need to be retained in each corpus to obtain the cleaned pre-training corpus.
The present invention further provides a storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement any one of the above pre-training corpus cleaning methods.
Compared with the prior art, the method, the system and the storage medium for cleaning the pre-training corpus, disclosed by the invention, have the following advantages:
1. A plurality of corpora in a preset corpus are acquired and segmented based on a preset method to obtain a plurality of segmented corpora, where each corpus can be segmented into one or more segmented corpora. The segmented corpora are cleaned to obtain a plurality of cleaned segmented corpora, and whether each segmented and cleaned corpus meets a preset condition is judged: if so, the segmented and cleaned corpus that meets the preset condition is retained; if not, it is discarded. The segmented and cleaned corpora to be retained from each corpus are spliced to obtain the cleaned pre-training corpus.
2. In the pre-training corpus cleaning method, each corpus in the preset corpus is traversed, any corpus in the preset corpus is taken as the current corpus, and whether the length of the current corpus is less than or equal to the preset length is judged. If so, the corpus does not need to be segmented, and the current corpus, being no longer than the preset length, can be used directly as a segmented corpus. If not, the current corpus is split at periods to obtain a plurality of single sentences, and the single sentences are spliced in order to obtain a plurality of spliced texts, each of which must be no longer than the preset length; the remaining sentences are spliced into a further spliced text, which is likewise guaranteed to be no longer than the preset length.
3. The corpus is cleaned through a part-of-speech tagging feature module and a text representation module. During tagging, each segmented corpus is input into the part-of-speech tagging feature module and the text representation module simultaneously: the part-of-speech tagging feature module performs part-of-speech tagging on the segmented corpus and counts the various part-of-speech tagging features in it, while the text representation module is based on the Mengzi-BERT-large model and takes the hidden state of the last hidden layer of the Mengzi-BERT-large model as its output. The outputs of the part-of-speech tagging feature module and the text representation module are spliced and used as the input of the subsequent tree model classifier, which judges whether the corpus needs to be retained.
4. In the pre-training corpus cleaning method, the segmented corpora are simultaneously input into the part-of-speech tagging feature module and the text representation module to obtain M-dimensional part-of-speech feature vectors and N-dimensional text representation vectors, respectively; the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each segmented corpus are spliced in a preset manner to obtain the M+N-dimensional vector representation of each corpus; finally, the M+N-dimensional vector representation of each corpus is input into the trained tree model classifier, which judges which corpora need to be retained and which do not. This judgment by the tree model classifier filters out low-quality junk text and reduces the total amount of corpus used in pre-training, thereby improving the pre-training efficiency of the model.
5. The trained LightGBM model has the advantages of fast training, high accuracy and automatic adjustment of the decision tree structure, and it can process a large amount of data, which speeds up the filtering of low-quality text and reduces corpus noise during subsequent pre-training.
6. The Concatenate splicing mode splices the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each corpus together, so that they can be input into the preset model together for training; this is convenient to operate and can also improve the accuracy of the model.
7. The invention also provides a pre-training corpus cleaning system and a storage medium, which have the same beneficial effects as the pre-training corpus cleaning method, and are not repeated herein.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a flowchart of the steps of a pre-training corpus cleaning method according to a first embodiment of the present invention.
FIG. 2 is a flowchart of step S1 of the pre-training corpus cleaning method according to the first embodiment of the present invention.
FIG. 3 is a flowchart of step S2 of the pre-training corpus cleaning method according to the first embodiment of the present invention.
FIG. 4 is a block diagram of a pre-training corpus cleaning system according to a second embodiment of the present invention.
Description of the figures:
1. pre-training corpus cleaning system;
10. acquisition module; 20. cleaning module; 30. processing module; 40. splicing module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, a first embodiment of the present invention provides a method for cleaning pre-training corpus, including the following steps:
s1: acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora, segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora, and each corpus can be segmented to obtain one or more segmented corpora;
s2: cleaning the plurality of segmented corpora to obtain a plurality of cleaned segmented corpora, and judging whether each segmented and cleaned corpus meets a preset condition;
s3: if so, retaining the segmented and cleaned corpus meeting the preset condition, and if not, discarding the segmented and cleaned corpus not meeting the preset condition.
S4: and splicing the segmented and cleaned corpora which need to be reserved in each corpus to obtain the cleaned pre-trained corpora.
It can be understood that, in the steps of the invention, a plurality of corpora are first acquired and segmented based on the preset length to obtain corpora of the required length, and the corpora meeting the length requirement are then cleaned. Low-quality, invalid and harmful text content can thus be filtered out and the noise during corpus training reduced, which benefits subsequent model pre-training and improves the quality of the resulting pre-trained model.
Referring to fig. 2, step S1 specifically includes the following steps:
s11: acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora;
s12: traversing each corpus in a preset corpus, taking any one corpus in the preset corpus as a current corpus, and judging whether the length of the current corpus is less than or equal to the preset length;
s13: if so, taking the current corpus as the segmented corpus;
s14: if not, segmenting the current corpus by taking a period as a partition to obtain a plurality of single sentences of the current corpus, splicing the plurality of single sentences of the current corpus in sequence to obtain a plurality of spliced texts, wherein the length of the spliced texts is less than or equal to the preset length, and recording the plurality of spliced texts as a plurality of segmented corpora.
It can be understood that, in the steps of the present invention, each corpus in the preset corpus is traversed, any corpus in the preset corpus is taken as the current corpus, and whether the length of the current corpus is less than or equal to the preset length is judged. If so, the corpus does not need to be segmented, and the current corpus, being no longer than the preset length, can be used directly as a segmented corpus. If not, the current corpus is split at periods to obtain a plurality of single sentences, and the single sentences are spliced in order to obtain a plurality of spliced texts, each of which must be no longer than the preset length; the remaining sentences are spliced into another spliced text, which is likewise guaranteed to be no longer than the preset length.
It can be understood that the corpus cleaned by the method for cleaning the pre-trained corpus according to the present invention is compatible with the pre-trained models, including but not limited to BERT, T5, GPT, etc.
As an alternative embodiment, the predetermined length may range from 64 characters to 512 characters.
Optionally, the preset length may be 64 characters, 128 characters, 256 characters or 512 characters, and other character lengths in the 64-512 characters may also be selected according to the actual application scenario.
According to an embodiment of the invention, if the length of the source text is 600 characters, the source text is split at periods into four sentences whose lengths are 160, 130, 200 and 110 characters respectively. If the preset length is 512 characters, the split sentences are spliced in the original order of the source text, and each spliced text must be no longer than 512 characters. Since 160+130+200=490 characters and 490 characters < 512 characters, while 160+130+200+110=600 characters > 512 characters, the first three sentences of the source text are spliced in order to obtain a first spliced text, and the fourth sentence of the source text is used as a second spliced text. Splicing in the original order keeps the semantics of the spliced text close to the original text, which improves the semantic understanding ability of the model during training.
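By way of illustration only (not part of the patent text), the segmentation step described above can be sketched in Python as follows; the function name and the handling of a single sentence that already exceeds the preset length are assumptions:

```python
import re

def segment_corpus(text: str, max_len: int = 512) -> list[str]:
    """Split a corpus at full stops and greedily pack consecutive
    sentences into spliced texts no longer than max_len characters."""
    if len(text) <= max_len:
        return [text]
    # Split after Chinese or Western full stops, preserving sentence order.
    sentences = [s for s in re.split(r"(?<=[。.])", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) <= max_len:
            current += sentence          # the sentence still fits in the current spliced text
        else:
            if current:
                chunks.append(current)   # close the finished spliced text
            current = sentence           # start a new spliced text with this sentence
    if current:
        chunks.append(current)
    return chunks

# The 600-character example above: sentences of 160/130/200/110 characters
# yield two spliced texts of 490 and 110 characters respectively.
```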
Furthermore, the corpus is cleaned through a part-of-speech tagging feature module and a text representation module.
The corpus is cleaned through a part-of-speech tagging feature module and a text representation module. During tagging, each segmented corpus is simultaneously input into the part-of-speech tagging feature module and the text representation module: the part-of-speech tagging feature module performs part-of-speech tagging on the segmented corpus and counts the various part-of-speech tagging features in it, while the text representation module is based on the Mengzi-BERT-large model and takes the hidden state of the last hidden layer of the Mengzi-BERT-large model as its output. The outputs of the part-of-speech tagging feature module and the text representation module are spliced and used as the input of the subsequent tree model classifier, which judges whether the corpus needs to be retained.
It should be noted that the text characterization module may be based on one of a Mengzi-BERT-large model, a Mengzi-BERT-base model, or a Mengzi-BERT-3B model. Preferably, in the embodiment of the present invention, the preset model is a Mengzi-BERT-large model. Compared with the Mengzi-BERT-large model, the Mengzi-BERT-base model can reduce consumption of hardware resources, and the Mengzi-BERT-3B model can obtain higher extraction accuracy.
It should be noted that the part-of-speech features are ten in total, and are specifically as follows:
[Table of the ten part-of-speech tagging features, provided as a figure in the original publication.]
Furthermore, the part-of-speech tagging feature module in the embodiment of the present invention uses the part-of-speech tagging model in HanLP; that is, in the embodiment of the present invention, the M-dimensional vector is obtained through HanLP's part-of-speech tagging model. The text characterization module adopts a BERT model, and the hidden state of the last hidden layer of the BERT model is taken as its output, an N-dimensional vector.
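A minimal sketch of this feature-extraction step is given below, again only as an illustration. The HuggingFace checkpoint name, the use of the [CLS] position of the last hidden layer, and the list of part-of-speech categories are assumptions (the patent specifies Mengzi-BERT-large with N = 1024 and HanLP part-of-speech tagging, and lists the ten features only in a figure):

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the patent uses Mengzi-BERT-large (hidden size 1024).
tokenizer = AutoTokenizer.from_pretrained("Langboat/mengzi-bert-base")
bert = AutoModel.from_pretrained("Langboat/mengzi-bert-base")

# Placeholder part-of-speech categories; the actual ten features are given in a figure.
POS_TAGS = ["n", "v", "a", "d", "p", "c", "u", "m", "r", "w"]

def pos_feature_vector(tokens_with_tags: list[tuple[str, str]]) -> np.ndarray:
    """M-dimensional vector: relative frequency of each part-of-speech tag
    in a segmented corpus (the tags would come from HanLP in the patent)."""
    counts = np.zeros(len(POS_TAGS))
    for _, tag in tokens_with_tags:
        if tag in POS_TAGS:
            counts[POS_TAGS.index(tag)] += 1
    return counts / max(len(tokens_with_tags), 1)

def text_representation_vector(text: str) -> np.ndarray:
    """N-dimensional vector taken from the last hidden layer of the BERT
    encoder (here the [CLS] position, one reading of the patent)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        last_hidden = bert(**inputs).last_hidden_state   # (1, seq_len, hidden)
    return last_hidden[0, 0].numpy()                     # vector at the [CLS] position
```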
Referring to fig. 3, step S2 specifically includes the following steps:
s21: respectively inputting a plurality of segmented corpora into a part-of-speech tagging feature module and a text representation module to respectively obtain M-dimensional part-of-speech feature vectors and N-dimensional text representation vectors;
s22: splicing the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each segmented corpus in a preset splicing mode to obtain M + N-dimensional vector representation of each segmented corpus;
s23: inputting the M+N-dimensional vector representation of each segmented corpus into a tree model classifier, and judging from the output score of the tree model classifier whether the corpus meets the preset condition: if the output score is 1, the preset condition is met; if the output score is 0, the preset condition is not met.
It should be noted that a complete corpus cleaning process is as follows: the segmented corpora are simultaneously input into the part-of-speech tagging feature module and the text representation module to obtain, respectively, an M-dimensional part-of-speech feature vector and an N-dimensional text representation vector; the two vectors of each corpus are combined in a preset splicing manner to obtain the M+N-dimensional vector representation of each corpus; the M+N-dimensional vector representation of each corpus is then input into the trained tree model classifier module for binary classification, yielding the corpora that need to be retained and those that do not, i.e. the cleaned segmented corpora.
It should be noted that, if the output of the tree model classifier is 1, it indicates that the corpus is a valid text and is a non-junk text that needs to be reserved, and if the output of the tree model classifier is 0, it indicates that the corpus is a non-valid text and does not need to be reserved.
In the method, the segmented corpora are simultaneously input into the part-of-speech tagging feature module and the text representation module to obtain M-dimensional part-of-speech feature vectors and N-dimensional text representation vectors, respectively; the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each segmented corpus are spliced in a preset manner to obtain the M+N-dimensional vector representation of each corpus; finally, the M+N-dimensional vector representation of each corpus is input into the trained tree model classifier, which judges which corpora need to be retained and which do not. This filtering of low-quality junk text reduces the total amount of corpus used during pre-training and thereby improves the pre-training efficiency of the model.
In the embodiment of the present invention, M is 10 and N is 1024.
As an optional embodiment, the tree model classifier module is a trained LightGBM model.
It can be understood that the trained LightGBM model of the present invention has the advantages of fast training speed, high accuracy, and automatic adjustment of the decision tree structure, and the trained LightGBM model can process a large amount of data, thereby speeding up the filtering of low quality text and reducing the noise of corpus during the subsequent pre-training.
It should be noted that, in the embodiment of the present invention, the trained LightGBM tree model classifier performs a binary judgment on the segmented corpora to determine whether they are retained: a retained corpus is output as 1 and an unretained corpus as 0. That is, by inputting the M+N-dimensional vector representation of each corpus into the tree model classifier module, the corpora with output 1 are obtained and retained, and these retained corpora are then used for subsequent training.
Further, the LightGBM model in the embodiment of the present invention needs to be trained in advance, and the training process is as follows: a thousand labeled data items are provided and divided into data to be retained and data not to be retained; the data are segmented into texts of the preset length; each item is then input simultaneously into the part-of-speech tagging feature module and the text representation module to obtain its M+N-dimensional vector; the M+N-dimensional vector of each item is input into the LightGBM model for training, and whether the output of the tree model classifier is consistent with the true label is judged, so as to train the model.
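A training sketch under these assumptions is shown below; the hyperparameters and the train/validation split are illustrative rather than taken from the patent, and random vectors stand in for the M+N-dimensional features of the labeled data:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

# X: (num_samples, M + N) spliced feature vectors; y: 1 = retain, 0 = discard.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10 + 1024))
y = rng.integers(0, 2, size=1000)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = lgb.LGBMClassifier(objective="binary", n_estimators=200, learning_rate=0.05)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric="binary_logloss")

# Output 1 means the corpus meets the preset condition and is retained.
keep_mask = clf.predict(X_val) == 1
```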
It should be noted that the reserved text is a valid text, and the text that is not required to be reserved is a non-valid text.
As an alternative embodiment, the preset splicing mode is the Concatenate splicing mode.
It can be understood that the Concatenate splicing mode in the embodiment of the present invention splices the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each corpus together, so that they can be input into the preset model together for training; this is convenient to operate and can also improve the accuracy of the model.
It should be noted that the Concatenate splicing mode concatenates the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each corpus.
Further, in one embodiment of the present invention, the part-of-speech tagging vector is an M=10-dimensional vector [0.32, 0.51, 0.23, 0.12, 0.45, 0.01, 0.12, 0.43, 0.45, 0.21], the text characterization vector is an N=1024-dimensional vector [-0.0106, -0.0101, -0.0144, -0.0115, ..., -0.0116, -0.0173, -0.0071, -0.0083, -0.0070], and the Concatenate result is [0.32, 0.51, 0.23, 0.12, 0.45, 0.01, 0.12, 0.43, 0.45, 0.21, -0.0106, -0.0101, -0.0144, -0.0115, ..., -0.0116, -0.0173, -0.0071, -0.0083, -0.0070].
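In code, this splicing step is simply a vector concatenation; a short NumPy sketch (with a random stand-in for the 1024-dimensional text characterization vector):

```python
import numpy as np

pos_features = np.array([0.32, 0.51, 0.23, 0.12, 0.45, 0.01, 0.12, 0.43, 0.45, 0.21])  # M = 10
text_features = np.random.normal(size=1024)  # stand-in for the N = 1024-dimensional BERT vector

combined = np.concatenate([pos_features, text_features])  # shape (M + N,) = (1034,)
assert combined.shape == (1034,)
```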
As an alternative embodiment, the loss function for training the tree model classifier is:
L = -(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]
where N is the number of samples, yi is the actual label of the ith sample, and pi is the prediction probability that the ith sample is predicted to be 1.
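For reference, a direct NumPy rendering of this loss (a sketch only; when the objective is set to binary, LightGBM computes the equivalent log loss internally):

```python
import numpy as np

def binary_cross_entropy(y_true: np.ndarray, p_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Mean log loss over N samples, where y_true holds the actual labels
    and p_pred the predicted probabilities of class 1."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```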
It can be understood that the loss function is used to measure the difference between the predicted value and the true value of the preset model, so as to optimize the parameters of the preset model to achieve the best prediction effect.
In one embodiment of the present invention, if the length of the source text is 600 characters, the source text is split at periods into four sentences whose lengths are 160, 130, 200 and 110 characters respectively. If the preset length is 512 characters, the split sentences are spliced in the original order of the source text, and each spliced text must be no longer than 512 characters. Since 160+130+200=490 characters and 490 characters < 512 characters, while 160+130+200+110=600 characters > 512 characters, the first three sentences of the source text are spliced in order to obtain a first spliced text, and the fourth sentence of the source text is used as a second spliced text. The first and second spliced texts, both of which are segmented corpora, are each input simultaneously into the part-of-speech tagging feature module and the text characterization module to obtain their respective M+N-dimensional feature vectors, and the M+N-dimensional vectors of the first and second spliced texts are then input into the trained tree model classifier for scoring. If the output for the first spliced text is 1 and the output for the second spliced text is 0, the first spliced text is a valid text and needs to be retained, while the second spliced text is an invalid text and does not need to be retained.
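Putting the steps together, the cleaning flow of this embodiment can be sketched as follows; the helper functions reuse the sketches above and are assumptions rather than the patent's own code, and pos_tagger stands in for HanLP's part-of-speech model:

```python
import numpy as np

def clean_corpus(text: str, clf, pos_tagger, max_len: int = 512) -> str:
    """Segment a corpus, score each segment with the trained tree model
    classifier, and splice the retained segments back in their original order.
    Reuses segment_corpus, pos_feature_vector and text_representation_vector
    from the sketches above."""
    retained = []
    for segment in segment_corpus(text, max_len):
        pos_vec = pos_feature_vector(pos_tagger(segment))   # M-dimensional part-of-speech features
        text_vec = text_representation_vector(segment)      # N-dimensional text representation
        features = np.concatenate([pos_vec, text_vec])[None, :]
        if clf.predict(features)[0] == 1:                    # 1 = meets the preset condition
            retained.append(segment)
    return "".join(retained)
```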
Referring to fig. 4, a pre-training corpus cleaning system 1 according to a second embodiment of the present invention includes the following modules:
the acquisition module 10: used for acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora, and segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora;
the cleaning module 20: used for cleaning the plurality of segmented corpora and judging whether each segmented and cleaned corpus meets a preset condition;
the processing module 30: used for retaining the segmented and cleaned corpora that meet the preset condition and discarding the segmented and cleaned corpora that do not meet the preset condition;
the splicing module 40: used for splicing the segmented and cleaned corpora that need to be retained in each corpus to obtain the cleaned pre-training corpus.
It can be understood that the modules of the pre-training corpus cleaning system 1 need to use the pre-training corpus cleaning method provided in the first embodiment when operating. Therefore, integrating the acquisition module 10, the cleaning module 20, the processing module 30 and the splicing module 40, or configuring different hardware to produce functions similar to the effects achieved by the present invention, falls within the protection scope of the present invention.
A third embodiment of the present invention provides a storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the pre-training corpus cleaning method according to the first embodiment of the present invention.
It will be appreciated that the processes described above with reference to the flowcharts may be implemented as computer software programs, in accordance with the disclosed embodiments of the invention. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium described in this application can be a computer readable signal medium or a storage medium or any combination of the two. A storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided herein, it should be understood that "B corresponding to a" means that B is associated with a from which B can be determined. It should also be understood, however, that determining B from a does not mean determining B from a alone, but may also be determined from a and/or other information.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Compared with the prior art, the method, the system and the storage medium for cleaning the pre-training corpus, provided by the invention, have the following advantages:
1. A plurality of corpora in a preset corpus are acquired and segmented based on a preset method to obtain a plurality of segmented corpora, where each corpus can be segmented into one or more segmented corpora. The segmented corpora are cleaned to obtain a plurality of cleaned segmented corpora, and whether each segmented and cleaned corpus meets a preset condition is judged: if so, the segmented and cleaned corpus that meets the preset condition is retained; if not, it is discarded. The segmented and cleaned corpora to be retained from each corpus are spliced to obtain the cleaned pre-training corpus.
2. In the pre-training corpus cleaning method, each corpus in the preset corpus is traversed, any corpus in the preset corpus is taken as the current corpus, and whether the length of the current corpus is less than or equal to the preset length is judged. If so, the corpus does not need to be segmented, and the current corpus, being no longer than the preset length, can be used directly as a segmented corpus. If not, the current corpus is split at periods to obtain a plurality of single sentences, and the single sentences are spliced in order to obtain a plurality of spliced texts, each of which must be no longer than the preset length; the remaining sentences are spliced into a further spliced text, which is likewise guaranteed to be no longer than the preset length.
3. The corpus is cleaned through a part-of-speech tagging feature module and a text representation module. During tagging, each segmented corpus is input into the part-of-speech tagging feature module and the text representation module simultaneously: the part-of-speech tagging feature module performs part-of-speech tagging on the segmented corpus and counts the various part-of-speech tagging features in it, while the text representation module is based on the Mengzi-BERT-large model and takes the hidden state of the last hidden layer of the Mengzi-BERT-large model as its output. The outputs of the part-of-speech tagging feature module and the text representation module are spliced and used as the input of the subsequent tree model classifier, which judges whether the corpus needs to be retained.
4. In the pre-training corpus cleaning method, the segmented corpora are simultaneously input into the part-of-speech tagging feature module and the text representation module to obtain M-dimensional part-of-speech feature vectors and N-dimensional text representation vectors, respectively; the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each segmented corpus are spliced in a preset manner to obtain the M+N-dimensional vector representation of each corpus; finally, the M+N-dimensional vector representation of each corpus is input into the trained tree model classifier, which judges which corpora need to be retained and which do not. This judgment by the tree model classifier filters out low-quality junk text and reduces the total amount of corpus used in pre-training, thereby improving the pre-training efficiency of the model.
5. The trained LightGBM model has the advantages of fast training, high accuracy and automatic adjustment of the decision tree structure, and it can process a large amount of data, which speeds up the filtering of low-quality text and reduces corpus noise during subsequent pre-training.
6. The Concatenate splicing mode splices the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each corpus together, so that they can be input into the preset model together for training; this is convenient to operate and can also improve the accuracy of the model.
7. The invention also provides a pre-training corpus cleaning system and a storage medium, which have the same beneficial effects as the pre-training corpus cleaning method, and are not repeated herein.
The method, the system and the storage medium for cleaning the pre-training corpus disclosed in the embodiment of the present invention are described in detail, a specific example is applied in the present document to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for the persons skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present description should not be construed as a limitation to the present invention, and any modification, equivalent replacement, and improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A pre-training corpus cleaning method, characterized by comprising the following steps:
acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora, segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora, and each corpus can be segmented to obtain one or more segmented corpora;
cleaning the plurality of segmented corpora to obtain a plurality of cleaned segmented corpora, and judging whether each segmented and cleaned corpus meets a preset condition;
if so, retaining the segmented and cleaned corpus which meets the preset condition, and if not, discarding the segmented and cleaned corpus which does not meet the preset condition;
and splicing the segmented and cleaned corpora which need to be reserved in each corpus to obtain the cleaned pre-trained corpora.
2. The pre-training corpus cleaning method according to claim 1, wherein: the step of acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora, and segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora specifically comprises the following steps:
acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora;
traversing each corpus in a preset corpus, taking any one corpus in the preset corpus as a current corpus, and judging whether the length of the current corpus is smaller than or equal to the preset length;
if so, taking the current corpus as the segmented corpus;
if not, segmenting the current corpus by taking a full stop as a delimiter to obtain a plurality of single sentences of the current corpus, and splicing the plurality of single sentences of the current corpus in sequence to obtain a plurality of spliced texts, wherein the length of each spliced text is less than or equal to the preset length, and recording the plurality of spliced texts as a plurality of segmented corpora.
3. The pre-training corpus cleaning method according to claim 2, wherein: the preset length ranges from 64 characters to 512 characters.
4. The pre-training corpus cleaning method according to claim 1, characterized in that: the corpus is cleaned through a part-of-speech tagging feature module and a text representation module.
5. The pre-training corpus cleaning method according to claim 4, wherein: the step of cleaning the plurality of segmented corpora to obtain a plurality of cleaned segmented corpora and judging whether each segmented and cleaned corpus meets the preset condition specifically comprises the following steps:
respectively inputting a plurality of segmented corpora into a part-of-speech tagging feature module and a text representation module to respectively obtain M-dimensional part-of-speech feature vectors and N-dimensional text representation vectors;
splicing the M-dimensional part-of-speech feature vector and the N-dimensional text representation vector of each segmented corpus in a preset splicing mode to obtain M + N-dimensional vector representation of each segmented corpus;
and inputting the M+N-dimensional vector representation of each segmented corpus into a tree model classifier, and judging from the output score of the tree model classifier whether the corpus meets the preset condition: if the output score is 1, the preset condition is met; if the output score is 0, the preset condition is not met.
6. The pre-training corpus cleaning method according to claim 5, wherein: the tree model classifier is a trained LightGBM model.
7. The pre-training corpus cleaning method according to claim 5, wherein: the preset splicing mode is a Concatenate splicing mode.
8. The pre-training corpus cleaning method according to claim 1, wherein: the loss function for training the tree model classifier is:
L = -(1/N) Σ_{i=1}^{N} [ y_i·log(p_i) + (1 - y_i)·log(1 - p_i) ]
where N is the number of samples, yi is the actual label of the ith sample, and pi is the prediction probability that the ith sample is predicted to be 1.
9. A pre-training corpus cleaning system, characterized in that it comprises the following modules:
an acquisition module: used for acquiring a preset corpus, wherein the preset corpus comprises a plurality of corpora, and segmenting the plurality of corpora based on a preset method to obtain a plurality of segmented corpora;
a cleaning module: used for cleaning the plurality of segmented corpora and judging whether each segmented and cleaned corpus meets a preset condition;
a processing module: used for retaining the segmented and cleaned corpora that meet the preset condition and discarding the segmented and cleaned corpora that do not meet the preset condition;
a splicing module: used for splicing the segmented and cleaned corpora that need to be retained in each corpus to obtain the cleaned pre-training corpus.
10. A storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the pre-training corpus cleaning method according to any one of claims 1-8.
CN202310249970.4A 2023-03-15 2023-03-15 Pre-training corpus cleaning method, system and storage medium Pending CN115952461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310249970.4A CN115952461A (en) 2023-03-15 2023-03-15 Pre-training corpus cleaning method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310249970.4A CN115952461A (en) 2023-03-15 2023-03-15 Pre-training corpus cleaning method, system and storage medium

Publications (1)

Publication Number Publication Date
CN115952461A true CN115952461A (en) 2023-04-11

Family

ID=87289976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310249970.4A Pending CN115952461A (en) 2023-03-15 2023-03-15 Pre-training corpus cleaning method, system and storage medium

Country Status (1)

Country Link
CN (1) CN115952461A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775639A (en) * 2023-08-08 2023-09-19 阿里巴巴(中国)有限公司 Data processing method, storage medium and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network
CN110866107A (en) * 2019-10-12 2020-03-06 浙江大搜车软件技术有限公司 Method and device for generating material corpus, computer equipment and storage medium
CN111667022A (en) * 2020-06-30 2020-09-15 腾讯科技(深圳)有限公司 User data processing method and device, computer equipment and storage medium
CN114548108A (en) * 2022-02-23 2022-05-27 河海大学 Multi-feature-fused power scheduling text entity identification method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network
CN110866107A (en) * 2019-10-12 2020-03-06 浙江大搜车软件技术有限公司 Method and device for generating material corpus, computer equipment and storage medium
CN111667022A (en) * 2020-06-30 2020-09-15 腾讯科技(深圳)有限公司 User data processing method and device, computer equipment and storage medium
CN114548108A (en) * 2022-02-23 2022-05-27 河海大学 Multi-feature-fused power scheduling text entity identification method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775639A (en) * 2023-08-08 2023-09-19 阿里巴巴(中国)有限公司 Data processing method, storage medium and electronic device

Similar Documents

Publication Publication Date Title
CN110349564B (en) Cross-language voice recognition method and device
CN111414479B (en) Label extraction method based on short text clustering technology
CN108875059B (en) Method and device for generating document tag, electronic equipment and storage medium
CN112507190B (en) Method and system for extracting keywords of financial and economic news
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN111626055B (en) Text processing method and device, computer storage medium and electronic equipment
CN111160003B (en) Sentence breaking method and sentence breaking device
CN111090753B (en) Training method of classification model, classification method, device and computer storage medium
CN113780007A (en) Corpus screening method, intention recognition model optimization method, equipment and storage medium
CN111460162A (en) Text classification method and device, terminal equipment and computer readable storage medium
CN111611393A (en) Text classification method, device and equipment
CN113239204A (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN115952461A (en) Pre-training corpus cleaning method, system and storage medium
CN112860871A (en) Natural language understanding model training method, natural language understanding method and device
CN114036907A (en) Text data amplification method based on domain features
CN110287799B (en) Video UCL semantic indexing method and device based on deep learning
CN112711943A (en) Uygur language identification method, device and storage medium
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
US20170293597A1 (en) Methods and systems for data processing
CN114254657B (en) Translation method and related equipment thereof
CN106959945B (en) Method and device for generating short titles for news based on artificial intelligence
CN115186666A (en) Named entity identification method and device, electronic equipment and storage medium
CN115098680B (en) Data processing method, device, electronic equipment, medium and program product
CN113284498B (en) Client intention identification method and device
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230411

RJ01 Rejection of invention patent application after publication