CN110598205A - Splicing method and device of truncated text and computer storage medium - Google Patents

Splicing method and device of truncated text and computer storage medium

Info

Publication number: CN110598205A (granted as CN110598205B)
Application number: CN201910739896.8A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 刘逸哲
Assignee: Dazhu (Hangzhou) Technology Co Ltd
Legal status: Active (granted)
Classification: Machine Translation (AREA)

Abstract

The invention discloses a splicing method and device for truncated text, a computer storage medium and a computing device. The method comprises the following steps: constructing an initial text generation model; inputting an arbitrary truncated text into the initial text generation model, generating the continuation text (the truncated-away portion) corresponding to that truncated text, and splicing the truncated text with its continuation text to obtain a sentence to be discriminated; inputting the sentence to be discriminated into a pre-trained sentence discrimination model to obtain a discrimination result; determining, according to the discrimination result, whether to adjust the model parameters of the initial text generation model, thereby obtaining a target text generation model; and inputting the truncated text to be spliced into the target text generation model, generating the corresponding continuation text, and splicing the truncated text to be spliced with that continuation text. By generating text with the target text generation model, embodiments of the invention improve the accuracy of text generation and hence the accuracy of splicing truncated text.

Description

Splicing method and device of truncated text and computer storage medium
Technical Field
The invention relates to the technical field of text processing, in particular to a splicing method and device of truncated texts, a computer storage medium and computing equipment.
Background
A large amount of truncated, incomplete text exists in raw data such as SMS and in-app push messages, web-crawled user comments, and article summaries. In NLP (Natural Language Processing) practice, such incomplete text is usually either detected and discarded for the sake of text-modeling quality, which wastes data, or spliced directly using timestamps and rule-based heuristics supplied by the data provider, which yields low splicing accuracy. A better solution to this technical problem is therefore needed.
Disclosure of Invention
In view of this, the invention provides a method and an apparatus for splicing truncated text, a computer storage medium and a computing device, which improve the accuracy of the splicing result.
According to an aspect of the present invention, a method for splicing truncated texts is provided, including:
constructing an initial text generation model;
inputting an arbitrary truncated text into the initial text generation model, generating the continuation text corresponding to that truncated text, and splicing the truncated text with its continuation text to obtain a sentence to be discriminated;
inputting the sentence to be discriminated into a pre-trained sentence discrimination model to obtain a discrimination result;
determining, according to the discrimination result, whether to adjust the model parameters of the initial text generation model, to obtain a target text generation model;
inputting the truncated text to be spliced into the target text generation model, generating the continuation text corresponding to the truncated text to be spliced, and splicing the truncated text to be spliced with its continuation text.
Optionally, the building an initial text generation model includes:
collecting Chinese text corpora;
truncating sentences in the Chinese text corpus with a probability given by a preset proportion, to obtain training data pairs each containing a truncated text and its corresponding continuation text;
and training a language model with the training data pairs to obtain the initial text generation model.
Optionally, training a language model by using the training data to obtain an initial text generation model includes:
training the language model with the Generative Pre-Training (GPT) method, taking the truncated text in each training data pair as input and its continuation text as output, to obtain the initial text generation model.
Optionally, determining whether to adjust the model parameters of the initial text generation model according to the discrimination result, to obtain a target text generation model, includes:
if it is determined from the discrimination result that the model parameters of the initial text generation model need to be adjusted, taking the adjusted model as the target text generation model;
and if it is determined from the discrimination result that no adjustment is needed, directly taking the initial text generation model as the target text generation model.
Optionally, after the truncated text to be spliced and its continuation text are spliced, the method further includes:
inputting the target sentence obtained by the splicing into the sentence discrimination model, and determining, according to the discrimination result, whether to take the target sentence as the splicing result.
Optionally, determining whether to take the target sentence as the splicing result according to the discrimination result includes:
if the discrimination result is greater than or equal to a preset threshold, taking the target sentence as the splicing result.
Optionally, the preset proportion is determined according to the proportion of truncated texts to be spliced in the target text set to be processed.
According to another aspect of the present invention, there is provided a splicing apparatus for truncated texts, comprising:
the building module is suitable for building an initial text generation model;
the first processing module is adapted to input an arbitrary truncated text into the initial text generation model, generate the continuation text corresponding to that truncated text, and splice the truncated text with its continuation text to obtain a sentence to be discriminated;
the second processing module is adapted to input the sentence to be discriminated into a pre-trained sentence discrimination model to obtain a discrimination result, and to determine, according to the discrimination result, whether to adjust the model parameters of the initial text generation model, to obtain a target text generation model;
and the third processing module is adapted to input the truncated text to be spliced into the target text generation model, generate the corresponding continuation text, and splice the truncated text to be spliced with its continuation text.
Optionally, the building module is further adapted to:
collecting Chinese text corpora;
truncating sentences in the Chinese text corpus with a probability given by a preset proportion, to obtain training data pairs each containing a truncated text and its corresponding continuation text;
and training a language model with the training data pairs to obtain the initial text generation model.
Optionally, the building module is further adapted to:
training the language model with the Generative Pre-Training (GPT) method, taking the truncated text in each training data pair as input and its continuation text as output, to obtain the initial text generation model.
Optionally, the second processing module is further adapted to:
if it is determined from the discrimination result that the model parameters of the initial text generation model need to be adjusted, taking the adjusted model as the target text generation model;
and if it is determined from the discrimination result that no adjustment is needed, directly taking the initial text generation model as the target text generation model.
Optionally, the third processing module is further adapted to:
after the truncated text to be spliced and its continuation text are spliced, inputting the target sentence obtained by the splicing into the sentence discrimination model, and determining, according to the discrimination result, whether to take the target sentence as the splicing result.
Optionally, the third processing module is further adapted to:
and if the discrimination result is greater than or equal to a preset threshold, taking the target sentence as the splicing result.
According to yet another aspect of the present invention, there is also provided a computer storage medium having computer program code stored thereon, which, when run on a computing device, causes the computing device to perform the above-mentioned method of splicing truncated texts.
According to yet another aspect of the present invention, there is also provided a computing device comprising: a processor; a memory storing computer program code; the computer program code, when executed by the processor, causes the computing device to perform the above-described method of splicing truncated texts.
By means of the above technical scheme, the splicing method for truncated text first constructs an initial text generation model, then adjusts and optimizes it based on a sentence discrimination model to obtain a target text generation model, and finally generates text with the target text generation model; this improves the accuracy of the generated text and hence the accuracy of splicing truncated text.
Furthermore, the Generative Pre-Training (GPT) method adopted when constructing the initial text generation model gives better text-processing performance and scales to large data sets.
The foregoing description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood, and that the above and other objects, features, and advantages may become more readily apparent, embodiments of the invention are described below.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 shows a flow diagram of a method for splicing truncated texts according to an embodiment of the invention;
FIG. 2 illustrates a flow diagram of a method of building an initial text generation model, according to an embodiment of the invention; and
fig. 3 is a schematic structural diagram of a truncated-text splicing apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In order to solve the technical problem, an embodiment of the present invention provides a method for splicing truncated texts. Fig. 1 shows a flowchart of a method for splicing truncated texts according to an embodiment of the present invention. As shown in fig. 1, the method may include the following steps S101 to S105:
step S101, constructing an initial text generation model;
step S102, inputting an arbitrary truncated text into the initial text generation model, generating the continuation text corresponding to that truncated text, and splicing the truncated text with its continuation text to obtain a sentence to be discriminated;
step S103, inputting the sentence to be discriminated into a pre-trained sentence discrimination model to obtain a discrimination result;
step S104, determining, according to the discrimination result, whether to adjust the model parameters of the initial text generation model, to obtain a target text generation model;
and step S105, inputting the truncated text to be spliced into the target text generation model, generating the corresponding continuation text, and splicing the truncated text to be spliced with its continuation text.
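As a rough illustration, steps S101 to S105 can be sketched as a generate-splice-discriminate loop. The function names and the toy dictionary "model" below are purely illustrative stand-ins for the trained generation and discrimination models described in the patent, not an actual implementation:

```python
def build_initial_generator():
    # Step S101 stand-in: a toy "generation model" mapping a truncated
    # prefix to a continuation (a real model would be a trained GPT).
    return {"your verification code is": "123456"}

def generate_continuation(model, truncated):
    # Steps S102/S105: generate the continuation text for a truncated text.
    return model.get(truncated, "")

def discriminate(sentence):
    # Step S103 stand-in: a toy discriminator scoring 1.0 when the
    # sentence ends in a digit string and 0.0 otherwise.
    words = sentence.split()
    return 1.0 if words and words[-1].isdigit() else 0.0

def splice(truncated, model, threshold=0.5):
    # Generate, splice, and keep the sentence only if the discriminator
    # accepts it (the acceptance criterion of steps S103/S104).
    candidate = (truncated + " " + generate_continuation(model, truncated)).strip()
    return candidate if discriminate(candidate) >= threshold else None

model = build_initial_generator()
result = splice("your verification code is", model)
```

In a real system the discriminator's feedback would also drive parameter adjustment of the generator (step S104) rather than only filtering outputs.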
According to the splicing method for truncated text provided by the embodiment of the invention, an initial text generation model is first constructed, then adjusted and optimized based on a sentence discrimination model to obtain a target text generation model, and text is generated with the target text generation model; this improves the accuracy of text generation and hence the accuracy of splicing truncated text.
In the above step S101, an initial text generation model is constructed, and an alternative scheme is provided in the embodiment of the present invention, as shown in fig. 2, the alternative scheme may include the following steps S201 to S203.
Step S201, Chinese text corpora are collected.
In this step, data such as user comments and article summaries may be crawled from the web as the Chinese text corpus, or the corpus may be obtained from a preset database; embodiments of the invention are not limited in this respect.
Step S202, truncating sentences in the Chinese text corpus with a probability given by a preset proportion, to obtain training data pairs each containing a truncated text and its corresponding continuation text.
In this step, the preset proportion may be determined according to the proportion of truncated texts to be spliced in the target text set to be processed, for example 50% or 30%, and may be set according to actual requirements; embodiments of the invention are not limited in this respect.
For example, in the training data pairs {your express order has been / received, your verification code is / 123456}, the text before each slash serves as the "truncated text" and the text after the slash as its corresponding "continuation text".
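The construction of such training pairs in step S202 can be sketched as follows; the toy corpus, the word-level truncation point, and the function name are illustrative assumptions, not the patent's actual procedure:

```python
import random

def make_training_pairs(sentences, truncate_prob=0.5, seed=0):
    """Truncate each sentence with probability `truncate_prob` at a random
    interior position, yielding (truncated_text, continuation_text) pairs."""
    rng = random.Random(seed)
    pairs = []
    for s in sentences:
        words = s.split()
        if len(words) < 2 or rng.random() >= truncate_prob:
            continue  # sentence kept whole; not used as a training pair
        cut = rng.randint(1, len(words) - 1)  # cut between two words
        pairs.append((" ".join(words[:cut]), " ".join(words[cut:])))
    return pairs

corpus = ["your express order has been received",
          "your verification code is 123456"]
pairs = make_training_pairs(corpus, truncate_prob=1.0)
```

A production pipeline would truncate at the character level for Chinese text; word-level splitting is used here only to keep the sketch readable.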
Step S203, training the language model with the training data pairs to obtain the initial text generation model.
The language model mentioned in this step is, simply put, a probability distribution over word sequences. A language model predicts the next word from its context and requires no manual corpus annotation, so it can learn rich semantic knowledge from effectively unlimited monolingual corpora. Training the language model with the training data here is mainly pre-training: instead of randomly initializing the model parameters, a set of parameters is first obtained by training on one task, the model is initialized with those parameters, and training then continues on the target task.
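To make the notion of a language model as a probability distribution concrete, here is a toy bigram model that counts next-word frequencies; a real system would use a neural language model such as GPT rather than this illustrative sketch:

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    # A language model assigns probabilities to word sequences; this toy
    # bigram model estimates P(next_word | current_word) by counting.
    counts = defaultdict(Counter)
    for s in sentences:
        words = s.split()
        for cur, nxt in zip(words, words[1:]):
            counts[cur][nxt] += 1
    return counts

def predict_next(counts, word):
    # Most probable continuation of `word` under the counted distribution.
    return counts[word].most_common(1)[0][0] if counts[word] else None

lm = train_bigram_lm(["the code is 123456", "the code is valid"])
```

The same predict-the-next-token idea, scaled up with a Transformer decoder, underlies the GPT pre-training described below.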
In an optional embodiment, in step S203, the language model is trained with the training data using the GPT (Generative Pre-Training) method: the truncated text in each training data pair is taken as input, and its continuation text as output, yielding the initial text generation model.
The GPT method pre-trains the language model in two stages: unsupervised language-model pre-training followed by supervised fine-tuning. In unsupervised pre-training, a standard language-model objective function is used, with a Transformer decoder serving as the language model. Then, for the specific task of constructing the initial text generation model, supervised fine-tuning is performed with the training data: instead of building a new model structure, GPT connects a softmax task output layer directly to the last layer of the Transformer language model and fine-tunes the whole model, yielding the initial text generation model.
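The "softmax task output layer connected to the last layer" can be illustrated numerically. The feature matrix `h` below stands in for hypothetical Transformer final-layer states; the shapes and random weights are invented purely for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical final-layer features from a Transformer decoder (batch of 2,
# hidden size 4) and a task-specific softmax head over 3 output classes.
rng = np.random.default_rng(0)
h = rng.standard_normal((2, 4))   # stand-in for last-layer hidden states
W = rng.standard_normal((4, 3))   # task output layer weights
probs = softmax(h @ W)            # per-example distribution over classes
```

During fine-tuning, both `W` and the Transformer parameters producing `h` are updated jointly, as the description above notes.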
Because the embodiment of the invention pre-trains the language model with the GPT method when constructing the initial text generation model, it achieves better text-processing performance and can be applied to large data sets.
The pre-trained sentence discrimination model mentioned in step S103 may be, for example, a logistic regression model; embodiments of the invention are not limited in this respect.
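A minimal sketch of such a logistic regression discriminator, trained by plain gradient descent, might look like the following; the toy features (sentence length and an ends-with-digit flag) and the labels are illustrative assumptions, not the patent's feature design:

```python
import numpy as np

def train_logreg(X, y, lr=0.5, steps=500):
    # Minimal logistic-regression discriminator; X holds sentence features,
    # y marks real (1) vs. generated/implausible (0) sentences.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad = p - y                            # gradient of log-loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def score(w, b, X):
    # Discrimination result in [0, 1]: plausibility of each sentence.
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Toy features: [sentence length, ends-with-digit flag]
X = np.array([[5.0, 1.0], [6.0, 1.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logreg(X, y)
```

The score produced here plays the role of the "discrimination result" compared against the preset threshold later in the method.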
In step S104, whether to adjust the model parameters of the initial text generation model is determined according to the discrimination result, to obtain the target text generation model. Specifically, if it is determined from the discrimination result that the model parameters of the initial text generation model need to be adjusted, the adjusted model is taken as the target text generation model; if no adjustment is needed, the initial text generation model is directly taken as the target text generation model.
When adjusting the model parameters of the initial text generation model, the embodiment of the invention may use stochastic gradient descent with the basic back-propagation algorithm of deep learning, or a variant thereof such as the Adam optimization algorithm; embodiments of the invention are not limited in this respect.
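For reference, a single Adam update step can be written out explicitly; the quadratic objective below is only a toy target for demonstrating the update rule, not the model's actual loss:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update: exponential moving averages of the gradient (m) and
    # squared gradient (v), bias correction, then a per-parameter step.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy objective f(w) = ||w||^2 (gradient 2w) for a few steps.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
```

Plain SGD would replace the whole moment machinery with `w -= lr * grad`, which is the "basic" alternative the text mentions.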
In an optional embodiment of the invention, after the truncated text to be spliced and its continuation text are spliced in step S105, the resulting target sentence may be input into the sentence discrimination model, and whether to take the target sentence as the splicing result is determined according to the discrimination result, further improving the accuracy of the splicing result.
Further, if the discrimination result is greater than or equal to a preset threshold, the target sentence is taken as the splicing result. If the discrimination result is below the threshold, the target sentence is not taken as the splicing result; instead, an adjustment of the model parameters of the target text generation model is prompted (or the parameters are adjusted directly), the adjusted text generation model is used to regenerate the continuation text for the truncated text to be spliced, and the splicing is repeated, thereby improving the accuracy of the splicing result.
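The threshold test described here can be sketched as a small acceptance function; the function name, default threshold value, and return convention are illustrative assumptions:

```python
def accept_splice(target_sentence, score, threshold=0.9):
    # If the discrimination score clears the threshold, keep the spliced
    # sentence; otherwise reject it and signal that the generation model's
    # parameters should be adjusted before regenerating and re-splicing.
    if score >= threshold:
        return target_sentence, False   # accepted, no adjustment needed
    return None, True                   # rejected, adjust and retry

sentence = "your verification code is 123456"
kept, needs_adjustment = accept_splice(sentence, score=0.95)
```

The boolean flag models the "prompt to adjust the model parameters" branch; in practice it would trigger another optimizer pass over the generation model.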
When adjusting the model parameters of the target text generation model, the embodiment of the invention may likewise use stochastic gradient descent with the basic back-propagation algorithm of deep learning, or a variant thereof such as the Adam optimization algorithm; embodiments of the invention are not limited in this respect.
It should be noted that, in practical applications, all the above optional embodiments may be combined in a combined manner at will to form an optional embodiment of the present invention, and details are not described here any more.
Based on the truncated-text splicing method provided by the above embodiments, and following the same inventive concept, an embodiment of the invention further provides a truncated-text splicing apparatus.
Fig. 3 is a schematic structural diagram of a truncated-text splicing apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus may include a building module 310, a first processing module 320, a second processing module 330, and a third processing module 340.
The functions of the components of the truncated-text splicing apparatus according to the embodiment of the present invention, and the connections between them, are now described:
a building module 310 adapted to build an initial text generation model;
the first processing module 320, coupled to the building module 310, is adapted to input an arbitrary truncated text into the initial text generation model, generate the continuation text corresponding to that truncated text, and splice the truncated text with its continuation text to obtain a sentence to be discriminated;
the second processing module 330, coupled to the first processing module 320, is adapted to input the sentence to be discriminated into the pre-trained sentence discrimination model to obtain a discrimination result, and to determine, according to the discrimination result, whether to adjust the model parameters of the initial text generation model, to obtain a target text generation model;
the third processing module 340, coupled to the second processing module 330, is adapted to input the truncated text to be spliced into the target text generation model, generate the corresponding continuation text, and splice the truncated text to be spliced with its continuation text.
In an alternative embodiment of the invention, the building module 310 is further adapted to:
collecting Chinese text corpora;
truncating sentences in the Chinese text corpus with a probability given by a preset proportion, to obtain training data pairs each containing a truncated text and its corresponding continuation text;
and training a language model with the training data pairs to obtain the initial text generation model.
In an alternative embodiment of the invention, the building module 310 is further adapted to:
training the language model with the Generative Pre-Training (GPT) method, taking the truncated text in each training data pair as input and its continuation text as output, to obtain the initial text generation model.
In an alternative embodiment of the invention, the second processing module 330 is further adapted to:
if it is determined from the discrimination result that the model parameters of the initial text generation model need to be adjusted, taking the adjusted model as the target text generation model;
and if it is determined from the discrimination result that no adjustment is needed, directly taking the initial text generation model as the target text generation model.
In an alternative embodiment of the present invention, the third processing module 340 is further adapted to:
after the truncated text to be spliced and its continuation text are spliced, inputting the target sentence obtained by the splicing into the sentence discrimination model, and determining, according to the discrimination result, whether to take the target sentence as the splicing result.
In an alternative embodiment of the present invention, the third processing module 340 is further adapted to:
and if the discrimination result is greater than or equal to the preset threshold, taking the target sentence as the splicing result.
Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium, where computer program codes are stored, and when the computer program codes are run on a computing device, the computer storage medium causes the computing device to execute the method for splicing truncated texts.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including: a processor; a memory storing computer program code; the computer program code, when executed by the processor, causes the computing device to perform the above-described method of splicing truncated texts.
It is clear to those skilled in the art that the specific working processes of the above-described systems, devices, units and modules may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, no further description is provided herein.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium, which includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), and a magnetic or optical disk.
Alternatively, all or part of the steps of the foregoing method embodiments may be implemented by program instructions executed on associated hardware (a computing device such as a personal computer, server, or network device). The program instructions may be stored in a computer-readable storage medium, and when executed by a processor of the computing device, cause the computing device to execute all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. While the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that, within the spirit and principle of the present invention, the technical solutions described in the foregoing embodiments may be modified, or some or all of their technical features may be equivalently replaced; such modifications or substitutions do not depart from the scope of the present invention.

Claims (10)

1. A splicing method of truncated texts is characterized by comprising the following steps:
constructing an initial text generation model;
inputting an arbitrary truncated text into the initial text generation model, generating the continuation text corresponding to that truncated text, and splicing the truncated text with its continuation text to obtain a sentence to be discriminated;
inputting the sentence to be discriminated into a pre-trained sentence discrimination model to obtain a discrimination result;
determining, according to the discrimination result, whether to adjust the model parameters of the initial text generation model, to obtain a target text generation model;
inputting the truncated text to be spliced into the target text generation model, generating the continuation text corresponding to the truncated text to be spliced, and splicing the truncated text to be spliced with its continuation text.
2. The method of claim 1, wherein building an initial text generation model comprises:
collecting Chinese text corpora;
truncating sentences in the Chinese text corpus with a probability given by a preset proportion, to obtain training data pairs each containing a truncated text and its corresponding continuation text;
and training a language model with the training data pairs to obtain the initial text generation model.
3. The method of claim 2, wherein training a language model using the training data to obtain an initial text generation model comprises:
training the language model with the Generative Pre-Training (GPT) method, taking the truncated text in each training data pair as input and its continuation text as output, to obtain the initial text generation model.
4. The method of claim 1, wherein determining whether to adjust the model parameters of the initial text generation model according to the discrimination result, to obtain a target text generation model, comprises:
if it is determined from the discrimination result that the model parameters of the initial text generation model need to be adjusted, taking the adjusted model as the target text generation model;
and if it is determined from the discrimination result that no adjustment is needed, directly taking the initial text generation model as the target text generation model.
5. The method according to claim 1, wherein after the truncated text to be spliced and its corresponding continuation text are spliced, the method further comprises:
inputting the target sentence obtained by the splicing into the sentence discrimination model, and determining, according to the discrimination result, whether to take the target sentence as the splicing result.
6. The method of claim 5, wherein determining whether to use the target sentence as the splicing result according to the discrimination result comprises:
if the discrimination result is greater than or equal to a preset threshold, taking the target sentence as the splicing result.
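Claims 5 and 6 together amount to a splice-then-verify step, which might look like the following. Here `discriminator` is any callable returning a fluency score; the score range, the threshold value, and all names are illustrative assumptions.

```python
def splice_and_verify(truncated, continuation, discriminator,
                      threshold=0.5):
    """Sketch of claims 5-6: splice the generated continuation onto the
    truncated text, score the result with the sentence discrimination
    model, and keep it only if the score reaches a preset threshold.
    `discriminator` is any callable returning a score (assumed in
    [0, 1]); names and the threshold value are illustrative."""
    target_sentence = truncated + continuation
    score = discriminator(target_sentence)
    return target_sentence if score >= threshold else None
```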
7. The method according to any one of claims 2-6, wherein the preset proportion is determined according to the proportion of truncated texts to be spliced in the target set of texts to be processed.
8. An apparatus for splicing truncated text, comprising:
a building module adapted to build an initial text generation model;
a first processing module adapted to input an arbitrarily truncated text into the initial text generation model, generate a continuation text corresponding to the arbitrarily truncated text, and splice the arbitrarily truncated text with the corresponding continuation text to obtain a sentence to be discriminated;
a second processing module adapted to input the sentence to be discriminated into a pre-trained sentence discrimination model to obtain a discrimination result, and to determine according to the discrimination result whether to adjust the model parameters of the initial text generation model to obtain a target text generation model;
and a third processing module adapted to input the truncated text to be spliced into the target text generation model, generate a continuation text corresponding to the truncated text to be spliced, and splice the truncated text to be spliced with the corresponding continuation text.
9. A computer storage medium having computer program code stored thereon which, when run on a computing device, causes the computing device to perform the method of splicing truncated texts according to any one of claims 1-7.
10. A computing device, comprising: a processor; a memory storing computer program code; the computer program code, when executed by the processor, causes the computing device to perform the method of splicing truncated texts according to any one of claims 1-7.
CN201910739896.8A 2019-08-12 2019-08-12 Splicing method and device of truncated text and computer storage medium Active CN110598205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910739896.8A CN110598205B (en) 2019-08-12 2019-08-12 Splicing method and device of truncated text and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910739896.8A CN110598205B (en) 2019-08-12 2019-08-12 Splicing method and device of truncated text and computer storage medium

Publications (2)

Publication Number Publication Date
CN110598205A true CN110598205A (en) 2019-12-20
CN110598205B CN110598205B (en) 2021-08-17

Family

ID=68854044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910739896.8A Active CN110598205B (en) 2019-08-12 2019-08-12 Splicing method and device of truncated text and computer storage medium

Country Status (1)

Country Link
CN (1) CN110598205B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105897875A (en) * 2016-04-01 2016-08-24 乐视控股(北京)有限公司 Text truncating method, text uploading method, text truncating device, and text uploading device
CN107193792A (en) * 2017-05-18 2017-09-22 北京百度网讯科技有限公司 The method and apparatus of generation article based on artificial intelligence
US20170358295A1 (en) * 2016-06-10 2017-12-14 Conduent Business Services, Llc Natural language generation, a hybrid sequence-to-sequence approach
CN108959242A (en) * 2018-05-08 2018-12-07 中国科学院信息工程研究所 A kind of target entity recognition methods and device based on Chinese character part of speech feature
CN109388743A (en) * 2017-08-11 2019-02-26 阿里巴巴集团控股有限公司 The determination method and apparatus of language model


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523324A (en) * 2020-03-18 2020-08-11 大箴(杭州)科技有限公司 Training method and device for named entity recognition model
CN111523308A (en) * 2020-03-18 2020-08-11 大箴(杭州)科技有限公司 Chinese word segmentation method and device and computer equipment
CN111523324B (en) * 2020-03-18 2024-01-26 大箴(杭州)科技有限公司 Named entity recognition model training method and device
CN111523308B (en) * 2020-03-18 2024-01-26 大箴(杭州)科技有限公司 Chinese word segmentation method and device and computer equipment

Also Published As

Publication number Publication date
CN110598205B (en) 2021-08-17

Similar Documents

Publication Publication Date Title
US9152622B2 (en) Personalized machine translation via online adaptation
Furlan et al. Semantic similarity of short texts in languages with a deficient natural language processing support
CN106202010A (en) The method and apparatus building Law Text syntax tree based on deep neural network
KR20180099822A (en) Dependency Parsing of Text Segments Using Neural Networks
Mehra et al. Sentimental analysis using fuzzy and naive bayes
EP2950306A1 (en) A method and system for building a language model
CN110598205B (en) Splicing method and device of truncated text and computer storage medium
Almutiri et al. Markov models applications in natural language processing: a survey
CN110738056B (en) Method and device for generating information
Hayati et al. Doc2Vec & Naïve Bayes: Learners' Cognitive Presence Assessment through Asynchronous Online Discussion TQ Transcripts.
CN113791757A (en) Software requirement and code mapping method and system
Balli et al. Sentimental analysis of Twitter users from Turkish content with natural language processing
Kathuria et al. Real time sentiment analysis on twitter data using deep learning (Keras)
CN117271736A (en) Question-answer pair generation method and system, electronic equipment and storage medium
CN113127643A (en) Deep learning rumor detection method integrating microblog themes and comments
CN112100355A (en) Intelligent interaction method, device and equipment
CN116562240A (en) Text generation method, computer device and computer storage medium
Dai et al. Building context-aware clause representations for situation entity type classification
CN116127983A (en) Text encoding method, apparatus, electronic device and storage medium
CN115577109A (en) Text classification method and device, electronic equipment and storage medium
CN111753540B (en) Method and system for collecting text data to perform Natural Language Processing (NLP)
CN112434518B (en) Text report scoring method and system
JP2018151892A (en) Model learning apparatus, information determination apparatus, and program therefor
Munandar et al. POS-tagging for non-english tweets: An automatic approach:(Study in Bahasa Indonesia)
CN115270805A (en) Semantic information extraction method of service resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant