CN116561298A - Title generation method, device, equipment and storage medium based on artificial intelligence - Google Patents

Title generation method, device, equipment and storage medium based on artificial intelligence

Info

Publication number
CN116561298A
Authority
CN
China
Prior art keywords
title
text data
data
training
preset
Prior art date
Legal status
Pending
Application number
CN202310526678.2A
Other languages
Chinese (zh)
Inventor
Zheng Wenjun (郑文俊)
Wei Sisi (魏思思)
Wei Wei (魏巍)
Liang Shuo (梁硕)
Lu Zhenyi (鲁镇仪)
Current Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202310526678.2A
Publication of CN116561298A


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/34 - Browsing; Visualisation therefor
    • G06F 16/345 - Summarisation for human users
    • G06F 16/33 - Querying
    • G06F 16/335 - Filtering based on additional data, e.g. user or group profiles
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/258 - Heading extraction; Automatic titling; Numbering
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application belong to the field of artificial intelligence and relate to an artificial intelligence based title generation method comprising the following steps: acquiring a pre-acquired title data set; cleaning the titles in the title data set based on multiple data cleaning rules to obtain target text data; training a pre-trained language model with the target text data as training data, based on an adversarial training strategy and a noise reduction strategy, to obtain a title generation model; acquiring a target text to be identified; and inputting the target text into the title generation model, which processes the target text to generate the corresponding target title. The application also provides an artificial intelligence based title generation device, a computer device and a storage medium. In addition, the application relates to blockchain technology, in which target titles may be stored. Using the title generation model, the method and the device quickly and accurately generate the target title corresponding to the target text to be identified, and the accuracy of the generated target title is guaranteed.

Description

Title generation method, device, equipment and storage medium based on artificial intelligence
Technical Field
The present disclosure relates to the field of artificial intelligence development, and in particular, to an artificial intelligence-based title generation method, apparatus, computer device, and storage medium.
Background
With the explosive growth of internet information, the huge volume of online text poses new challenges for automated information processing. Among these tasks, title generation for text is an important part of information processing, because it compresses information, simplifies text, and distills the subject matter.
At present, the problem of automatically generating open-domain article titles is generally handled with traditional rule-based methods such as word weighting. These traditional methods are fast, but their algorithms are simple: they struggle to achieve good results on natural text and cannot guarantee the accuracy of the generated titles.
Disclosure of Invention
The embodiments of the application aim to provide an artificial intelligence based title generation method, apparatus, computer device and storage medium, so as to solve the technical problem that existing rule-based title generation methods, such as word-weight rules, rely on simple algorithms, struggle to perform well on natural text, and cannot guarantee the accuracy of title generation.
In order to solve the above technical problems, the embodiments of the present application provide an artificial intelligence based title generation method, which adopts the following technical solution:
acquiring a pre-acquired title data set; the title data set comprises a plurality of pieces of data, and each piece of data comprises a title and an article;
cleaning the titles in the title data set based on a plurality of preset data cleaning rules to obtain cleaned target text data;
training a pre-trained language model by using the target text data as training data based on a preset adversarial training strategy and a noise reduction strategy to obtain a title generation model; wherein the pre-trained language model is a language model adopting a T5 encoder-decoder architecture;
acquiring a target text to be identified;
and inputting the target text into the title generation model, and processing the target text through the title generation model to generate a target title corresponding to the target text.
Further, the step of training the pre-trained language model by using the target text data as training data based on a preset adversarial training strategy and a noise reduction strategy to obtain a title generation model specifically includes:
performing an expansion search on the target text data by using a preset adversarial training algorithm to obtain an adversarial sample;
masking the adversarial sample to obtain a mask training sample;
inputting the mask training sample into the pre-training language model, and training the pre-training language model by using the mask training sample to obtain a first pre-training language model;
adjusting parameters of the first pre-training language model based on a preset trust loss function until the trust loss function meets a preset convergence condition to obtain a trained second pre-training language model;
and taking the second pre-training language model as the title generation model.
Further, the step of cleaning the titles in the title data set based on the preset multiple data cleaning rules to obtain cleaned target text data specifically includes:
cleaning the titles in the title data set based on a preset character cleaning rule to obtain first text data;
cleaning titles in the first text data based on a preset fact consistency rule to obtain second text data;
cleaning the titles in the second text data based on a preset language smoothing rule to obtain third text data;
and taking the third text data as the target text data.
Further, the step of cleaning the titles in the title data set based on a preset character cleaning rule to obtain first text data specifically includes:
deleting a first character which accords with a preset abnormal condition in a title contained in the title data set to obtain processed fourth text data;
replacing a second character which accords with a preset character conversion condition in a title contained in the fourth text data to obtain fifth text data;
acquiring the title length of a title in the fifth text data, and screening a specified title with the title length within a preset first length range from the fifth text data;
and screening sixth text data corresponding to the specified title from the fifth text data to obtain the first text data.
Further, the step of cleaning the title in the first text data based on the preset fact consistency rule to obtain second text data specifically includes:
for each piece of first text data, screening out seventh text data whose longest common subsequence length between title and article is smaller than a preset first length threshold from the first text data;
deleting the seventh text data from the first text data to obtain eighth text data;
for each piece of eighth text data, screening out ninth text data in which the number of title entities that do not appear in the corresponding article is greater than a preset first number threshold from the eighth text data;
and deleting the ninth text data from the eighth text data to obtain the second text data.
Further, the step of cleaning the title in the second text data based on the preset language smoothing rule to obtain third text data specifically includes:
for each piece of second text data, screening out first specified text data whose title length is greater than a preset second length threshold from the second text data;
removing the first specified text data from the second text data to obtain second specified text data;
for each piece of second specified text data, screening out third specified text data whose number of punctuation marks is greater than a preset second number threshold from the second specified text data;
removing the third specified text data from the second specified text data to obtain fourth specified text data;
for each piece of fourth specified text data, obtaining fifth specified text data whose repeated-substring ratio in the title meets a preset proportion threshold from the fourth specified text data;
and removing the fifth specified text data from the fourth specified text data to obtain the third text data.
Further, after the step of inputting the target text into the title generation model and processing the target text by the title generation model to generate a target title corresponding to the target text, the method further comprises:
performing character cleaning on the target title based on a preset cleaning rule to obtain a first title;
converting the first title based on a preset processing specification to obtain a second title;
and storing the second title.
In order to solve the above technical problems, the embodiments of the present application further provide an artificial intelligence based title generation device, which adopts the following technical solution:
the first acquisition module is used for acquiring a pre-acquired title data set; the title data set comprises a plurality of pieces of data, and each piece of data comprises a title and an article;
the first cleaning module is used for cleaning the titles in the title data set based on a plurality of preset data cleaning rules to obtain cleaned target text data;
the training module is used for training the pre-trained language model by using the target text data as training data based on a preset adversarial training strategy and a noise reduction strategy to obtain a title generation model; wherein the pre-trained language model is a language model adopting a T5 encoder-decoder architecture;
the second acquisition module is used for acquiring a target text to be identified;
and the generation module is used for inputting the target text into the title generation model, processing the target text through the title generation model and generating a target title corresponding to the target text.
In order to solve the above technical problems, the embodiments of the present application further provide a computer device, which adopts the following technical solution:
acquiring a pre-acquired title data set; the title data set comprises a plurality of pieces of data, and each piece of data comprises a title and an article;
cleaning the titles in the title data set based on a plurality of preset data cleaning rules to obtain cleaned target text data;
training a pre-trained language model by using the target text data as training data based on a preset adversarial training strategy and a noise reduction strategy to obtain a title generation model; wherein the pre-trained language model is a language model adopting a T5 encoder-decoder architecture;
acquiring a target text to be identified;
and inputting the target text into the title generation model, and processing the target text through the title generation model to generate a target title corresponding to the target text.
In order to solve the above technical problems, embodiments of the present application further provide a computer readable storage medium, which adopts the following technical solutions:
acquiring a pre-acquired title data set; the title data set comprises a plurality of pieces of data, and each piece of data comprises a title and an article;
cleaning the titles in the title data set based on a plurality of preset data cleaning rules to obtain cleaned target text data;
training a pre-trained language model by using the target text data as training data based on a preset adversarial training strategy and a noise reduction strategy to obtain a title generation model; wherein the pre-trained language model is a language model adopting a T5 encoder-decoder architecture;
acquiring a target text to be identified;
and inputting the target text into the title generation model, and processing the target text through the title generation model to generate a target title corresponding to the target text.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
in the embodiments of the application, a pre-acquired title data set is first obtained; the titles in the title data set are then cleaned based on a plurality of preset data cleaning rules to obtain cleaned target text data; a pre-trained language model is then trained with the target text data as training data, based on a preset adversarial training strategy and a noise reduction strategy, to obtain a title generation model; a target text to be identified is subsequently acquired; and finally the target text is input into the title generation model and processed by it to generate the corresponding target title. Building on an existing end-to-end pre-trained language model architecture, both the training data and the model algorithm are improved: the titles in the title data set are cleaned with multiple data cleaning rules to obtain a high-quality title training data set, and the pre-trained language model is then trained with the improved adversarial training and noise reduction strategies. The required title generation model can thus be produced quickly and accurately, its robustness for open-domain title generation is effectively improved, and its noise reduction capability is enhanced; using the title generation model, the target title corresponding to the target text to be identified is generated quickly and accurately, and the accuracy of the generated target title is guaranteed.
Drawings
For a clearer description of the solution in the present application, a brief description will be given below of the drawings that are needed in the description of the embodiments of the present application, it being obvious that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of an artificial intelligence based title generation method according to the present application;
FIG. 3 is a schematic diagram of one embodiment of an artificial intelligence based title generation apparatus according to the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the title generating method based on artificial intelligence provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the title generating device based on artificial intelligence is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of an artificial intelligence based title generation method according to the present application is shown. The title generation method based on artificial intelligence comprises the following steps:
Step S201, acquiring a pre-acquired title data set; wherein the title dataset comprises a plurality of pieces of data, each piece of data comprising a title and an article.
In this embodiment, the electronic device (e.g., the server/terminal device shown in fig. 1) on which the artificial intelligence based title generation method runs may acquire the title data set through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G, WiFi, Bluetooth, WiMAX, Zigbee, UWB (ultra wideband) and other connections now known or later developed. The title data set specifically refers to a Chinese open-domain title data set covering politics, economics, society, technology, finance, sports and other fields. Obtaining the Chinese open-domain title data may include: acquiring a plurality of preset text acquisition channels; acquiring an initial Chinese open-domain title data set from each text acquisition channel; and integrating all the initial Chinese open-domain title data sets to obtain the Chinese open-domain title data set. The text acquisition channels may at least include LCSTS, RASG, THUCNews, NLPCC, CNewSum, news163, cartata, and the like. The Chinese open-domain title data set may at least include Weibo news headline data sets, Sina news data from 2005-2011, NLPCC single-document summarization competition data, Toutiao data, extended versions of NLPCC, internet news headline data, good driver news headline data, and the like.
Step S202, cleaning the titles in the title data set based on a plurality of preset data cleaning rules to obtain cleaned target text data.
In this embodiment, by cleaning the titles in the title data set, but not cleaning the articles in the title data set, the robustness of the model to the processing of the articles in the open domain can be effectively improved. The specific implementation process of cleaning the titles in the title data set based on the preset multiple data cleaning rules to obtain the cleaned target text data will be described in further detail in the following specific embodiments, which will not be described in any more detail herein.
Step S203, training a pre-trained language model by using the target text data as training data based on a preset adversarial training strategy and a noise reduction strategy to obtain a title generation model; wherein the pre-trained language model is a language model employing a T5 encoder-decoder architecture.
In this embodiment, the pre-trained language model adopts the T5 encoder-decoder architecture. T5 is built on the Transformer model, stacking attention blocks and linear layer blocks. The encoder aims to obtain hidden feature representations of the input text; it mainly uses a bidirectional self-attention mechanism and feed-forward linear layers, together with layer normalization and residual connections. The decoder maps the hidden features back to the text space and outputs the predicted text; its basic architecture is similar to the encoder's, except that the attention mechanisms used are masked self-attention and cross-attention. The T5 model has strong text generation capability and is widely applied in the natural language field. However, since the original model is an English model, it has not been widely applied to Chinese open-domain title generation scenarios. The application trains a title generation model for the Chinese domain by organizing and cleaning a large-scale Chinese open-domain title data set, and optimizes the model for fact consistency and language smoothness. During training, the article for which a title is to be generated is input at the encoder, the reference title is input at the decoder, the model generates a title from the article, and the cross entropy loss between the generated title and the reference title is computed for training. The specific implementation of training the pre-trained language model with the target text data as training data based on the preset adversarial training strategy and noise reduction strategy to obtain the title generation model will be described in further detail in the following specific embodiments and is not detailed here.
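As an illustrative sketch of this training step (not taken from the patent; it assumes the Hugging Face transformers library, with the public multilingual mT5 checkpoint standing in for a Chinese-capable T5 model), one fine-tuning step looks like:

```python
# Article in at the encoder, reference title as the decoder target;
# transformers computes the token-level cross entropy automatically
# when `labels` is supplied. Model choice and lengths are assumptions.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

article = "..."          # encoder input: the article to be titled
reference_title = "..."  # decoder target: the reference title

inputs = tokenizer(article, return_tensors="pt",
                   truncation=True, max_length=512)
labels = tokenizer(reference_title, return_tensors="pt",
                   truncation=True, max_length=48).input_ids

loss = model(**inputs, labels=labels).loss  # cross entropy vs. reference
loss.backward()
optimizer.step()
optimizer.zero_grad()
```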
Step S204, obtaining a target text to be recognized.
In this embodiment, the target text is text data of a title to be identified.
Step S205, inputting the target text into the title generation model, and processing the target text through the title generation model to generate a target title corresponding to the target text.
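Continuing the illustrative sketch above (same assumed model and tokenizer), inference then reduces to a single generate call; the decoding parameters are illustrative, not values from the patent:

```python
# Generate a title for a new target text with beam search.
target_text = "..."  # the target text to be identified
inputs = tokenizer(target_text, return_tensors="pt",
                   truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_length=48, num_beams=4,
                            no_repeat_ngram_size=3)
target_title = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(target_title)
```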
In this embodiment, a pre-acquired title data set is first obtained; the titles in the title data set are then cleaned based on a plurality of preset data cleaning rules to obtain cleaned target text data; a pre-trained language model is then trained with the target text data as training data, based on a preset adversarial training strategy and a noise reduction strategy, to obtain a title generation model; a target text to be identified is subsequently acquired; and finally the target text is input into the title generation model and processed by it to generate the corresponding target title. Building on an existing end-to-end pre-trained language model architecture, both the training data and the model algorithm are improved: the titles in the title data set are cleaned with multiple data cleaning rules to obtain a high-quality title training data set, and the pre-trained language model is then trained with the improved adversarial training and noise reduction strategies. The required title generation model can thus be produced quickly and accurately, its robustness for open-domain title generation is effectively improved, and its noise reduction capability is enhanced; using the title generation model, the target title corresponding to the target text to be identified is generated quickly and accurately, and the accuracy of the generated target title is guaranteed.
In some alternative implementations, experiments show that the title generation method of the application achieves better results than existing methods. The experiments are evaluated on LCSTS (Large-scale Chinese Short Text Summarization), a benchmark open-domain text generation dataset. The detailed corpus statistics are shown in Table 1; the application uses the test set portion for experimental verification.
                         Training set    Validation set    Test set
Number of samples        2400591         8685              725
Average text length      103.68          107.86            108.10
Average summary length   17.86           18.15             18.66
TABLE 1
The experiments aim to evaluate the effectiveness of the title generation model proposed in the application, so the application uses the standard F1 scores of ROUGE-1, ROUGE-2, ROUGE-L and BLEU for Chinese text summarization as the evaluation metrics of the model; the experimental comparison results are given in Table 2.
TABLE 2
Notably, it can be observed from Table 2 that the title generation method proposed in the application consistently outperforms all of these baseline models on a number of automated metrics. Here, T5-Origin is the officially released multilingual T5 model, T5-Finetune is a T5 model simply fine-tuned on the LCSTS dataset, and T5-Copy is a T5 model with an added copy mechanism. As the table shows, on the same T5 baseline the proposed method far exceeds T5-Origin, T5-Finetune and T5-Copy, demonstrating that (1) the high-quality Chinese open-domain title training dataset proposed in the application is effective for this task; (2) the training algorithm optimizations proposed in the application are effective for this task: although introducing a copy mechanism can improve performance, the resulting repeated words reduce the fluency of the text, whereas the adversarial training strategy, mask generation, trust cross entropy loss and related methods give the model robustness and noise reduction capability for open-domain generation; (3) the comparison experiments also show that the proposed heuristic post-processing method further improves the consistency between the model's output and the original text.
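For readers reproducing the evaluation, metric computation along these lines can be sketched with the rouge-score package (an assumed tooling choice, not the patent's code); because that scorer tokenizes on whitespace, Chinese text is pre-split into space-separated characters:

```python
from rouge_score import rouge_scorer

def rouge_f1(reference_title, generated_title):
    # space-separate the characters so the whitespace tokenizer
    # produces character-level n-grams for Chinese
    ref = " ".join(reference_title)
    hyp = " ".join(generated_title)
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    scores = scorer.score(ref, hyp)
    return {name: s.fmeasure for name, s in scores.items()}
```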
In some alternative implementations, step S203 includes the steps of:
And performing an expansion search on the target text data by using a preset adversarial training algorithm to obtain an adversarial sample.
In this embodiment, adversarial learning is a common method for improving model robustness: a tiny noise disturbance is added to the input data to increase the difficulty of fitting. In this way, the model learns to resist small random perturbations, which reduces its sensitivity to inputs and parameters, keeps it from depending too heavily on the training set, and improves its generalization in open-domain scenarios. Specifically, the adversarial training algorithm uses the SMART algorithm, which adds two regularization losses: a smoothness-inducing adversarial regularization loss and a Bregman proximal point optimization loss. The smoothness-inducing adversarial regularization loss requires the model to output consistent results within a certain perturbation range, avoiding overfitting and improving generalization. The Bregman proximal point optimization keeps the model parameter distribution close to the initial distribution of each round, avoiding forgetting of the pre-trained parameters and preserving as much pre-trained knowledge as possible. Concretely, the smoothness regularizer adds a small perturbation to the model's embedding layer and requires the resulting output to remain as unchanged as possible: let x_i denote an input embedding, x̃_i the perturbed input embedding with ‖x̃_i - x_i‖ ≤ ε, f the model, and SymKL the symmetric KL divergence; model training minimizes L_s = Σ_i max over ‖x̃_i - x_i‖ ≤ ε of SymKL(f(x̃_i), f(x_i)). The Bregman proximal point optimization then prevents the parameters θ_(t+1) of the next iteration from deviating too far from the parameters θ_t of the previous iteration, by minimizing L_Breg = SymKL(f(x; θ_(t+1)), f(x; θ_t)).
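A sketch of the two SMART-style regularizers in PyTorch, assuming a model_fn callable that maps input embeddings to output logits; the single gradient-ascent step is a simplification of the full inner maximization, and all step sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def sym_kl(p_logits, q_logits):
    # SymKL(p, q) = KL(p||q) + KL(q||p), computed from logits
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    return (F.kl_div(q_log, p_log.exp(), reduction="batchmean")
            + F.kl_div(p_log, q_log.exp(), reduction="batchmean"))

def smoothness_loss(model_fn, embeds, eps=1e-5, step=1e-3):
    # smoothness-inducing adversarial loss: perturb the embedding layer,
    # take one ascent step toward the worst-case perturbation, then
    # penalise any change in the output distribution
    noise = (torch.randn_like(embeds) * eps).requires_grad_(True)
    clean_logits = model_fn(embeds).detach()
    grad, = torch.autograd.grad(
        sym_kl(clean_logits, model_fn(embeds + noise)), noise)
    adv_noise = (noise + step * grad.sign()).detach()
    return sym_kl(clean_logits, model_fn(embeds + adv_noise))

def bregman_proximal_loss(model_fn, prev_model_fn, embeds):
    # Bregman proximal point term: keep the current iterate's outputs
    # close to the previous iterate's outputs
    return sym_kl(prev_model_fn(embeds).detach(), model_fn(embeds))
```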
And masking the adversarial sample to obtain a mask training sample.
And inputting the mask training sample into the pre-trained language model, and training the pre-trained language model by using the mask training sample to obtain a first pre-trained language model.
In this embodiment, borrowing the masking task used when pre-training the language model, random span masking is applied to the decoder input during fine-tuning on the Chinese data: span sequences to be replaced in the text string are selected and uniformly replaced with the [MASK] special token, and the pre-trained language model must then recover the original sequence from the encoder-side article and the parts not replaced by the mask, with the loss computed through cross entropy. Artificially introducing mask noise at decoding improves the model's noise reduction and inference capabilities. Also, unlike the original pre-training task, which introduces a mask loss at the encoder side, the articles at the encoder side are not cleaned during data processing and therefore contain more extraneous noise (such as special characters and unusual sentence patterns); mask training on the articles could destabilize generation, so the mask training task is performed at the decoder side.
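A sketch of the decoder-side random span masking, assuming token id sequences and a tokenizer-provided mask id; the span length and masking ratio are illustrative:

```python
import random

def mask_random_spans(token_ids, mask_id, mask_ratio=0.15, max_span=3):
    # uniformly replace random contiguous spans of the reference title
    # with the [MASK] id; the model must then recover the original
    # sequence from the encoder-side article plus the unmasked context,
    # trained with cross entropy
    ids = list(token_ids)
    n = len(ids)
    target = max(1, int(n * mask_ratio))
    masked = 0
    while masked < target:
        span = random.randint(1, max_span)
        start = random.randrange(n)
        for i in range(start, min(start + span, n)):
            if ids[i] != mask_id:
                ids[i] = mask_id
                masked += 1
    return ids
```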
And adjusting the parameters of the first pre-training language model based on a preset trust loss function until the trust loss function meets a preset convergence condition, so as to obtain a trained second pre-training language model.
In this embodiment, the labeled article titles are not necessarily completely correct, and the cross entropy used to train the pre-trained language model pushes the model to output exactly the labeled result, which can make the model fit errors. To solve this problem, note that models generally fit simple correct samples quickly in the early stages of training, while incorrect samples and difficult correct samples are fitted gradually in the later stages. A trust loss function is therefore used in place of the cross entropy function: on top of the original labeling probability distribution q = q(k|x) (where x is the corresponding generation index and k is a vocabulary label), a model-predicted probability term p = p(k|x) is added, achieving a tradeoff between the data annotation and the model's own predictions. Cross entropy is thus replaced by the trust loss function L_DCE = -p log(δp + (1 - δ)q), where δ represents the trust ratio: δ near 1 means the model trusts its own predicted distribution more, and δ near 0 means it trusts the annotated distribution more.
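A sketch of the trust loss in PyTorch, directly implementing L_DCE = -p log(δp + (1-δ)q) with q taken as the one-hot annotation distribution; the padding handling is an assumption:

```python
import torch
import torch.nn.functional as F

def trust_loss(logits, target_ids, delta=0.5, pad_id=0):
    # p: model-predicted distribution, q: one-hot labeled distribution;
    # delta near 1 trusts the model's own prediction, near 0 the labels
    p = F.softmax(logits, dim=-1)
    q = F.one_hot(target_ids, num_classes=logits.size(-1)).float()
    per_token = -(p * torch.log(delta * p + (1.0 - delta) * q + 1e-12)).sum(-1)
    mask = (target_ids != pad_id).float()  # ignore padding positions
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```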
And taking the second pre-training language model as the title generation model.
In the application, an expansion search is first performed on the target text data using a preset adversarial training algorithm to obtain an adversarial sample; the adversarial sample is then masked to obtain a mask training sample; the mask training sample is input into the pre-trained language model, which is trained on it to obtain a first pre-trained language model; the parameters of the first pre-trained language model are subsequently adjusted based on a preset trust loss function until the loss meets a preset convergence condition, yielding a trained second pre-trained language model, which is taken as the title generation model. On the basis of the encoder-decoder architecture of the pre-trained language model, training with the improved adversarial training and noise reduction strategies produces the required title generation model quickly and accurately, effectively improving the robustness and noise reduction capability of the title generation model for open-domain generation.
In some alternative implementations of the present embodiment, step S202 includes the steps of:
and cleaning the titles in the title data set based on a preset character cleaning rule to obtain first text data.
In this embodiment, only the headlines are cleaned, and the articles are not cleaned. The robustness of the model to article handling in the open domain can be improved. The specific implementation process of cleaning the titles in the title data set based on the preset character cleaning rule to obtain the first text data will be described in further detail in the following specific embodiments, which will not be described herein.
And cleaning the titles in the first text data based on a preset fact consistency rule to obtain second text data.
In this embodiment, titles in the data set that are inconsistent with the facts of the articles may be removed based on a fact consistency rule corresponding to the fact consistency requirement. The specific implementation process of cleaning the header in the first text data based on the preset fact consistency rule to obtain the second text data will be described in further detail in the following specific embodiments, which will not be described herein.
And cleaning the titles in the second text data based on a preset language smoothing rule to obtain third text data.
In this embodiment, the title of the language non-specification in the data set may be removed based on the language compliance rule corresponding to the language compliance requirement. The specific implementation process of the third text data obtained by cleaning the header in the second text data based on the preset language compliance rule will be described in further detail in the following specific embodiments, which will not be described herein.
And taking the third text data as the target text data.
According to the method, titles in the title data set are cleaned based on a preset character cleaning rule, so that first text data are obtained; then cleaning titles in the first text data based on a preset fact consistency rule to obtain second text data; then cleaning titles in the second text data based on a preset language smoothing rule to obtain third text data; and taking the third text data as the target text data. After the title data set is collected, the title data set is cleaned by using the character cleaning rule, the fact consistency rule and the language smoothness rule, so that the high-quality open domain title training data set can be quickly and accurately obtained, the generation capacity of an open domain of a title generation model which is generated subsequently is improved from the source, and the fact consistency and the language smoothness are improved.
In some optional implementations, cleaning the titles in the title data set based on a preset character cleaning rule to obtain the first text data includes the following steps:
and deleting the first characters that meet a preset abnormal condition from the titles contained in the title data set to obtain the processed fourth text data.
In this embodiment, the first characters meeting the preset abnormal condition refer, for example, to html characters, emoji symbols, runs of exclamation and question marks, characters from other languages, improper spaces between Chinese characters, and so on.
And replacing the second characters that meet a preset character conversion condition in the titles contained in the fourth text data to obtain fifth text data.
In this embodiment, some characters in the fourth text data are unified and normalized: full-width numbers and letters are all converted to half-width, and Chinese commas, question marks, semicolons, exclamation marks and various brackets are all replaced with their English counterparts, while the Chinese full stop, ellipsis, quotation marks, book title marks and dashes are kept, preserving Chinese semantics to the greatest extent.
And acquiring the title length of each title in the fifth text data, and screening out the specified titles whose title length is within a preset first length range.
In this embodiment, titles that are too short or too long are detrimental to model learning, since a too-short title cannot fully summarize the content of the article while a too-long title may be redundant. Titles are therefore filtered by a preset first length range (e.g., 15-40 words).
And screening out the sixth text data corresponding to the specified titles from the fifth text data to obtain the first text data.
In the application, the first characters meeting a preset abnormal condition are first deleted from the titles in the title data set to obtain the processed fourth text data; the second characters meeting a preset character conversion condition in the titles of the fourth text data are then replaced to obtain fifth text data; the title lengths of the titles in the fifth text data are obtained, and the specified titles whose length falls within a preset first length range are screened out; the sixth text data corresponding to the specified titles is subsequently screened out of the fifth text data to obtain the first text data. Cleaning the collected title data set with the character cleaning rule to obtain the first text data prepares it for further cleaning with the fact consistency rule and the language smoothness rule, so that a high-quality open-domain title training data set is obtained quickly and accurately, improving at the source the open-domain generation capability, fact consistency and language smoothness of the subsequently generated title generation model.
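A sketch of these character cleaning rules in Python; the regular expressions and the 15-40 range are illustrative choices, since the patent does not fix concrete values:

```python
import re

def clean_title_chars(title, min_len=15, max_len=40):
    # rule 1: delete characters meeting "abnormal conditions" - html
    # tags, emoji, runs of exclamation/question marks, spaces wedged
    # between Chinese characters
    title = re.sub(r"<[^>]+>", "", title)
    title = re.sub(r"[\U0001F300-\U0001FAFF]", "", title)  # emoji range
    title = re.sub(r"[!?！？]{3,}", "", title)
    title = re.sub(r"(?<=[\u4e00-\u9fff]) +(?=[\u4e00-\u9fff])", "", title)
    # rule 2: convert full-width ASCII variants (digits, letters, commas,
    # question marks, brackets, ...) in U+FF01..U+FF5E to half-width;
    # the Chinese full stop, ellipsis, quotes and dashes lie outside
    # this block and are therefore preserved
    title = "".join(chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E
                    else c for c in title)
    # rule 3: keep only titles whose length is within the preset range
    return title if min_len <= len(title) <= max_len else None
```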
In some optional implementations, cleaning the titles in the first text data based on a preset fact consistency rule to obtain the second text data includes the following steps:
and for each piece of first text data, screening out seventh text data whose longest common subsequence length between title and article is smaller than a preset first length threshold.
In this embodiment, the longest common subsequence length between the title and the article is calculated. If it is too short, the title is very likely completely unrelated to the article and must be removed. The common subsequence is computed by dynamic programming in O(n²) time. The value of the first length threshold is not specifically limited and may be set according to actual use requirements.
And deleting the seventh text data from the first text data to obtain eighth text data.
And for each piece of eighth text data, screening out ninth text data in which the number of title entities that do not appear in the corresponding article is greater than a preset first number threshold.
In this embodiment, the number of entities in the title that do not appear in the article is checked; if it is too large, the title is very likely inconsistent with the article at the entity level, which reduces fact consistency, so such data must be removed. The value of the first number threshold is not specifically limited and may be set according to actual use requirements.
And deleting the ninth text data from the eighth text data to obtain the second text data.
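A sketch of both fact consistency checks; the dynamic programme is the classic O(n²) LCS, extract_entities is a hypothetical NER callable (not an API defined by the patent), and both thresholds are illustrative:

```python
def lcs_length(title, article):
    # dp[i][j]: longest common subsequence length of title[:i], article[:j]
    m, n = len(title), len(article)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1
                        if title[i - 1] == article[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[m][n]

def passes_fact_checks(title, article, extract_entities,
                       first_length_threshold=8, first_number_threshold=2):
    # check 1: title and article must share a long enough subsequence
    if lcs_length(title, article) < first_length_threshold:
        return False
    # check 2: only a few title entities may be missing from the article;
    # extract_entities(text) -> set of entity strings is assumed to exist
    missing = extract_entities(title) - extract_entities(article)
    return len(missing) <= first_number_threshold
```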
In the application, for each piece of first text data, the seventh text data whose longest common subsequence length between title and article is smaller than a preset first length threshold is first screened out; the seventh text data is then deleted from the first text data to obtain the eighth text data; next, for each piece of eighth text data, the ninth text data in which the number of title entities not appearing in the corresponding article is greater than a preset first number threshold is screened out; and the ninth text data is deleted from the eighth text data to obtain the second text data. Cleaning the first text data with the fact consistency rule after the character cleaning improves, at the source, the open-domain generation capability and fact consistency of the subsequently generated title generation model, and prepares the data for further cleaning with the language smoothness rule, so that a high-quality open-domain title training data set is obtained quickly and accurately.
In some optional implementations of this embodiment, cleaning the titles in the second text data based on the preset language smoothness rule to obtain the third text data includes the following steps:
and for each piece of second text data, screening out the first specified text data whose title length is greater than a preset second length threshold.
In this embodiment, a title that exceeds the average length of a single sentence but contains no pause punctuation is often two sentences run together, or is missing punctuation. Such data is unfavorable for the model to learn to generate well-formed statements, and it is removed directly. The value of the second length threshold is not specifically limited and may be set according to actual use requirements.
And removing the first specified text data from the second text data to obtain the second specified text data.
And for each piece of second specified text data, screening out the third specified text data whose number of punctuation marks is greater than a preset second number threshold.
In this embodiment, a title with too many punctuation marks and too few non-repeating characters, especially Chinese characters, often consists largely of interjections and exclamations; such data is unfavorable for training title generation and is removed directly. The value of the second number threshold is not specifically limited and may be set according to actual use requirements.
And removing the third specified text data from the second specified text data to obtain the fourth specified text data.
And for each piece of fourth specified text data, obtaining the fifth specified text data whose repeated-substring ratio in the title meets a preset proportion threshold.
In this embodiment, a title with a high rate of repeated substrings is also culled, since model generation is expected to maintain diversity. To improve the efficiency of detecting repeated substrings, the text data can be preprocessed with a suffix tree: all distinct repeated substrings are obtained on the suffix tree, filtered according to whether they contain one another, and finally all non-overlapping repeated substrings are processed. The value of the proportion threshold is not specifically limited and may be set according to actual use requirements.
And removing the fifth specified text data from the fourth specified text data to obtain the third text data.
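A sketch of the three language smoothness checks; the repeated-substring scan is a naive version of what the patent does with a suffix tree, and all thresholds are illustrative:

```python
import re

PAUSE_PUNCT = set("，,、；;")

def repeated_substring_ratio(title, min_sub_len=2):
    # fraction of the title covered by substrings occurring more than
    # once (quadratic scan; a suffix tree makes this far more efficient)
    n = len(title)
    covered = [False] * n
    for length in range(min_sub_len, n // 2 + 1):
        first_pos = {}
        for start in range(n - length + 1):
            sub = title[start:start + length]
            if sub in first_pos:
                for k in range(start, start + length):
                    covered[k] = True
                f = first_pos[sub]
                for k in range(f, f + length):
                    covered[k] = True
            else:
                first_pos[sub] = start
    return sum(covered) / n if n else 0.0

def is_smooth(title, second_length_threshold=25,
              second_number_threshold=6, proportion_threshold=0.4):
    # check 1: a long title with no pause punctuation is likely two
    # sentences run together or missing punctuation
    if (len(title) > second_length_threshold
            and not any(c in PAUSE_PUNCT for c in title)):
        return False
    # check 2: too many punctuation marks suggest strings of interjections
    if len(re.findall(r"[，,、；;。！!？?…]", title)) > second_number_threshold:
        return False
    # check 3: heavy internal repetition hurts generation diversity
    return repeated_substring_ratio(title) < proportion_threshold
```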
In the application, for each piece of second text data, the first specified text data whose title length is greater than a preset second length threshold is first screened out and removed from the second text data to obtain the second specified text data; next, for each piece of second specified text data, the third specified text data whose number of punctuation marks is greater than a preset second number threshold is screened out and removed to obtain the fourth specified text data; subsequently, for each piece of fourth specified text data, the fifth specified text data whose repeated-substring ratio in the title meets a preset proportion threshold is obtained and removed to obtain the third text data. After the title data set is cleaned with the character cleaning rule and the fact consistency rule, cleaning the remaining data with the language smoothness rule completes the construction of a high-quality open-domain title training data set quickly and accurately, improving at the source the open-domain generation capability, fact consistency and language smoothness of the subsequently generated title generation model.
In some optional implementations of this embodiment, after step S205, the electronic device may further perform the following steps:
and cleaning the characters of the target title based on a preset cleaning rule to obtain a first title.
In this embodiment, after the title generation model generates the target title for the target text, special-character cleaning and normalization are applied to the generated title using heuristic rules, including the detection and substitution of incorrectly predicted entities and rule-based correction, for the sake of stability and robustness of title generation in open-domain scenarios. The heuristic rules are similar to those used when organizing the large-scale Chinese open-domain data; for the specific contents of the cleaning rules, refer to the data cleaning rules above, which are not repeated here.
And converting the first title based on a preset processing specification to obtain a second title.
In this embodiment, the processing specification may at least include detecting whether the English letter case and spacing are consistent with the article, and otherwise replacing them with the corresponding text from the article; truncating an over-long title at the preceding pause punctuation; and detecting repeated substrings and deleting them accordingly.
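A sketch of the length-related part of this processing specification: truncating an over-long generated title back to the last pause punctuation mark before the limit (the limit itself is illustrative):

```python
import re

def truncate_at_pause(title, max_len=40):
    # keep the title as-is when short enough; otherwise cut it at the
    # last pause punctuation mark that precedes the length limit
    if len(title) <= max_len:
        return title
    head = title[:max_len]
    pauses = list(re.finditer(r"[，,、；;]", head))
    return head[:pauses[-1].start()] if pauses else head
```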
And storing the second title.
In this embodiment, the storage mode of the second header is not limited, and for example, local storage, cloud storage, blockchain storage, and the like may be used.
In the application, the target title is first character-cleaned based on a preset cleaning rule to obtain a first title; the first title is then converted based on a preset processing specification to obtain a second title, which is subsequently stored. After the title generation model generates the target title for the target text, the title is refined with the preset cleaning rule and processing specification, for the sake of stability and robustness of title generation in open-domain scenarios, to produce the final title, further improving the fact consistency and language smoothness of the generated titles.
It is emphasized that the target title may also be stored in a blockchain node in order to further ensure privacy and security of the target title.
The blockchain referred to in the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium that, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an artificial intelligence-based title generation apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 3, the title generating apparatus 300 based on artificial intelligence according to the present embodiment includes: a first acquisition module 301, a first cleaning module 302, a training module 303, a second acquisition module 304, and a generation module 305. Wherein:
a first acquisition module 301, configured to acquire a pre-acquired title data set; the title data set comprises a plurality of pieces of data, and each piece of data comprises a title and an article;
the first cleaning module 302 is configured to clean the titles in the title dataset based on a preset plurality of data cleaning rules, so as to obtain cleaned target text data;
the training module 303 is configured to train the pre-training language model by using the target text data as training data based on a preset countermeasure training strategy and a noise reduction strategy, so as to obtain a title generation model; wherein the pre-trained language model is a language model adopting a T5 encoder-decoder architecture;
a second obtaining module 304, configured to obtain a target text to be identified;
and the generating module 305 is used for inputting the target text into the title generating model, processing the target text through the title generating model, and generating a target title corresponding to the target text.
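By way of illustration only, the inference performed by the generation module 305 could look like the following sketch, built on the Hugging Face transformers library; the checkpoint path "./title-generation-model" and the decoding settings are assumptions, not details given in the application.

    # Illustrative inference step for the generation module; the checkpoint
    # path and decoding parameters are assumptions.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("./title-generation-model")
    model = T5ForConditionalGeneration.from_pretrained("./title-generation-model")

    def generate_title(target_text, max_length=32):
        # Encode the target text, decode a title with beam search.
        inputs = tokenizer(target_text, return_tensors="pt", truncation=True)
        output_ids = model.generate(inputs.input_ids, max_length=max_length, num_beams=4)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)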
In this embodiment, the operations performed by the above modules or units correspond respectively to the steps of the artificial intelligence-based title generation method in the foregoing embodiment, and are not described herein again.
In some alternative implementations of the present embodiment, the training module 303 includes:
the expansion sub-module is used for carrying out expansion search on the target text data by using a preset countermeasure training algorithm to obtain a countermeasure sample;
the processing submodule is used for carrying out mask processing on the countermeasure samples to obtain mask training samples;
the training sub-module is used for inputting the mask training sample into the pre-training language model, and training the pre-training language model by using the mask training sample to obtain a first pre-training language model;
the adjustment sub-module is used for adjusting the parameters of the first pre-training language model based on a preset trust loss function until the trust loss function meets a preset convergence condition to obtain a trained second pre-training language model;
a first determination submodule for taking the second pre-training language model as the title generation model.
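As a concrete illustration of how these sub-modules could cooperate, the following minimal sketch pairs an FGM-style perturbation of the embedding table (one common realization of an adversarial "expansion search"; the application does not name a specific algorithm) with random sentinel masking as the noise-reduction step, and folds the trust-loss adjustment into an ordinary loss for brevity. All names and hyperparameters are illustrative assumptions.

    # A minimal sketch, assuming FGM-style adversarial perturbation and
    # random sentinel masking; neither algorithm is named in the application.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    def mask_tokens(input_ids, mask_prob=0.15):
        # Noise-reduction step: randomly replace tokens with a T5 sentinel.
        masked = input_ids.clone()
        noise = torch.rand(masked.shape, device=masked.device) < mask_prob
        masked[noise] = tokenizer.convert_tokens_to_ids("<extra_id_0>")
        return masked

    def train_step(article_ids, title_ids, epsilon=1.0):
        # 1. Forward/backward pass on the masked (denoised) input.
        noisy_ids = mask_tokens(article_ids)
        loss = model(input_ids=noisy_ids, labels=title_ids).loss
        loss.backward()
        # 2. "Expansion search": perturb the embedding table along the
        #    gradient direction (FGM), accumulate adversarial gradients,
        #    then restore the original embeddings.
        emb = model.get_input_embeddings().weight
        if emb.grad is not None:
            delta = epsilon * emb.grad / (emb.grad.norm() + 1e-8)
            emb.data.add_(delta)
            model(input_ids=noisy_ids, labels=title_ids).loss.backward()
            emb.data.sub_(delta)
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()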
In this embodiment, the operations performed by the above modules or units correspond respectively to the steps of the artificial intelligence-based title generation method in the foregoing embodiment, and are not described herein again.
In some alternative implementations of the present embodiment, the first cleaning module 302 includes:
The first cleaning submodule is used for cleaning the titles in the title data set based on a preset character cleaning rule to obtain first text data;
the second cleaning submodule is used for cleaning the titles in the first text data based on a preset fact consistency rule to obtain second text data;
the third cleaning submodule is used for cleaning the titles in the second text data based on a preset language smoothing rule to obtain third text data;
and the second determining submodule is used for taking the third text data as the target text data.
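Taken together, these sub-modules amount to a sequential pipeline over the title data set, as in the sketch below; the stage functions are illustrative names elaborated in the sketches that follow the next three sub-module descriptions.

    # Hedged sketch of the three-stage cleaning pipeline; the stage
    # function names are illustrative, not part of the application.
    def clean_title_dataset(dataset):
        # dataset: list of {"title": str, "article": str} records
        first_text = clean_characters(dataset)            # character cleaning rule
        second_text = check_fact_consistency(first_text)  # fact consistency rule
        third_text = check_smoothness(second_text)        # language smoothing rule
        return third_text                                 # the target text data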
In this embodiment, the operations performed by the modules or units respectively correspond to the steps of the artificial intelligence-based title generation method in the foregoing embodiment, and are not described herein again.
In some optional implementations of this embodiment, the first cleaning submodule includes:
a first deleting unit, configured to delete a first character that accords with a preset abnormal condition in a title included in the title data set, to obtain processed fourth text data;
a replacing unit, configured to replace a second character that meets a preset character conversion condition in a title included in the fourth text data, to obtain fifth text data;
A first screening unit, configured to obtain the title length of each title in the fifth text data, and screen out, from the fifth text data, a specified title whose title length is within a preset first length range;
and the second screening unit is used for screening sixth text data corresponding to the appointed title from the fifth text data to obtain the first text data.
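A minimal sketch of these four units, assuming control characters as the abnormal condition, full-width-to-half-width punctuation as the conversion condition, and an illustrative length range:

    import re

    def clean_characters(dataset, min_len=5, max_len=40):
        # Sketch of the character cleaning rule; the abnormal-character
        # pattern, conversion table and length range are illustrative.
        fullwidth = str.maketrans({"，": ",", "。": ".", "！": "!", "？": "?"})
        cleaned = []
        for record in dataset:
            # Delete first characters matching the preset abnormal condition.
            title = re.sub(r"[\x00-\x1f\u200b]", "", record["title"])
            # Replace second characters meeting the character conversion condition.
            title = title.translate(fullwidth)
            # Screen titles whose length lies within the preset first length range.
            if min_len <= len(title) <= max_len:
                cleaned.append({"title": title, "article": record["article"]})
        return cleaned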
In this embodiment, the operations performed by the above modules or units correspond respectively to the steps of the artificial intelligence-based title generation method in the foregoing embodiment, and are not described herein again.
In some optional implementations of this embodiment, the second cleaning submodule includes:
a third screening unit, configured to screen out, for each piece of first text data, from the first text data, seventh text data whose maximum common subsequence length between the title and the article is smaller than a preset first length threshold;
a second deleting unit, configured to delete the seventh text data from the first text data, to obtain eighth text data;
a fourth screening unit, configured to screen, for each piece of eighth text data, ninth text data whose number of entities whose titles do not appear in the corresponding articles is greater than a preset first number threshold value from the eighth text data;
And a third deleting unit, configured to delete the ninth text data from the eighth text data, to obtain the second text data.
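A hedged sketch of this fact consistency check; the thresholds are illustrative, and extract_entities stands in for an unspecified named-entity recognition component:

    def lcs_length(a, b):
        # Dynamic-programming length of the longest common subsequence.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
        return dp[len(a)][len(b)]

    def check_fact_consistency(dataset, min_lcs=4, max_missing=1, extract_entities=None):
        # min_lcs and max_missing are illustrative stand-ins for the preset
        # first length threshold and first number threshold.
        kept = []
        for record in dataset:
            title, article = record["title"], record["article"]
            if lcs_length(title, article) < min_lcs:    # drop "seventh text data"
                continue
            entities = extract_entities(title) if extract_entities else []
            if sum(1 for e in entities if e not in article) > max_missing:
                continue                                # drop "ninth text data"
            kept.append(record)
        return kept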
In this embodiment, the operations performed by the above modules or units correspond respectively to the steps of the artificial intelligence-based title generation method in the foregoing embodiment, and are not described herein again.
In some optional implementations of this embodiment, the third cleaning sub-module includes:
a fifth screening unit, configured to screen out, for each piece of second text data, from the second text data, first specified text data whose title length is greater than a preset second length threshold;
the first clearing unit is used for clearing the first specified text data from the second text data to obtain second specified text data;
a sixth screening unit, configured to screen out, for each piece of second specified text data, from the second specified text data, third specified text data whose number of punctuation marks is greater than a preset second number threshold;
the second clearing unit is used for clearing the third specified text data from the second specified text data to obtain fourth specified text data;
an obtaining unit, configured to acquire, for each piece of fourth specified text data, from the fourth specified text data, fifth specified text data in which the repeated substrings in the title meet a preset proportion threshold;
and a third clearing unit, configured to clear the fifth specified text data from the fourth specified text data to obtain the third text data.
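A hedged sketch of these six units as a single filter; the length, punctuation and proportion thresholds are illustrative placeholders:

    import re

    def check_smoothness(dataset, max_len=40, max_punct=4, repeat_ratio=0.5):
        # Sketch of the language smoothing rule; all three thresholds are
        # illustrative placeholders for the preset values.
        kept = []
        for record in dataset:
            title = record["title"]
            if len(title) > max_len:          # over-long title ("first specified")
                continue
            if len(re.findall(r"[,.!?;:，。！？；：]", title)) > max_punct:
                continue                      # excessive punctuation ("third specified")
            # Repeated-substring check ("fifth specified"): a 3-gram that
            # recurs and covers at least repeat_ratio of the title.
            repeated = False
            for i in range(max(len(title) - 2, 0)):
                count = title.count(title[i:i+3])
                if count > 1 and count * 3 >= repeat_ratio * len(title):
                    repeated = True
                    break
            if repeated:
                continue
            kept.append(record)
        return kept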
In this embodiment, the operations performed by the above modules or units correspond respectively to the steps of the artificial intelligence-based title generation method in the foregoing embodiment, and are not described herein again.
In some optional implementations of the present embodiment, the artificial intelligence based title generation apparatus further includes:
the second cleaning module is used for cleaning the characters of the target title based on a preset cleaning rule to obtain a first title;
the conversion module is used for converting the first title based on a preset processing specification to obtain a second title;
and the storage module is used for storing the second title.
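A minimal sketch of this post-processing chain, assuming whitespace collapse and truncation as the processing specification and a hypothetical title_store as the persistence layer:

    import re

    def postprocess_title(target_title, max_len=30):
        # Character cleaning, then a conversion standing in for the preset
        # processing specification; both steps are illustrative assumptions.
        first_title = re.sub(r"[\x00-\x1f\u200b]", "", target_title)
        second_title = " ".join(first_title.split())[:max_len]
        return second_title

    # title_store below is a hypothetical persistence layer, e.g. a database
    # wrapper; the application does not specify the storage backend.
    # title_store.save(postprocess_title(generated_title))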
In this embodiment, the operations performed by the above modules or units correspond respectively to the steps of the artificial intelligence-based title generation method in the foregoing embodiment, and are not described herein again.
In order to solve the above technical problems, an embodiment of the present application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to one another via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or other computing device. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touch pad, a voice control device or the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store the operating system and the various application software installed on the computer device 4, such as the computer readable instructions of the artificial intelligence-based title generation method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example to execute the computer readable instructions of the artificial intelligence-based title generation method.
The network interface 43 may comprise a wireless network interface or a wired network interface; the network interface 43 is typically used to establish a communication connection between the computer device 4 and other electronic devices.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the artificial intelligence-based title generation method as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is preferred. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to perform the methods described in the embodiments of the present application.
The embodiments described above are only some, but not all, of the embodiments of the present application; the preferred embodiments are given in the drawings, but they do not limit the patent scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the protection scope of the application.

Claims (10)

1. An artificial intelligence based title generation method, comprising the following steps:
acquiring a pre-acquired title data set; the title data set comprises a plurality of pieces of data, and each piece of data comprises a title and an article;
cleaning the titles in the title data set based on a plurality of preset data cleaning rules to obtain cleaned target text data;
Training a pre-training language model by using the target text data as training data based on a preset countermeasure training strategy and a noise reduction strategy to obtain a title generation model; wherein the pre-trained language model is a language model adopting a T5 encoder-decoder architecture;
acquiring a target text to be identified;
and inputting the target text into the title generation model, and processing the target text through the title generation model to generate a target title corresponding to the target text.
2. The title generation method based on artificial intelligence according to claim 1, wherein the step of training a pre-training language model by using the target text data as training data based on a preset countermeasure training strategy and a noise reduction strategy to obtain a title generation model specifically comprises:
performing expansion search on the target text data by using a preset countermeasure training algorithm to obtain a countermeasure sample;
masking the countermeasure sample to obtain a masking training sample;
inputting the mask training sample into the pre-training language model, and training the pre-training language model by using the mask training sample to obtain a first pre-training language model;
Adjusting parameters of the first pre-training language model based on a preset trust loss function until the trust loss function meets a preset convergence condition to obtain a trained second pre-training language model;
and taking the second pre-training language model as the title generation model.
3. The title generation method based on artificial intelligence according to claim 1, wherein the step of cleaning the title in the title data set based on preset multiple data cleaning rules to obtain cleaned target text data specifically comprises:
cleaning the titles in the title data set based on a preset character cleaning rule to obtain first text data;
cleaning titles in the first text data based on a preset fact consistency rule to obtain second text data;
cleaning the titles in the second text data based on a preset language smoothing rule to obtain third text data;
and taking the third text data as the target text data.
4. The title generation method based on artificial intelligence according to claim 3, wherein the step of cleaning the title in the title data set based on a preset character cleaning rule to obtain first text data specifically comprises:
Deleting a first character which accords with a preset abnormal condition in a title contained in the title data set to obtain processed fourth text data;
replacing a second character which accords with a preset character conversion condition in a title contained in the fourth text data to obtain fifth text data;
acquiring the title length of a title in the fifth text data, and screening a specified title with the title length within a preset first length range from the fifth text data;
and screening, from the fifth text data, sixth text data corresponding to the specified title to obtain the first text data.
5. The title generation method based on artificial intelligence according to claim 3, wherein the step of cleaning the title in the first text data based on a preset fact consistency rule to obtain second text data specifically comprises:
for each piece of first text data, screening seventh text data of which the maximum common subsequence length between titles and articles is smaller than a preset first length threshold value from the first text data;
deleting the seventh text data from the first text data to obtain eighth text data;
For each piece of eighth text data, screening ninth text data with the number of entities, the title of which does not appear in the corresponding article, greater than a preset first number threshold value from the eighth text data;
and deleting the ninth text data from the eighth text data to obtain the second text data.
6. The title generation method based on artificial intelligence according to claim 3, wherein the step of cleaning the title in the second text data based on a preset language smoothing rule to obtain third text data specifically comprises:
for each piece of second text data, screening out, from the second text data, first specified text data whose title length is greater than a preset second length threshold;
clearing the first specified text data from the second text data to obtain second specified text data;
for each piece of second specified text data, screening out, from the second specified text data, third specified text data whose number of punctuation marks is greater than a preset second number threshold;
clearing the third specified text data from the second specified text data to obtain fourth specified text data;
for each piece of fourth specified text data, acquiring, from the fourth specified text data, fifth specified text data in which the repeated substrings in the title meet a preset proportion threshold;
and clearing the fifth specified text data from the fourth specified text data to obtain the third text data.
7. The artificial intelligence based title generation method according to claim 1, further comprising, after the step of inputting the target text into the title generation model, processing the target text through the title generation model and generating a target title corresponding to the target text:
performing character cleaning on the target title based on a preset cleaning rule to obtain a first title;
converting the first title based on a preset processing specification to obtain a second title;
and storing the second title.
8. An artificial intelligence based title generation apparatus, comprising:
the first acquisition module is used for acquiring a pre-acquired title data set; the title data set comprises a plurality of pieces of data, and each piece of data comprises a title and an article;
The first cleaning module is used for cleaning the titles in the title data set based on a plurality of preset data cleaning rules to obtain cleaned target text data;
the training module is used for training the pre-training language model by using the target text data as training data based on a preset countermeasure training strategy and a noise reduction strategy to obtain a title generation model; wherein the pre-trained language model is a language model adopting a T5 encoder-decoder architecture;
the second acquisition module is used for acquiring a target text to be identified;
and the generation module is used for inputting the target text into the title generation model, processing the target text through the title generation model and generating a target title corresponding to the target text.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the artificial intelligence based title generation method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the artificial intelligence based title generation method of any of claims 1 to 7.
CN202310526678.2A 2023-05-11 2023-05-11 Title generation method, device, equipment and storage medium based on artificial intelligence Pending CN116561298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310526678.2A CN116561298A (en) 2023-05-11 2023-05-11 Title generation method, device, equipment and storage medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310526678.2A CN116561298A (en) 2023-05-11 2023-05-11 Title generation method, device, equipment and storage medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN116561298A true CN116561298A (en) 2023-08-08

Family

ID=87491127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310526678.2A Pending CN116561298A (en) 2023-05-11 2023-05-11 Title generation method, device, equipment and storage medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN116561298A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093717A (en) * 2023-10-20 2023-11-21 湖南财信数字科技有限公司 Similar text aggregation method, device, equipment and storage medium thereof
CN117093717B (en) * 2023-10-20 2024-01-30 湖南财信数字科技有限公司 Similar text aggregation method, device, equipment and storage medium thereof

Similar Documents

Publication Publication Date Title
Ahmad et al. Detection and classification of social media-based extremist affiliations using sentiment analysis techniques
US10380236B1 (en) Machine learning system for annotating unstructured text
US8452772B1 (en) Methods, systems, and articles of manufacture for addressing popular topics in a socials sphere
US10073834B2 (en) Systems and methods for language feature generation over multi-layered word representation
CN110929145B (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN116629275A (en) Intelligent decision support system and method based on big data
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
US20210256115A1 (en) Method and electronic device for generating semantic representation of document to determine data security risk
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN113010679A (en) Question and answer pair generation method, device and equipment and computer readable storage medium
Wang et al. Cyber threat intelligence entity extraction based on deep learning and field knowledge engineering
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN111291551A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111767714A (en) Text smoothness determination method, device, equipment and medium
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN113326347B (en) Syntactic information perception author attribution method
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
Amato et al. A hybrid approach for document analysis in digital forensic domain
Li et al. Named entity recognition for Chinese based on global pointer and adversarial training
CN113366511A (en) Named entity identification and extraction using genetic programming
Singh et al. Intelligent Text Mining Model for English Language Using Deep Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination