CN116975219A - Data processing method, device, computer equipment and storage medium - Google Patents

Data processing method, device, computer equipment and storage medium

Info

Publication number
CN116975219A
CN116975219A (application number CN202310521840.1A)
Authority
CN
China
Prior art keywords
text
sentence
abstract
model
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310521840.1A
Other languages
Chinese (zh)
Inventor
Lin Chen (林晨)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310521840.1A priority Critical patent/CN116975219A/en
Publication of CN116975219A publication Critical patent/CN116975219A/en
Pending legal-status Critical Current

Classifications

    • G06F 16/3329 - Information retrieval; querying; natural language query formulation or dialogue systems
    • G06F 16/3344 - Information retrieval; querying; query execution using natural language analysis
    • G06F 16/355 - Information retrieval; clustering or classification of unstructured textual data; class or cluster creation or modification
    • G06F 18/23213 - Pattern recognition; non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/258 - Handling natural language data; heading extraction; automatic titling; numbering
    • G06F 40/30 - Handling natural language data; semantic analysis
    • G06N 3/0455 - Neural networks; auto-encoder networks; encoder-decoder networks
    • G06N 3/084 - Neural network learning methods; backpropagation, e.g. using gradient descent
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a data processing method and apparatus, a computer device, and a storage medium, which can be applied to artificial intelligence scenarios and include the following steps: when an original text is obtained, dividing its N sentences based on the influence degree of each sentence and an influence degree threshold, to obtain a positive sample set and a negative sample set; obtaining a first pre-training model; selecting, from the positive sample set, sentences matching the set number of abstract sentences as a first abstract, and determining a first abstract semantic vector corresponding to the first abstract; selecting, from the negative sample set, sentences matching the set number of abstract sentences as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract; and performing contrast learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector, to obtain a second pre-training model, where the second pre-training model is used for processing the extractive abstract task. By adopting the embodiments of the present application, the accuracy of abstract generation can be improved.

Description

Data processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, a computer device, and a storage medium.
Background
As the amount of text data generated on the Internet keeps growing, the problem of text information overload becomes increasingly serious, and it is necessary to perform a kind of dimensionality reduction on various types of text. The text abstract is one of the important means for this, namely converting the text data into a short abstract containing the key information.
For the extractive text abstract task, when facing a piece of text data whose abstract is to be generated, a text abstract containing the key information is traditionally extracted from the text data manually, based on the experience of an object (for example, a user). This means that the conventional way of generating an extractive text abstract relies heavily on manual experience and is influenced by the subjective factors of the object; that is, the extracted text abstracts produced by different objects for the same business text may differ, which seriously affects the accuracy of abstract generation.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, computer equipment and a storage medium, which can improve the accuracy of abstract generation.
An aspect of an embodiment of the present application provides a data processing method, including:
When an original text comprising N sentences is obtained, dividing the N sentences based on the influence degree of each sentence and an influence degree threshold, to obtain a positive sample set and a negative sample set corresponding to the original text;
acquiring a first pre-training model; the first pre-training model is used for determining sample semantic vectors corresponding to each sentence respectively; the text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence respectively;
selecting, from the positive sample set, sentences matching the set number of abstract sentences as a first abstract, and determining a first abstract semantic vector corresponding to the first abstract based on the sample semantic vectors corresponding to the sentences in the first abstract;
selecting, from the negative sample set, sentences matching the set number of abstract sentences as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract based on the sample semantic vectors corresponding to the sentences in the second abstract;
based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector, performing contrast learning on the first pre-training model to obtain a second pre-training model; the second pre-training model is used for processing the extracted abstract task.
An aspect of an embodiment of the present application provides a data processing apparatus, including:
the sentence dividing module is used for dividing the N sentences respectively based on the influence degree and the influence degree threshold value of each sentence when the original text comprising the N sentences is acquired, so as to obtain a positive sample set and a negative sample set corresponding to the original text;
the model acquisition module is used for acquiring a first pre-training model; the first pre-training model is used for determining sample semantic vectors corresponding to each sentence respectively; the text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence respectively;
the first abstract extraction module is used for selecting, from the positive sample set, sentences matching the set number of abstract sentences as a first abstract, and determining a first abstract semantic vector corresponding to the first abstract based on the sample semantic vectors corresponding to the sentences in the first abstract;
the second abstract extraction module is used for selecting, from the negative sample set, sentences matching the set number of abstract sentences as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract based on the sample semantic vectors corresponding to the sentences in the second abstract;
the contrast learning module is used for carrying out contrast learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector to obtain a second pre-training model; the second pre-training model is used for processing the extracted abstract task.
Wherein, sentence dividing module includes:
a model acquisition unit for acquiring a first text abstract model when acquiring an original text including N sentences;
the abstract prediction unit is used for inputting N sentences into the first text abstract model, and respectively carrying out abstract prediction on each sentence through the first text abstract model to obtain the influence degree of each sentence;
the sentence dividing unit is used for dividing N sentences respectively based on the influence degree of each sentence and the influence degree threshold value to obtain a positive sample set and a negative sample set corresponding to the original text.
Wherein, this abstract prediction unit includes:
the first representation acquisition subunit is used for inputting N sentences into the first text abstract model, and acquiring first sentence representations corresponding to each sentence respectively to obtain N first sentence representations; the first text summarization model includes a text encoder and a text decoder;
the second representation acquisition subunit is used for respectively encoding each sentence through the text encoder and N first sentence representations to obtain second sentence representations corresponding to each sentence;
the decoding processing subunit is used for inputting the N second sentence representations into the text decoder, and respectively decoding the N sentences through the text decoder and the N second sentence representations to obtain the influence degree of each sentence.
Wherein the N sentences include a sentence S_i; i is a positive integer less than or equal to N; the N first sentence representations include a first sentence representation E_i corresponding to the sentence S_i; and the text encoder comprises a first text encoder and a second text encoder;
wherein the second representation acquisition subunit is further specifically configured to:
perform a first encoding process on the sentence S_i through the first text encoder and the first sentence representation E_i, to obtain a first encoding vector corresponding to the sentence S_i;
when the first encoding vectors respectively corresponding to the N sentences are obtained, input the N first encoding vectors into the second text encoder, and perform a second encoding process on the sentence S_i through the second text encoder and the N first encoding vectors, to obtain a second encoding vector corresponding to the sentence S_i;
take the second encoding vector corresponding to the sentence S_i as a second sentence representation D_i corresponding to the sentence S_i.
Wherein the influence threshold comprises a first threshold;
the sentence dividing unit includes:
the first traversal subunit is used for traversing the N sentences and determining the traversed sentences as sentences to be divided;
the first adding subunit is configured to add the sentence to be divided to the positive sample set corresponding to the original text if the influence degree of the sentence to be divided is greater than or equal to a first threshold;
And the second adding subunit is used for adding the sentences to be divided into the negative sample set corresponding to the original text if the influence degree of the sentences to be divided is smaller than the first threshold value.
Wherein the influence threshold comprises a second threshold and a third threshold; the second threshold is greater than the third threshold;
the sentence dividing unit further includes:
the second traversing subunit is used for traversing the N sentences and determining the traversed sentences as sentences to be divided;
the third adding subunit is configured to add the sentence to be divided to the positive sample set corresponding to the original text if the influence degree of the sentence to be divided is greater than or equal to the second threshold;
the filtering subunit is used for filtering the sentences to be divided if the influence degree of the sentences to be divided is smaller than the second threshold value and larger than the third threshold value;
and the fourth adding subunit is configured to add the sentence to be divided to the negative sample set corresponding to the original text if the influence degree of the sentence to be divided is less than or equal to the third threshold.
Wherein the N sentences include a sentence S_i; i is a positive integer less than or equal to N;
the apparatus further comprises:
a sentence input module, used for inputting the sentence S_i into the first pre-training model;
an initial sample vector determination module, used for determining an initial sample vector corresponding to the sentence S_i based on the character representation of each character in the sentence S_i; the character representation of a character is jointly determined by the word representation, the segment representation and the character position representation corresponding to the character;
a sample semantic vector determination module, used for encoding the sentence S_i through the first pre-training model and the initial sample vector corresponding to the sentence S_i, to obtain a sample semantic vector corresponding to the sentence S_i.
Wherein the first abstract semantic vector and the second abstract semantic vector are both average abstract semantic vectors; an average abstract semantic vector is obtained by averaging the sample semantic vectors of the sentences in a sentence set; and the sentence set includes the first abstract and the second abstract.
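For illustration only, the following minimal sketch shows how such averaging could be realized; it is not the patented implementation itself, and the use of PyTorch, the names and the dimensions are assumptions made for the example.

```python
import torch

def average_abstract_semantic_vector(sample_vectors: torch.Tensor) -> torch.Tensor:
    """Average the sample semantic vectors of the sentences in a sentence set
    (shape: [num_sentences, hidden_dim]) into one abstract semantic vector."""
    return sample_vectors.mean(dim=0)

# Hypothetical usage: 3 sentences drawn from the positive or negative sample set,
# and the 7 sentences of the original text (hidden size 768 is illustrative).
first_abstract_vector = average_abstract_semantic_vector(torch.randn(3, 768))
second_abstract_vector = average_abstract_semantic_vector(torch.randn(3, 768))
text_semantic_vector = average_abstract_semantic_vector(torch.randn(7, 768))
```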
Wherein, this contrast learning module includes:
the loss function acquisition unit is used for acquiring a model loss function of contrast learning;
the loss determination unit is used for determining model loss corresponding to the model loss function based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector;
the training result determining unit is used for training the first pre-training model based on model loss to obtain a model training result;
And the first model determining unit is used for taking the first pre-training model meeting the model convergence condition as a second pre-training model if the model training result indicates that the trained first pre-training model meets the model convergence condition.
Wherein, this contrast learning module still includes:
the parameter adjusting unit is used for adjusting the model parameters of the first pre-training model based on the model loss function which does not meet the model convergence condition if the model training result indicates that the trained first pre-training model does not meet the model convergence condition;
and the second model determining unit is used for taking the first pre-trained model after the model parameters are adjusted as a transition model, training the transition model, and taking the transition model meeting the model convergence condition as a second pre-trained model when the trained transition model meets the model convergence condition.
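As an illustrative aside, the contrast learning step described above could be sketched as follows. The embodiment does not fix a particular model loss function here, so the margin-based objective over cosine similarities, the PyTorch usage and all names are assumptions for the example only.

```python
import torch
import torch.nn.functional as F

def contrast_loss(text_vec, pos_abstract_vecs, neg_abstract_vecs, margin=1.0):
    """Encourage first-abstract (positive) vectors to be closer to the text
    semantic vector than second-abstract (negative) vectors by a margin."""
    pos_sim = F.cosine_similarity(pos_abstract_vecs, text_vec.unsqueeze(0), dim=-1)
    neg_sim = F.cosine_similarity(neg_abstract_vecs, text_vec.unsqueeze(0), dim=-1)
    # Hinge over every positive/negative pair of abstract semantic vectors.
    return F.relu(margin - pos_sim.unsqueeze(1) + neg_sim.unsqueeze(0)).mean()

# Hypothetical training step on the first pre-training model:
# loss = contrast_loss(text_semantic_vector, first_abstract_vectors, second_abstract_vectors)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```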
Wherein the apparatus further comprises:
the first building module is used for building an initial abstract model for processing the extraction abstract task based on the second pre-training model;
the first sample acquisition module is used for acquiring a sample text aiming at the initial abstract model and a sample label corresponding to the sample text; the sample tag is used for indicating the actual abstract of the sample text;
The prediction abstract determining module is used for inputting the sample text into the initial abstract model, and carrying out abstract prediction on the sample text through the initial abstract model to obtain a prediction abstract corresponding to the sample text;
the first training module is used for carrying out fine tuning training on the initial abstract model based on the actual abstract and the predicted abstract to obtain a second text abstract model; the second text summarization model is used to predict a text summary of the business text.
Wherein the set number of abstract sentences is M; M is a positive integer;
the apparatus further comprises:
the second building module is used for building an initial abstract model for processing the extraction abstract task based on the second pre-training model;
the second sample acquisition module is used for acquiring a sample text aiming at the initial abstract model, and respectively carrying out coding processing on X sentences in the sample text through the initial abstract model to obtain X semantic coding vectors; x is a positive integer;
the clustering module is used for carrying out clustering processing on the X semantic coding vectors based on the M initial cluster centers to obtain M clustering clusters;
the second training module is used for training the initial abstract model based on M clustering clusters to obtain a second text abstract model; the second text summarization model is used to predict a text summary of the business text.
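A minimal sketch of this clustering step is given below, assuming K-means over the X semantic coding vectors; scikit-learn, the pseudo-labelling idea and all names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_semantic_vectors(semantic_vectors: np.ndarray, m: int):
    """Group the X semantic coding vectors of a sample text into M clusters;
    the resulting clusters can supply training signal (e.g. the sentence
    closest to each cluster centre) for the second text abstract model."""
    kmeans = KMeans(n_clusters=m, n_init=10, random_state=0).fit(semantic_vectors)
    return kmeans.labels_, kmeans.cluster_centers_

# Hypothetical usage: X = 12 sentences encoded into 768-dimensional vectors, M = 3.
labels, centers = cluster_semantic_vectors(np.random.randn(12, 768), m=3)
```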
Wherein the apparatus further comprises:
the query text determining module is used for determining query text in the service query request when the service query request is acquired; the service inquiry request is sent by the service terminal equipment;
the text abstract obtaining module is used for obtaining text abstracts corresponding to the H business texts respectively; h is a positive integer; the text abstract is obtained by invoking a second text abstract model and carrying out abstract prediction on a business text;
the similarity determining module is used for determining the text similarity between the query text and the H text summaries respectively and obtaining the service text corresponding to the text summary with the highest text similarity;
and the service data determining module is used for determining service data for returning to the service terminal equipment based on the acquired service text so as to enable the service terminal equipment to display the service data.
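The retrieval flow above can be sketched roughly as follows; cosine similarity over encoded vectors is only one possible choice of text similarity, and all function and variable names are assumptions.

```python
import numpy as np

def best_matching_business_text(query_vector: np.ndarray, summary_vectors: np.ndarray) -> int:
    """Return the index of the business text whose text abstract is most
    similar (by cosine similarity) to the query text."""
    q = query_vector / np.linalg.norm(query_vector)
    s = summary_vectors / np.linalg.norm(summary_vectors, axis=1, keepdims=True)
    return int(np.argmax(s @ q))

# Hypothetical usage: H = 5 candidate text abstracts encoded as 768-dimensional vectors.
best_index = best_matching_business_text(np.random.randn(768), np.random.randn(5, 768))
```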
In one aspect, the application provides a computer device comprising: a processor, a memory, a network interface;
the processor is connected with the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program so as to enable the computer device to execute the method provided by the embodiment of the application.
In one aspect, the present application provides a computer readable storage medium storing a computer program adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method provided by the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program stored on a computer readable storage medium; the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device performs the method in the embodiment of the present application.
In the embodiments of the present application, when the computer device obtains an original text including N sentences, it can divide the N sentences based on the influence degree of each sentence (i.e., the probability that the current sentence will form part of the abstract) and an influence degree threshold, so as to obtain a positive sample set and a negative sample set corresponding to the original text. The computer device may then obtain a first pre-training model (i.e., an untrained language representation model), which is used for determining a sample semantic vector for each sentence; the text semantic vector of the original text is determined based on the sample semantic vectors corresponding to the sentences. The computer device may select, from the positive sample set, sentences matching the set number of abstract sentences as a first abstract, and further determine the abstract semantic vector corresponding to the first abstract (i.e., a first abstract semantic vector) based on the sample semantic vectors corresponding to the sentences in the first abstract. Similarly, the computer device may select, from the negative sample set, sentences matching the set number of abstract sentences as a second abstract, and further determine the abstract semantic vector corresponding to the second abstract (i.e., a second abstract semantic vector) based on the sample semantic vectors corresponding to the sentences in the second abstract. The computer device may then perform contrast learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector, to obtain a second pre-training model, where the second pre-training model can be used for processing the extractive abstract task. The embodiments of the present application therefore provide a contrast-learning-based pre-training method for extractive text summarization: after a positive sample set and a negative sample set are constructed, a first abstract and a second abstract are extracted from them respectively, and contrast learning is then performed on the first pre-training model based on the abstract semantic vectors of these two types of abstracts. This means that the second pre-training model obtained after contrast learning has been trained specifically for the extractive abstract task, so that the accuracy of abstract generation can be improved when the extractive abstract task is subsequently processed.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario for performing contrast learning on a first pre-training model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic view of a scenario in which influence prediction is performed based on a first text abstract model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a model structure of a text pre-training model according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a scenario associated with content production provided by an embodiment of the present application;
FIG. 8 is a schematic flow chart of summary prediction according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be appreciated that the embodiment of the application provides a contrast learning-based extraction type text abstract pre-training method, which is applied to the field of artificial intelligence. The extraction text abstract is a method for directly selecting a plurality of important sentences from the original text, and sequencing and recombining the sentences to form the abstract. Generally, the extraction-type text summaries can be divided into two main categories: unsupervised (self-supervised) extraction text summaries and supervised extraction text summaries.
Among them, artificial intelligence (Artificial Intelligence, abbreviated as AI) is a theory, method, technique and application system that simulates, extends and expands human intelligence by digital computer or calculation controlled by digital computer, senses environment, acquires knowledge and obtains an optimal result by using knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field therefore involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
Among them, machine Learning (ML) is a multi-domain interdisciplinary, and involves multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
Contrast learning is a form of self-supervised learning in deep learning; that is, it does not depend on annotated data and instead learns knowledge from unlabeled data sets. A pre-training method generally gathers a large amount of training data collected at low cost, learns the commonalities in the data through a certain pre-training objective, and then "transplants" these commonalities into a model for a specific task (for example, an extractive text abstract task); the model is then fine-tuned with a small amount of labeled data from the relevant specific field. In this way, pre-training provides a good foundation for the model, and the model only needs to learn the task-specific part on top of the commonalities.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 10F and a cluster of terminal devices. The cluster of terminal devices may comprise one or more terminal devices, the number of which will not be limited here. As shown in fig. 1, the terminal device cluster may specifically include terminal devices 100a, 100b, 100c, …, and 100n. As shown in fig. 1, the terminal devices 100a, 100b, 100c, …, 100n may respectively perform network connection with the above-mentioned server 10F, so that each terminal device may perform data interaction with the server 10F through the network connection. The network connection is not limited to a connection manner, and may be directly or indirectly connected through a wired communication manner, may be directly or indirectly connected through a wireless communication manner, or may be other manners, which is not limited herein.
Wherein each terminal device in the terminal device cluster may include: smart terminals with data processing functions such as smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, vehicle-mounted terminals, smart televisions and the like. It should be understood that each terminal device in the cluster of terminal devices shown in fig. 1 may be provided with a service application (i.e. an application client), which may interact with the server 10F shown in fig. 1, respectively, when the application client is running in each terminal device. The application clients may include, among other things, social clients, multimedia clients (e.g., video clients), entertainment clients (e.g., game clients), information flow clients, educational clients, live clients, and the like. The application client may be an independent client, or may be an embedded sub-client integrated in a client (for example, a social client, an educational client, and a multimedia client), which is not limited herein.
As shown in fig. 1, the server 10F in the embodiment of the present application may be a server corresponding to the application client. The server 10F may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The embodiment of the application does not limit the number of servers.
For easy understanding, the embodiment of the present application may select one terminal device from a plurality of terminal devices shown in fig. 1 as a service terminal device. For example, the embodiment of the present application may use the terminal device 100a shown in fig. 1 as a service terminal device, where a service application (i.e., a client) may be integrated. At this time, the service terminal device may implement data interaction between the service data platform corresponding to the client and the server 10F.
The business application can run a target text abstract model, which can be used for abstract prediction of the text abstract of the business text, and is constructed based on a text pre-training model specially trained for the extraction type text abstract task. The text pre-training model may be a language representation model for encoding text data, for example, the text pre-training model may be a BERT model (Bidirectional Encoder Representation from Transformers), or may be another language representation model derived from the BERT model (for example, hiBERT model, roBERTa model, alBERT model, macBERT model, etc.), which will not be limited herein. The text pre-training model which is not trained can be called a first pre-training model, and the text pre-training model which is already trained can be called a second pre-training model.
In the embodiment of the present application, the computer device with the model pre-training function may be the server 10F shown in fig. 1, or may be any one of the terminal devices in the terminal device cluster shown in fig. 1, for example, the terminal device 100b, which will not be limited herein in specific form.
It should be appreciated that the computer device herein may train a first pre-training model (e.g., a BERT model) using contrast learning once the model is acquired. For example, when the computer device obtains an original text including N sentences, the N sentences need to be divided based on the influence degree of each sentence and an influence degree threshold, so as to obtain a positive sample set and a negative sample set corresponding to the original text. In the embodiments of the present application, sentences in the positive sample set may be called positive samples, and sentences in the negative sample set may be called negative samples. Furthermore, the influence degree of a sentence (i.e., the sentence importance score) may be used to indicate the probability that the current sentence constitutes part of the abstract; that is, the greater the influence degree of the current sentence, the more important it is, and the more likely it is to be extracted to form the abstract of the original text.
Further, the computer device may map each of the N sentences into the abstract semantic space through the first pre-training model, so as to obtain a sample semantic vector corresponding to each sentence. The text semantic vector corresponding to the original text is determined by the sample semantic vector of each sentence, for example, the text semantic vector is obtained by averaging the sample semantic vectors of the N sentences.
The computer device then needs to group positive and negative samples based on the positive and negative sample sets and the set number of digests (e.g., 3) to train the first pre-training model using contrast learning. Here, the number of digests set is M, where M is a positive integer, and the number of digests set by a business object (for example, a user) to form a digest may be dynamically adjusted according to actual situations, which will not be limited herein.
For example, the computer device may select sentences corresponding to the set number of digests from the positive sample set as the first digest (i.e., the positive sample digest), and may further determine a first digest semantic vector corresponding to the first digest based on sample semantic vectors corresponding to the sentences in the first digest. Similarly, the computer device may also select sentences corresponding to the set number of digests from the negative sample set as a second digest (i.e., a negative sample digest), and may further determine a second digest semantic vector corresponding to the second digest based on sample semantic vectors corresponding to the sentences in the second digest. At this time, the computer device may perform contrast learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector, and the text semantic vector, thereby obtaining a second pre-training model.
Therefore, the second pre-training model is obtained by training the first pre-training model based on the abstract semantic vectors corresponding to the positive sample abstract and the negative sample abstract, which means that the second pre-training model is trained specifically for the extractive abstract task. When the extractive abstract task is processed later, the semantic vector of each sentence can therefore be represented more accurately by the second pre-training model; in the subsequent abstract generation process, sentences matching the set number of abstract sentences can be extracted more accurately based on the semantic vector of each sentence, so that a text abstract that more accurately represents the business text (namely, the text data whose abstract is to be generated) is obtained, and the accuracy of abstract generation is improved.
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic diagram of a scenario for performing contrast learning on a first pre-training model according to an embodiment of the present application. As shown in fig. 2, the computer device in the embodiment of the present application may be any one of the terminal devices in the terminal device cluster shown in fig. 1, for example, the terminal device 100b, and the computer device may also be the server 10F shown in fig. 1, which is not limited herein.
As shown in fig. 2, the original text 20D obtained by the computer device may include N sentences, where N is a positive integer. For ease of explanation, the embodiment of the present application takes N = 7 as an example to describe a specific implementation of performing contrast learning on the first pre-training model (for example, the pre-training model 20m). For example, the sentences of the original text 20D may specifically include a sentence S_1, a sentence S_2, a sentence S_3, a sentence S_4, a sentence S_5, a sentence S_6 and a sentence S_7.
Further, the computer device may input the original text 20D into the pre-training model 20m, so that the pre-training model 20m maps each sentence in the original text 20D into an abstract semantic space, that is, performs encoding processing on each sentence to obtain a sample semantic vector corresponding to each sentence. Wherein the semantic vector corresponding to the original text 20D (i.e., the text semantic vector 20E shown in fig. 2) is determined based on the sample semantic vector corresponding to each sentence.
At the same time, the computer device may obtain the influence degree (i.e., the sentence importance score) of each sentence, and may further divide the 7 sentences based on the influence degree of each sentence, so as to obtain the positive and negative sample sets corresponding to the original text 20D (for example, the positive sample set 2p and the negative sample set 2n shown in fig. 2). As shown in fig. 2, the positive sample set 2p may include the sentence S_1, the sentence S_2, the sentence S_4 and the sentence S_7, and the negative sample set 2n may include the sentence S_3, the sentence S_5 and the sentence S_6.
The computer device may then choose sentences from the positive sample set 2p that correspond to the set number of digests (e.g., 3) as the first digest (i.e., positive sample digest), i.e., the computer device may randomly extract 3 sentences from the positive sample set 2p as the first digest. Further, the computer device may determine a first abstract semantic vector corresponding to the first abstract based on sample semantic vectors corresponding to sentences in the first abstract.
For example, the computer device may extract the sentence S_1, the sentence S_2 and the sentence S_4 from the positive sample set 2p as an abstract 21Z_1, determine the sample semantic vector of the sentence S_1, the sample semantic vector of the sentence S_2 and the sample semantic vector of the sentence S_4 from the N sample semantic vectors, and then determine, based on these 3 sample semantic vectors, the first abstract semantic vector corresponding to the abstract 21Z_1 (for example, the abstract semantic vector 21E_1 shown in fig. 2). Similarly, since the first abstract finally obtained by the computer device may include the abstract 21Z_1, the abstract 21Z_2 and the abstract 21Z_3 shown in fig. 2, the computer device can, following the way in which the abstract semantic vector 21E_1 is determined, determine in turn the first abstract semantic vector corresponding to the abstract 21Z_2 (for example, the abstract semantic vector 21E_2 shown in fig. 2) and the first abstract semantic vector corresponding to the abstract 21Z_3 (for example, the abstract semantic vector 21E_3 shown in fig. 2).
Similarly, the computer device may select sentences corresponding to the set number of digests (e.g., 3) from the negative sample set 2n as the second digest (i.e., negative sample digest), i.e., the computer device may randomly extract 3 sentences from the negative sample set 2n as the second digest. Further, the computer device may determine a second abstract semantic vector corresponding to the second abstract based on sample semantic vectors corresponding to sentences in the second abstract.
For example, since the negative sample set 2n contains 3 sentences, the computer device can directly take the 3 sentences in the negative sample set 2n (the sentence S_3, the sentence S_5 and the sentence S_6) as an abstract 22Z_1, determine the sample semantic vector of the sentence S_3, the sample semantic vector of the sentence S_5 and the sample semantic vector of the sentence S_6 from the N sample semantic vectors, and then determine, based on these 3 sample semantic vectors, the second abstract semantic vector corresponding to the abstract 22Z_1 (for example, the abstract semantic vector 22E_1 shown in fig. 2).
Finally, the computer device may perform contrast learning on the pre-training model 20m based on the first abstract semantic vectors (i.e., the abstract semantic vector 21E_1, the abstract semantic vector 21E_2 and the abstract semantic vector 21E_3), the second abstract semantic vector (i.e., the abstract semantic vector 22E_1) and the text semantic vector 20E, so that a second pre-training model (for example, the pre-training model 21m shown in fig. 2) may be obtained. The pre-training model 21m has thus been trained specifically for the extractive abstract task, which means that when the extractive abstract task is processed, each sentence in the text data whose abstract is to be generated (i.e., the business text) can be represented accurately by the pre-training model 21m. As a result, the probability that each sentence is extracted into the abstract (i.e., the influence degree of each sentence) can be estimated more accurately when the abstract is predicted based on the semantic vector of each sentence, a text abstract representing the business text can be generated more accurately, and the accuracy of abstract generation is thereby improved.
A specific implementation in which the computer device, after constructing the positive and negative sample abstracts (i.e., the first abstract and the second abstract), performs contrast learning on the first pre-training model through the abstract semantic vectors corresponding to the positive and negative sample abstracts and the text semantic vector corresponding to the original text, so as to obtain the second pre-training model used for processing the extractive abstract task, can be seen in the embodiments corresponding to fig. 3 to fig. 8 below.
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the application. As shown in fig. 3, the method may be performed by a computer device, which may be a terminal device (e.g., any one of the terminal devices in the terminal device cluster shown in fig. 1, e.g., the terminal device 100 b) or a server (e.g., the server 10F shown in fig. 1), which is not limited herein. For ease of understanding, embodiments of the present application will be described with the method being performed by a server as an example, and the method may include at least the following steps S101 to S105:
step S101, when an original text comprising N sentences is obtained, dividing the N sentences based on influence degree and influence degree threshold value of each sentence to obtain a positive sample set and a negative sample set corresponding to the original text.
Specifically, the computer device may acquire the influence degree (i.e., sentence importance score) of each sentence, respectively, when acquiring the original text including N sentences. The influence of a sentence may be manually evaluated by an evaluation object (e.g., a user with a specialized experience) for the sentence, or may be intelligently evaluated by the computer device for the sentence based on a text summarization model (i.e., a first text summarization model), which will not be limited herein. The influence degree of a sentence is used to indicate the probability that the sentence constitutes a summary, i.e. the larger the probability, the more important the current sentence is, the more likely it is to be extracted. Further, the computer device may divide the N sentences based on the influence degree of each sentence and the influence degree threshold, so as to obtain a positive sample set and a negative sample set corresponding to the original text.
The first text abstract model may be a trained supervised extraction text abstract model, and the first text abstract model is mainly used for predicting importance scores of each sentence, i.e. considering context information of each sentence in an original text, so as to more accurately predict influence of each sentence. The first text excerpt model may be any one of a BERT model or a derivative model of the BERT model, and a model structure of the first text excerpt model will not be defined herein.
It should be understood that, when an original text including N sentences is obtained, the computer device may obtain a first text abstract model, and further input the N sentences into the first text abstract model, and perform abstract prediction on each sentence through the first text abstract model to obtain an influence degree of each sentence, and further divide the N sentences based on the influence degree of each sentence and an influence degree threshold, so as to obtain a positive sample set and a negative sample set corresponding to the original text.
Because the first text abstract model can include the text encoder and the text decoder, when the N sentences are input into the first text abstract model, the computer equipment can acquire the first sentence representations corresponding to each sentence respectively to obtain N first sentence representations, and further can encode each sentence respectively through the text encoder and the N first sentence representations to obtain the second sentence representations corresponding to each sentence respectively.
Wherein the N sentences include a sentence S_i; i is a positive integer less than or equal to N; and the N first sentence representations include a first sentence representation E_i corresponding to the sentence S_i. Then, when the text encoder includes a first text encoder and a second text encoder, the computer device may perform a first encoding process on the sentence S_i through the first text encoder and the first sentence representation E_i, to obtain a first encoding vector corresponding to the sentence S_i. When the first encoding vectors respectively corresponding to the N sentences are obtained, the computer device may input the N first encoding vectors into the second text encoder, and perform a second encoding process on the sentence S_i through the second text encoder and the N first encoding vectors, to obtain a second encoding vector corresponding to the sentence S_i. At this time, the computer device may take the second encoding vector corresponding to the sentence S_i as a second sentence representation D_i corresponding to the sentence S_i.
Then, the computer device may input N second sentence representations to the text decoder, and decode the N sentences respectively through the text decoder and the N second sentence representations to obtain an influence degree of each sentence.
For ease of understanding, please further refer to fig. 4, which is a schematic view of a scene of performing influence degree prediction based on the first text abstract model according to an embodiment of the present application. As shown in fig. 4, the text summarization model 40m (i.e., the first text abstract model) in the embodiment of the present application may be exemplified by the HiBERT model to illustrate a specific implementation of predicting the influence degree of sentences. As shown in fig. 4, the model structure of the text summarization model 40m may include a text encoder 41Q and a text decoder 42Q, where the text encoder 41Q may include a text encoder Q_1 (e.g., a sentence-level Sent Encoder) and a text encoder Q_2 (e.g., a document-level Doc Encoder).
It will be appreciated that the text encoder Q_1 is a sentence-level encoder used for encoding the input characters; the text encoder Q_2 is a document-level encoder used for encoding the input sentences; and the text decoder 42Q is configured to determine, in combination with the context, whether each sentence is a sentence constituting the abstract, i.e., to determine the influence degree of each sentence. Specifically, the influence degree of a sentence can be determined by the following formula (1):
p(S_i | D) = softmax(W_S · D_i)    (1)
wherein p(S_i | D) represents the probability distribution of the sentence S_i (i.e., the sentence importance score, also called the influence degree), and is used to indicate the probability that the sentence S_i constitutes part of the text abstract of the original text (i.e., the text D); here i is a positive integer less than or equal to N, and N is the total number of sentences in the original text; D_i represents the vector representation of the sentence S_i output by the text encoder 41Q, i.e., the second sentence representation, namely the sentence representation of S_i output by the text encoder Q_2 shown in fig. 4; and W_S represents a model parameter to be learned by the text summarization model 40m.
It should be appreciated that the original text obtained by the computer device may be the original text 40D shown in fig. 4. The original text 40D may include N sentences; for ease of understanding, N is taken as 4 here, and the sentences may specifically include a sentence S_1, a sentence S_2, a sentence S_3 and a sentence S_4. The computer device may then input these 4 sentences together into the text summarization model 40m shown in fig. 4, and the text encoder 41Q and the text decoder 42Q in the text summarization model 40m perform abstract prediction for each sentence separately.
When the computer device encodes the N sentences through the text encoder 41Q, it can obtain the first sentence representations respectively corresponding to the N sentences, which may include a first sentence representation E_1 of the sentence S_1, a first sentence representation E_2 of the sentence S_2, a first sentence representation E_3 of the sentence S_3 and a first sentence representation E_4 of the sentence S_4.
For example, taking the sentence S_1 as an example, the sentence S_1 may include a plurality of characters, and the first sentence representation E_1 may be determined based on the character representation of each character in the sentence S_1. The character representation of a character may be determined jointly by the word representation (Token Embeddings), the segment representation (Segment Embeddings) and the character position representation (Position Embeddings) corresponding to the character.
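A minimal sketch of such a BERT-style input layer is shown below; the vocabulary size, maximum length and hidden size are illustrative values, and the class and variable names are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class CharacterRepresentation(nn.Module):
    """A character's representation is the sum of its word (token) embedding,
    segment embedding and character position embedding."""
    def __init__(self, vocab_size=21128, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)

# Hypothetical usage: one sentence of 16 characters (batch size 1).
embed = CharacterRepresentation()
char_repr = embed(torch.randint(0, 21128, (1, 16)), torch.zeros(1, 16, dtype=torch.long))
```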
Since the text encoder Q_1 can fully consider each character in the sentence S_1, when the first sentence representation E_1 is input into the text encoder Q_1, the computer device can encode the sentence S_1 through the text encoder Q_1 and the first sentence representation E_1, to obtain a first encoding vector corresponding to the sentence S_1. Similarly, after the computer device inputs the other three first sentence representations into the text encoder Q_1, it can obtain the first encoding vector corresponding to the sentence S_2, the first encoding vector corresponding to the sentence S_3 and the first encoding vector corresponding to the sentence S_4.
Further, the computer device may take the 4 first encoding vectors as the input of the text encoder Q_2, so that the context information of each sentence in the original text 40D can be fully considered through the text encoder Q_2. For example, when the computer device inputs the 4 first encoding vectors into the text encoder Q_2, it may perform a second encoding process on the sentence S_1 through the text encoder Q_2 and the 4 first encoding vectors, to obtain a second encoding vector corresponding to the sentence S_1. At this time, the computer device may take the second encoding vector corresponding to the sentence S_1 as a second sentence representation D_1 corresponding to the sentence S_1, where the second sentence representation D_1 can represent the semantic information of the sentence S_1 more accurately. Similarly, the computer device may perform the second encoding process on the sentence S_2, the sentence S_3 and the sentence S_4 in turn through the text encoder Q_2 and the 4 first encoding vectors, to obtain a second sentence representation D_2 corresponding to the sentence S_2, a second sentence representation D_3 corresponding to the sentence S_3 and a second sentence representation D_4 corresponding to the sentence S_4. It will be appreciated that these 4 second sentence representations are the output features of the text encoder 41Q.
The computer device may then input the 4 second sentence representations into the text decoder 42Q, and decode each of the 4 sentences respectively through the text decoder, the 4 second sentence representations and the above formula (1), to obtain the influence degree of each sentence.
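The two-level encoding and scoring flow described above can be sketched as follows; the class name, dimensions and the sigmoid scoring head are assumptions for illustration and do not come from the patent.

import torch
import torch.nn as nn

class HierarchicalSummarizer(nn.Module):
    """Sentence-level encoder Q1, document-level encoder Q2, and a decoder that scores each sentence."""
    def __init__(self, hidden=768, layers=2, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.q1 = nn.TransformerEncoder(enc_layer, layers)  # encodes characters within one sentence
        self.q2 = nn.TransformerEncoder(enc_layer, layers)  # encodes sentence vectors across the document
        self.decoder = nn.Linear(hidden, 1)                 # scoring head standing in for the text decoder

    def forward(self, first_sentence_reprs):
        # first_sentence_reprs: (num_sentences, num_chars, hidden), one row of characters per sentence
        first_vectors = self.q1(first_sentence_reprs).mean(dim=1)      # first encoding vector per sentence
        second_reprs = self.q2(first_vectors.unsqueeze(0)).squeeze(0)   # second sentence representations D_i
        return torch.sigmoid(self.decoder(second_reprs)).squeeze(-1)    # influence degree per sentence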
Further, the computer device may obtain an influence threshold value for distinguishing positive and negative samples, and further may divide N sentences based on the influence degree of each sentence and the influence threshold value, so as to obtain a positive sample set and a negative sample set corresponding to the original text. Wherein the influence threshold here may be related to the total number of sentences N of the original text.
If the total number of sentences N of the original text is smaller than the sentence number threshold (e.g., 10), the total number of sentences of the original text may be considered to be relatively small. In order to ensure that enough samples can be obtained for model training, the computer device needs to divide all sentences into the positive and negative sample sets corresponding to the original text. In this case, the influence threshold obtained by the computer device for the original text may include one threshold, i.e., a first threshold (e.g., 0.7). The first threshold may be dynamically adjusted according to the actual situation, which will not be limited here.
For example, the computer device may traverse N sentences, determining the traversed sentences as sentences to be divided. If the influence degree of the sentence to be divided is greater than or equal to the first threshold, the computer device may determine the sentence to be divided as an important sentence (i.e., a positive sample sentence), and further may add the sentence to be divided to a positive sample set corresponding to the original text. Optionally, if the influence degree of the sentence to be divided is smaller than the first threshold, the computer device may determine the sentence to be divided as an irrelevant sentence (i.e. a negative sample sentence), and further may add the sentence to be divided to the negative sample set corresponding to the original text.
If the total number of sentences N of the original text is greater than or equal to the sentence number threshold (e.g., 10), the total number of sentences of the original text may be considered to be relatively large. In order to effectively improve the efficiency of model training, in the embodiment of the present application, not all sentences of the original text need to be used as sample sentences to participate in the model training of the first pre-training model; instead, the sentences whose importance is not significant may be filtered out, and the remaining sentences may then be divided into the positive and negative sample sets corresponding to the original text. This means that the influence threshold obtained by the computer device for the original text may include two thresholds, namely a second threshold (e.g., 0.8) and a third threshold (e.g., 0.2). The second threshold is greater than the third threshold, and both thresholds may be dynamically adjusted according to the actual situation, which will not be limited here.
For example, the computer device may traverse N sentences, determining the traversed sentences as sentences to be divided. If the influence degree of the sentence to be divided is greater than or equal to the second threshold value, the computer device may determine the sentence to be divided as an important sentence (i.e., a positive sample sentence), and further may add the sentence to be divided to the positive sample set corresponding to the original text. Optionally, if the influence degree of the sentence to be divided is smaller than the second threshold and larger than the third threshold, the computer device may determine the sentence to be divided as a sentence with insignificant importance (i.e. a sentence to be filtered), and may further filter the sentence to be divided. Optionally, if the influence degree of the sentence to be divided is less than or equal to the third threshold, the computer device may determine the sentence to be divided as an irrelevant sentence (i.e. a negative sample sentence), and further may add the sentence to be divided to the negative sample set corresponding to the original text.
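A minimal sketch of this threshold-based division is given below; the function name is hypothetical, and the sentence number threshold of 10 and the example threshold values simply reuse the numbers quoted above.

def divide_samples(sentences, influences, sentence_threshold=10,
                   first=0.7, second=0.8, third=0.2):
    """Split sentences into a positive and a negative sample set based on their influence degrees."""
    positive, negative = [], []
    if len(sentences) < sentence_threshold:
        # few sentences: keep every sentence, split on the single first threshold
        for s, score in zip(sentences, influences):
            (positive if score >= first else negative).append(s)
    else:
        # many sentences: filter out sentences of insignificant importance (between third and second)
        for s, score in zip(sentences, influences):
            if score >= second:
                positive.append(s)
            elif score <= third:
                negative.append(s)
    return positive, negative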
Step S102, a first pre-training model is acquired.
Specifically, after dividing N sentences into positive and negative sample sets, the computer device may obtain a language characterization model for encoding text data, and may further refer to the untrained language characterization model as a first pre-training model. For example, the first pre-training model herein may be a BERT model, or may be another language characterization model derived from the BERT model, which will not be limited herein. Wherein the first pre-training model may be used to determine a sample semantic vector for each sentence, respectively. The text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence.
The N sentences include sentence S_i; i is a positive integer less than or equal to N. Upon acquiring the first pre-training model, the computer device can input sentence S_i into the first pre-training model, and the first pre-training model may determine, based on the character representation of each character in sentence S_i, an initial sample vector corresponding to sentence S_i. The character representation of a character is jointly determined by the word representation, the segment representation and the character position representation corresponding to the character. Further, the computer device may perform encoding processing on sentence S_i through the first pre-training model and the initial sample vector corresponding to sentence S_i, so as to obtain the sample semantic vector corresponding to sentence S_i.
For ease of understanding, further, please refer to fig. 5, which is a schematic diagram of a model structure of a text pre-training model according to an embodiment of the present application. As shown in fig. 5, the text pre-training model 50m (i.e., the first pre-training model) in the embodiment of the present application is taken as a BERT model by way of example, to illustrate a specific implementation of the encoding process performed on each sentence by the first pre-training model. As shown in fig. 5, the BERT model is a Transformer-based deep bidirectional language characterization model, essentially a multi-layer bidirectional Encoder network constructed with the Transformer structure, and is a deep model based on self-attention. The two tasks in its model training are predicting the masked words in sentences and determining whether two input sentences are consecutive sentences.
As shown in FIG. 5, the overall framework of the BERT model is formed by stacking multiple Transformer encoder layers, e.g., N_X identical layers stacked together as shown in FIG. 5, where N_X is a positive integer. The encoder of each layer has two sub-layers: one may be a multi-head self-attention layer together with a normalization layer, and the other may be a simple fully-connected feed-forward network (i.e., a position-wise forward propagation layer) together with a normalization layer. The main function of each self-attention layer is to re-encode the target character (i.e., a certain character of sentence S_i) through the relatedness between the target character and all characters in sentence S_i. That is, each self-attention computation includes three steps: calculating the relatedness between any two characters, normalizing the relatedness, and performing a weighted summation of the relatedness with the encodings of all characters to obtain the encoding vector of the target character.
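The three steps of the self-attention computation can be written out directly; the following is a generic scaled dot-product attention sketch under the usual query/key/value assumption, not code taken from the patent.

import torch

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over the characters of one sentence."""
    d = queries.size(-1)
    # step 1: relatedness between any two characters
    scores = queries @ keys.transpose(-2, -1) / d ** 0.5
    # step 2: normalize the relatedness
    weights = torch.softmax(scores, dim=-1)
    # step 3: weighted summation with the encodings of all characters
    return weights @ values

# usage: 12 characters with hidden size 768
x = torch.randn(1, 12, 768)
out = self_attention(x, x, x)   # re-encoded character vectors, same shape as x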
Sentence S_i shown in FIG. 5 refers to the i-th sentence in the original text acquired by the computer device. Since sentence S_i includes K characters, for any one of the characters of sentence S_i (e.g., character j), the computer device needs to obtain the word representation (Token Embeddings) of character j, the segment representation (Segment Embeddings) of character j and the character position representation (Position Embeddings) of character j, and may then perform summation processing on these three different types of representations to obtain the character representation of character j. Here, j is a positive integer less than or equal to K.
The word representation can be understood as a vector representation of the character itself; for example, the computer device may divide characters into a limited set of common sub-word units, which strikes a compromise between the effectiveness and the flexibility of character representation. The segment representation is used to distinguish the vector representations of two sentences, i.e., the segment representation of each character in the same sentence is the same. The character position representation refers to encoding the position information of a character into a feature vector. The character position representation can take various forms, namely learned or fixed; for example, it can be obtained by training an initialized position vector through the BERT model, or can be constructed through a rule specified by the Transformer, which is not limited here.
Further, after the computer device obtains the character representations of the K characters (i.e., character representation E_1, character representation E_2, ... and character representation E_K), the computer device may refer to the vector formed by the K character representations as the initial sample vector of sentence S_i, which in turn can be used as the input feature of the BERT model. Through the N_X multi-head self-attention layers, normalization layers and position-wise forward propagation layers in the BERT model, each character in sentence S_i is encoded separately, so that the encoding vector of each character can be obtained, and the encoding vectors corresponding to the K characters can be determined as the sample semantic vector corresponding to sentence S_i.
By analogy, following the specific manner of acquiring the sample semantic vector corresponding to sentence S_i described above, the computer device can sequentially acquire the sample semantic vectors corresponding to the N sentences in the original text. Further, the computer device may determine the text semantic vector corresponding to the original text based on the N sample semantic vectors. For example, the text semantic vector may be obtained by averaging the N sample semantic vectors, or may be obtained by summing the N sample semantic vectors, which will not be limited here.
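A sketch of this step using the Hugging Face transformers library is shown below, assuming mean pooling over characters for the sentence vector and averaging over sentences for the text vector; the checkpoint name is an assumption.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")

def sample_semantic_vector(sentence):
    """Encode one sentence and pool its character encodings into a sample semantic vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # (1, K, 768) character encoding vectors
    return hidden.mean(dim=1).squeeze(0)                 # mean pooling over the K characters

def text_semantic_vector(sentences):
    """Average the N sample semantic vectors to obtain the text semantic vector."""
    return torch.stack([sample_semantic_vector(s) for s in sentences]).mean(dim=0)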
Step S103, selecting sentences conforming to the set number of abstracts from the positive sample set as a first abstract, and determining a first abstract semantic vector corresponding to the first abstract based on the sample semantic vectors corresponding to the sentences in the first abstract.
Specifically, the computer device may select sentences conforming to the set number of digests from the positive sample set as the first digest (i.e., the positive sample digest), and then the computer device may obtain sample semantic vectors of the sentences in the first digest from the N sample semantic vectors, and may further perform vector processing (for example, average processing or summation processing) on the obtained sample semantic vectors, so as to obtain first digest semantic vectors corresponding to the first digest.
It can be appreciated that when the number of digests is 3, if the number of sentences in the positive sample set is less than or equal to the number of digests, the computer device may directly take all sentences in the positive sample set as the first digests. For example, if the positive sample set includes 2 sentences, the computer device may obtain a positive sample digest from the positive sample set, i.e., the positive sample digest is made up of 2 sentences in the positive sample set.
Optionally, if the number of sentences in the positive sample set is greater than the set number of abstracts, the computer device may randomly extract 3 sentences from the positive sample set as the first abstract. For example, as shown in FIG. 2, the positive sample set 2p includes 4 sentences, namely sentence S_1, sentence S_2, sentence S_4 and sentence S_7. Therefore, the computer device can extract the 3 sentences S_1, S_2 and S_4 from the positive sample set 2p as abstract 21Z_1, can also extract the 3 sentences S_1, S_2 and S_7 from the positive sample set 2p as abstract 21Z_2, and can also extract the 3 sentences S_2, S_4 and S_7 from the positive sample set 2p as abstract 21Z_3. Based on this, the first abstracts finally extracted by the computer device from the positive sample set may include 3 abstracts, i.e., abstract 21Z_1, abstract 21Z_2 and abstract 21Z_3.
Step S104, selecting sentences conforming to the set number of abstracts from the negative sample set as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract based on the sample semantic vectors corresponding to the sentences in the second abstract.
Specifically, the computer device may select sentences conforming to the set number of digests from the negative sample set as the second digest (i.e., the negative sample digest), and then the computer device may obtain sample semantic vectors of the sentences in the second digest from the N sample semantic vectors, and may further perform vector processing (for example, average processing or summation processing) on the obtained sample semantic vectors, so as to obtain second digest semantic vectors corresponding to the second digest. It can be understood that the manner in which the computer device obtains the summary semantic vectors (i.e., the first summary semantic vector and the second summary semantic vector) is the same as the manner in which the text semantic vector is obtained, for example, if the text semantic vector is obtained by averaging N sample semantic vectors, then the first summary semantic vector is obtained by averaging the sample semantic vectors of the sentences in the first summary, and the second summary semantic vector is also obtained by averaging the sample semantic vectors of the sentences in the second summary. In other words, both the first summary semantic vector and the second summary semantic vector may be average summary semantic vectors; the average abstract semantic vector is obtained by averaging sample semantic vectors of each sentence in a sentence set, and the sentence set may include a first abstract and a second abstract.
It can be appreciated that, when the set number of abstracts is 3, if the number of sentences in the negative sample set is less than or equal to the set number of abstracts, the computer device may directly use all sentences in the negative sample set as the second abstract. As shown in fig. 2, since the negative sample set 2n may include 3 sentences, namely sentence S_3, sentence S_5 and sentence S_6, the computer device directly takes all sentences in the negative sample set 2n as the second abstract (i.e., abstract 22Z_1). Alternatively, if the number of sentences in the negative sample set is greater than the set number of abstracts, the computer device may randomly extract 3 sentences from the negative sample set as the second abstract. For the implementation of the computer device extracting the second abstract from the negative sample set, reference may be made to the implementation of extracting the first abstract from the positive sample set, which is not limited here.
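A minimal sketch of this abstract sampling and vector-averaging step, under the same mean-pooling assumption as above, could look like the following; the helper names are hypothetical.

import random
import torch

def sample_abstract(sample_set, abstract_size=3):
    """Take all sentences if the set is small, otherwise randomly draw abstract_size sentences."""
    if len(sample_set) <= abstract_size:
        return list(sample_set)
    return random.sample(sample_set, abstract_size)

def abstract_semantic_vector(abstract_sentences, sample_vectors):
    """Average the sample semantic vectors of the sentences in one abstract."""
    # sample_vectors: dict mapping each sentence to its sample semantic vector
    return torch.stack([sample_vectors[s] for s in abstract_sentences]).mean(dim=0)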
Step S105, based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector, performing contrast learning on the first pre-training model to obtain a second pre-training model.
Specifically, the computer device may obtain a model loss function for comparison learning, and may further determine model loss corresponding to the model loss function based on the first abstract semantic vector, the second abstract semantic vector, and the text semantic vector. Then, the computer device may train the first pre-training model based on model loss to obtain a model training result, and may further obtain a second pre-training model based on the model training result. Wherein the second pre-training model may be used to process the extracted summary task.
The model loss function obtained by the computer device may be expressed as the following formula (2):
where L_i denotes the sub-loss corresponding to the i-th first abstract extracted from the positive sample set; i is a positive integer less than or equal to X, and X is the number of first abstracts extracted from the positive sample set by the computer device; the text semantic vector is used to represent the original text; Z_i denotes the first abstract semantic vector of the i-th first abstract (i.e., abstract i); K denotes the number of second abstracts extracted from the negative sample set by the computer device; Z_j denotes the second abstract semantic vector corresponding to the j-th second abstract (i.e., abstract j) extracted from the negative sample set; τ denotes the temperature coefficient used to control the degree to which the first pre-training model discriminates between positive and negative samples.
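The formula itself is not reproduced in this text. A plausible reconstruction consistent with the symbol descriptions above is the standard InfoNCE-style contrastive loss below, where the symbol E_D for the text semantic vector and the similarity function \mathrm{sim}(\cdot,\cdot) (e.g., cosine similarity) are assumptions:

L_i = -\log \frac{\exp\big(\mathrm{sim}(E_D, Z_i)/\tau\big)}{\exp\big(\mathrm{sim}(E_D, Z_i)/\tau\big) + \sum_{j=1}^{K} \exp\big(\mathrm{sim}(E_D, Z_j)/\tau\big)}

The overall model loss would then plausibly be the sum (or average) of the X sub-losses, L = \sum_{i=1}^{X} L_i.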
The computer device may obtain a model convergence condition associated with the first pre-training model. The model convergence condition may be that model training is stopped when the model loss has not continued to drop for a certain number of rounds (e.g., 10 rounds). Alternatively, the model convergence condition may be that model training is stopped when the model loss is smaller than a loss threshold in the model convergence condition. This will not be limited here.
It may be appreciated that if the model training result indicates that the trained first pre-training model meets the model convergence condition, the computer device may directly take the first pre-training model meeting the model convergence condition as the second pre-training model. Optionally, if the model training result indicates that the trained first pre-training model does not meet the model convergence condition, the computer device may adjust model parameters of the first pre-training model based on a model loss function that does not meet the model convergence condition, and further may train the transition model by using the first pre-training model after adjusting the model parameters as the transition model, until the trained transition model meets the model convergence condition, and use the transition model meeting the model convergence condition as the second pre-training model.
For example, the computer device may use the transition model to re-encode the N sentences, and take the newly obtained encoding vectors corresponding to each sentence, i.e., the N new encoding vectors, as the new sample semantic vectors corresponding to the N sentences. Based on the N new sample semantic vectors, the computer device may then re-determine the first abstract semantic vector corresponding to the first abstract (i.e., a first update abstract vector), the second abstract semantic vector corresponding to the second abstract (i.e., a second update abstract vector) and the text semantic vector corresponding to the original text (i.e., a text update vector). The computer device may then re-determine the model loss of the transition model based on the above formula (2), the first update abstract vector, the second update abstract vector and the text update vector, so as to train the transition model until the trained transition model satisfies the model convergence condition, and take the transition model satisfying the model convergence condition as the second pre-training model.
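Putting the pieces together, a training loop for this contrastive pre-training stage might look like the sketch below; the loss follows the InfoNCE reconstruction given after formula (2), and the optimizer choice, learning rate and patience value are assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(text_vec, pos_vecs, neg_vecs, tau=0.1):
    """InfoNCE-style loss: pull positive abstract vectors toward the text vector, push negatives away."""
    pos_sim = F.cosine_similarity(text_vec.unsqueeze(0), torch.stack(pos_vecs)) / tau   # (X,)
    neg_sim = F.cosine_similarity(text_vec.unsqueeze(0), torch.stack(neg_vecs)) / tau   # (K,)
    neg_term = torch.exp(neg_sim).sum()
    losses = -torch.log(torch.exp(pos_sim) / (torch.exp(pos_sim) + neg_term))
    return losses.sum()

def pretrain(encoder, encode_batch, batches, patience=10, lr=2e-5):
    """encode_batch(encoder, batch) -> (text_vec, pos_vecs, neg_vecs), recomputed with the current encoder."""
    optimizer = torch.optim.AdamW(encoder.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for batch in batches:
        text_vec, pos_vecs, neg_vecs = encode_batch(encoder, batch)
        loss = contrastive_loss(text_vec, pos_vecs, neg_vecs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < best:
            best, stale = loss.item(), 0
        else:
            stale += 1
        if stale >= patience:   # model loss has not continued to drop for `patience` rounds
            break
    return encoder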
In the embodiment of the application, the computer equipment can divide N sentences into the positive and negative sample sets through the influence degree of each sentence in the original text so as to facilitate the subsequent extraction of the first abstract and the second abstract from the positive and negative sample sets respectively. And then when determining the abstract semantic vectors and the text semantic vectors of the two types of abstracts, training a first pre-training model by adopting a contrast learning method so as to learn general features of the data set by enabling the first pre-training model to learn which data points are similar or different under the condition of no label, thereby improving the accuracy of text representation and further improving the accuracy of abstract generation when the extracted abstract task is processed later.
Further, referring to fig. 6, fig. 6 is a flow chart of a data processing method according to an embodiment of the application. The method may be performed by a terminal device having a model training function (for example, any one of the terminal devices in the terminal device cluster shown in fig. 1, for example, the terminal device 100 b), may be performed by a server having a model training function (for example, the server 10F shown in fig. 1), or may be performed interactively by a terminal device having a model application function and a server having a model training function, which is not limited herein. For easy understanding, the embodiment of the present application is illustrated by taking a computer device, such as a server, and the method may at least include the following steps S201 to S207:
In step S201, when an original text including N sentences is obtained, the N sentences are respectively divided based on the influence degree and the influence degree threshold value of each sentence, so as to obtain a positive sample set and a negative sample set corresponding to the original text.
Specifically, the computer device may acquire the influence degree (i.e., sentence importance score) of each sentence, respectively, when acquiring the original text including N sentences. The influence of a sentence may be manually evaluated by an evaluation object (e.g., a user with a specialized experience) for the sentence, or may be intelligently evaluated by the computer device for the sentence based on a text summarization model (i.e., a first text summarization model), which will not be limited herein. The influence degree of a sentence is used to indicate the probability that the sentence constitutes a summary, i.e. the larger the probability, the more important the current sentence is, the more likely it is to be extracted. Further, the computer device may divide the N sentences based on the influence degree of each sentence and the influence degree threshold, so as to obtain a positive sample set and a negative sample set corresponding to the original text.
Step S202, a first pre-training model is acquired.
Specifically, after dividing N sentences into positive and negative sample sets, the computer device may obtain a language characterization model for encoding text data, and may further refer to the untrained language characterization model as a first pre-training model. For example, the first pre-training model herein may be a BERT model, or may be another language characterization model derived from the BERT model, which will not be limited herein. Wherein the first pre-training model may be used to determine a sample semantic vector for each sentence, respectively. The text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence.
Step S203, selecting sentences corresponding to the set number of summaries from the positive sample set as the first summary, and determining the first summary semantic vector corresponding to the first summary based on the sample semantic vectors corresponding to the sentences in the first summary.
Specifically, the computer device may select sentences conforming to the set number of digests from the positive sample set as the first digest (i.e., the positive sample digest), and then the computer device may obtain sample semantic vectors of the sentences in the first digest from the N sample semantic vectors, and may further perform vector processing (for example, average processing or summation processing) on the obtained sample semantic vectors, so as to obtain first digest semantic vectors corresponding to the first digest.
Step S204, selecting sentences conforming to the set number of abstracts from the negative sample set as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract based on the sample semantic vectors corresponding to the sentences in the second abstract.
Specifically, the computer device may select sentences conforming to the set number of digests from the negative sample set as the second digest (i.e., the negative sample digest), and then the computer device may obtain sample semantic vectors of the sentences in the second digest from the N sample semantic vectors, and may further perform vector processing (for example, average processing or summation processing) on the obtained sample semantic vectors, so as to obtain second digest semantic vectors corresponding to the second digest. It can be understood that the manner in which the computer device obtains the summary semantic vectors (i.e., the first summary semantic vector and the second summary semantic vector) is the same as the manner in which the text semantic vector is obtained, for example, if the text semantic vector is obtained by averaging N sample semantic vectors, then the first summary semantic vector is obtained by averaging the sample semantic vectors of the sentences in the first summary, and the second summary semantic vector is also obtained by averaging the sample semantic vectors of the sentences in the second summary. In other words, both the first summary semantic vector and the second summary semantic vector may be average summary semantic vectors; the average abstract semantic vector is obtained by averaging sample semantic vectors of each sentence in a sentence set, and the sentence set may include a first abstract and a second abstract.
Step S205, based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector, performing contrast learning on the first pre-training model to obtain a second pre-training model.
Specifically, the computer device may obtain a model loss function for comparison learning, and may further determine model loss corresponding to the model loss function based on the first abstract semantic vector, the second abstract semantic vector, and the text semantic vector. Then, the computer device may train the first pre-training model based on model loss to obtain a model training result, and may further obtain a second pre-training model based on the model training result.
The specific implementation of the steps S201 to S205 may be referred to the description of the steps S101 to S105 in the embodiment corresponding to fig. 3, and will not be repeated here.
Step S206, constructing an initial abstract model for processing the extracted abstract task based on the second pre-training model.
Specifically, the computer device may build other network layers on the basis of the second pre-training model, so as to build an initial abstract model for processing the extractive abstract task. For example, the computer device may directly add a network layer (e.g., a decoding layer) for predicting the influence degree of sentences after the second pre-training model, to obtain one type of initial abstract model. For another example, in order to characterize sentence vectors effectively, the computer device may add a further second pre-training model followed by a network layer (e.g., a decoding layer) for predicting sentence influence, to obtain another type of initial abstract model. That is, the model structure of the initial abstract model here may include one or more second pre-training models, and the embodiment of the present application does not specifically limit the model structure of the initial abstract model.
Step S207, a sample text for the initial abstract model is acquired, and the initial abstract model is trained based on the sample text to obtain a second text abstract model.
It can be appreciated that extractive text summarization can be divided into two main categories: supervised and unsupervised (self-supervised) extractive text summarization. Thus, when training the initial abstract model, two types of training may be involved: supervised training and unsupervised training.
For supervised training, the computer device may quickly train to obtain a second text summarization model for summarization prediction based on a small amount of annotation data. For example, when the computer device acquires the sample text, the sample label corresponding to the sample text can be acquired, where the sample label can be used to indicate an actual abstract of the sample text. Further, the computer equipment can input the sample text into an initial abstract model, and abstract prediction is carried out on the sample text through the initial abstract model to obtain a predicted abstract corresponding to the sample text. The computer device may then fine tune the initial abstract model based on the actual abstract and the predicted abstract to obtain a second text abstract model.
It can be understood that, when the sample text is input into the initial abstract model, the computer device may perform encoding processing on each sentence in the sample text based on the second pre-training model in the initial abstract model, to obtain the sample semantic vector corresponding to each sentence. The computer device may then perform linear conversion on each sample semantic vector based on the other network layers in the initial abstract model, to obtain the influence degree corresponding to each sentence, and may further perform descending-order sorting on the sentences based on their influence degrees, so as to sequentially obtain sentences conforming to the set number of abstracts (for example, M is 3) from the sorting result as the predicted abstract, that is, obtain the top 3 sentences as the predicted abstract.
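A sketch of this supervised fine-tuning step is given below, assuming the initial abstract model exposes the second pre-training model plus a linear decoding layer, and that the sample labels mark which sentences belong to the actual abstract; all names are illustrative.

import torch
import torch.nn as nn

class InitialAbstractModel(nn.Module):
    """Second pre-training model followed by a decoding layer that predicts each sentence's influence."""
    def __init__(self, pretrained_encoder, hidden=768):
        super().__init__()
        self.encoder = pretrained_encoder          # the second pre-training model (e.g., a BERT encoder)
        self.decoding_layer = nn.Linear(hidden, 1)

    def forward(self, char_ids, attention_mask):
        # one row per sentence; mean-pool character encodings into sample semantic vectors
        sent_vecs = self.encoder(char_ids, attention_mask=attention_mask).last_hidden_state.mean(dim=1)
        return torch.sigmoid(self.decoding_layer(sent_vecs)).squeeze(-1)   # influence per sentence

def fine_tune_step(model, optimizer, char_ids, attention_mask, labels, abstract_size=3):
    """One supervised step: score sentences, compare with the actual abstract, return the top-M prediction."""
    scores = model(char_ids, attention_mask)
    loss = nn.functional.binary_cross_entropy(scores, labels.float())   # labels: 1 if in the actual abstract
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    predicted = torch.topk(scores, k=min(abstract_size, scores.numel())).indices   # predicted abstract indices
    return loss.item(), predicted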
For unsupervised training, the embodiment of the application can train in a clustering process or a topological graph construction mode. This may be illustrated by way of example in a clustering process. For example, the number of digests set here may be M, where M is a positive integer. When the computer equipment obtains a sample text aiming at an initial abstract model, X sentences in the sample text can be respectively encoded through the initial abstract model to obtain X semantic encoding vectors, wherein X is a positive integer. The computer device may then perform a clustering process on the X semantic code vectors based on the M initial cluster centers to obtain M clusters. Further, the computer device may train the initial abstract model based on the M clusters to obtain a second text abstract model for predicting a text abstract of the business text.
For example, when obtaining the M clusters, the computer device may use a K-means clustering algorithm. The computer device may randomly define M initial cluster centers, take each of the X semantic encoding vectors as a vector to be clustered, determine the distances between the vector to be clustered and the M initial cluster centers, and assign the vector to be clustered to the initial cluster center with the minimum distance. After the assignment of the X semantic encoding vectors is completed, M initial clusters can be obtained, and the M initial cluster centers are then updated based on the centers of the M initial clusters, until the updated cluster center positions no longer change significantly, i.e., until the convergence condition of the clustering algorithm is satisfied, at which point the M clusters are obtained.
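The K-means procedure described above can be sketched with scikit-learn as follows; using the sentence closest to each cluster center as a summary candidate is an assumption about how the M clusters are turned into an abstract.

import numpy as np
from sklearn.cluster import KMeans

def cluster_summary(semantic_vectors, sentences, m=3):
    """Cluster the X semantic encoding vectors into M clusters and pick one sentence per cluster."""
    vectors = np.stack(semantic_vectors)                 # (X, hidden)
    kmeans = KMeans(n_clusters=m, n_init=10).fit(vectors)
    summary = []
    for center in kmeans.cluster_centers_:
        idx = int(np.argmin(np.linalg.norm(vectors - center, axis=1)))   # closest vector to the center
        summary.append(sentences[idx])
    return summary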
It will be appreciated that the second text summarization model in the embodiments of the present application may be applied in a media production scenario, for example, in a media intelligence center (i.e., a media AI center), where the media intelligence center may store a large amount of media data, where the media data may include video clips and speech recognition (Automatic Speech Recognition, abbreviated ASR) text corresponding to the video clips, but most video clips lack self-contained titles and clip summaries, and such information is essential in video generation and cataloging.
Therefore, after training to obtain the second text abstract model, the computer equipment can obtain the voice recognition text needing abstract prediction from the media intelligent center as a service text, further can input the service text into the second text abstract model, and can predict the influence degree of each sentence in the service text through the second text abstract model, so as to obtain the influence degree corresponding to each sentence. Further, the computer device may perform a descending order sorting process on each sentence based on the influence degree corresponding to each sentence to sequentially obtain sentences conforming to the set number of digests (for example, M is 3) from the sorting result as the text digests of the business text. Of course, the computer device may also obtain the sentence with the highest influence in the ranking result, and further determine the text heading of the business text based on the sentence with the highest influence. For example, the computer device may directly take the sentence with the highest influence degree as the text title of the service text, and for example, the computer device may directly perform keyword extraction on the sentence with the highest influence degree to splice to obtain the text title of the service text, where the determination manner of the text title of the service text is not limited.
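At inference time the flow above reduces to scoring, sorting and slicing; the sketch below assumes the second text abstract model returns one influence degree per sentence, which is an assumption about its interface.

def predict_abstract_and_title(model, sentences, abstract_size=3):
    """Rank sentences of a business text by influence; top-M form the abstract, top-1 the title candidate."""
    influences = model(sentences)                                    # assumed: one score per sentence
    ranked = sorted(zip(sentences, influences), key=lambda p: p[1], reverse=True)
    abstract = [s for s, _ in ranked[:abstract_size]]
    title = ranked[0][0]                                             # sentence with the highest influence degree
    return abstract, title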
Based on this, the second text summarization model in the embodiment of the present application may be used to predict a text summary of a business text, and may also be used to predict a text title of a business text, which will not be limited herein.
For ease of understanding, further, please refer to fig. 7, fig. 7 is a schematic diagram of a scenario associated with content production provided by an embodiment of the present application. As shown in fig. 7, the computer device in the embodiment of the present application may be a server 70F, and the server 70F may be the server 10F shown in fig. 1. The terminal device 700c in the embodiment of the present application may be a terminal device used by a content production object (i.e., a content production device), and the terminal device 700c may be any one of the terminal device clusters shown in fig. 1 described above, for example, the terminal device 100c, which will not be limited herein.
It should be appreciated that, when media data lacking a title and summary is displayed on the terminal interface of the terminal device 700c, the content production object may perform a triggering operation on the media data, so that, in response to the triggering operation, the terminal device 700c generates a content production request for the speech recognition text of the media data (e.g., the service text 70D shown in fig. 7), which may in turn be sent to the server 70F. The triggering operation may include non-contact operations such as voice and gesture, or contact operations such as clicking and long pressing, which will not be limited here.
Upon receiving the content production request, the server 70F may obtain the service text 70D carried by the content production request. At the same time, the server 70F needs to invoke the text abstract model 70m, which is a network model constructed based on the second pre-training model for processing the extractive abstract task. Further, the server 70F may input the service text 70D into the text abstract model 70m, and may predict the influence degree of each sentence in the service text 70D through the text abstract model 70m, so as to obtain the influence degree corresponding to each sentence. The server 70F may then perform descending-order sorting on the sentences based on their influence degrees, so as to sequentially obtain sentences conforming to the set number of abstracts (for example, M is 3) from the sorting result as the text summary of the service text 70D, that is, obtain the top 3 sentences as the text summary (for example, the text summary 70D shown in fig. 7). Meanwhile, the server 70F may further obtain the sentence with the highest influence degree in the sorting result, and then directly determine that sentence as the text title of the service text 70D (for example, the text heading 70T shown in fig. 7).
Further, the server 70F may determine content production data for the service text 70D based on the text summary 70D, the text heading 70T, and store it. For example, the content production data may include, in addition to text summary 70D and text heading 70T, a clip type, a summary tag, a core event (i.e., including time, place, etc.), and an acquisition object (e.g., news reporter, etc.) of a video clip of business text 70D, which will not be illustrated herein. Then, the server 70F may transmit the content production data to the terminal device 700c to cause the terminal device 700c to display it.
Therefore, after the computer device obtains the second pre-training model through the contrastive-learning-based extractive text abstract pre-training method, the second pre-training model can be applied to the extractive text abstract task, so as to automatically extract intelligent titles and segment abstracts from the speech recognition (Automatic Speech Recognition, ASR) texts of video segments, thereby improving the efficiency and accuracy of content production.
On the other hand, as important text modal data, the title and the segment abstract can help the multi-modal search engine to understand the content of the video segment so as to effectively optimize the search effect, thereby improving the richness and accuracy of data recommendation.
For example, the computer device may determine query text in the service query request when the service query request is obtained, where the service query request is sent by a service terminal device corresponding to a service object (e.g., a query user). At this time, the computer device may obtain text summaries corresponding to the H service texts, where H is a positive integer. Wherein, a text abstract is obtained by the computer device performing abstract prediction on a business text by calling a second text abstract model. Further, the computer device may determine the text similarity between the query text and the H text summaries, so as to obtain a service text corresponding to the text summary with the highest text similarity, and determine service data for returning to the service terminal device based on the obtained service text, so that the service terminal device displays the service data.
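A sketch of this retrieval step, assuming both the query text and the H text summaries are embedded with the same encoder and compared by cosine similarity (an assumption; the patent does not fix the similarity measure):

import torch
import torch.nn.functional as F

def best_matching_text(query_vec, summary_vecs, business_texts):
    """Return the business text whose summary is most similar to the query text."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), torch.stack(summary_vecs))   # (H,)
    return business_texts[int(torch.argmax(sims))]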
The service data herein may be a service text corresponding to the summary with the highest text similarity (i.e. a matching text corresponding to the query text), or may also include data of other modes corresponding to the matching text (e.g. a video clip corresponding to the matching text, an audio clip corresponding to the matching text, a book corresponding to the matching text, picture data corresponding to the matching text, etc.), where the service data returned to the service terminal device will not be limited.
Therefore, the second text abstract model in the embodiment of the application plays an important role in the content production scene and the data recommendation scene, and because the second text abstract model is constructed through the second pre-training model, the second pre-training model can be applied to scenes such as text classification, machine translation and the like, and can also be applied to scenes such as abstract prediction, title prediction and the like, and the application range of the second text pre-training model is greatly expanded.
For ease of understanding, further, please refer to fig. 8, fig. 8 is a schematic flow chart of summary prediction according to an embodiment of the present application. As shown in fig. 8, when training the first pre-training model by using contrast learning, the computer device in the embodiment of the present application involves two models, one is a supervised extraction type text abstract model (text abstract model 80m shown in fig. 8, for example, hiBERT model) for predicting the influence of sentences, and the other is a pre-training model (pre-training model 81m shown in fig. 8, for example, BERT model) for performing encoding processing on sentences. The flow in the embodiment of the application mainly comprises the following four steps:
The first step, namely, outputting the influence degrees (i.e. importance scores) corresponding to the N sentences in the original text 80D through the supervised extraction type text abstract model (Supervised Extractive Summarization Model, i.e. the first text abstract model), and sorting (Sort Sentences by Salient Scores) according to the N influence degrees to obtain an influence degree sorting result. Wherein N is a positive integer.
In the second step, N sentences are respectively divided by the influence degree sequencing result and the influence degree threshold value to determine positive and negative sample sentences (Salient Sentences & Irrelevant Sentences, positive Summaries & Negative Summaries), and then after the division is completed, the positive sample set 8p and the negative sample set 8N shown in fig. 8 are obtained.
Third, the first pre-training model (Contrastive Pre-training Model) is trained through the contrastive learning task to obtain the pre-training model 81m (i.e., the second pre-training model). For example, when the untrained pre-training model 81m (i.e., the first pre-training model) is obtained, the computer device may determine a sample semantic vector for each sentence. Meanwhile, the computer device may select sentences conforming to the set number of abstracts from the positive sample set 8p as the first abstract (for example, abstract 81Z_1 and abstract 81Z_2 shown in fig. 8), and may further determine the first abstract semantic vector corresponding to the first abstract based on the sample semantic vectors corresponding to the sentences in the first abstract. Similarly, the computer device may select sentences conforming to the set number of abstracts from the negative sample set 8n as the second abstract (for example, abstract 82Z shown in fig. 8), and may further determine the second abstract semantic vector corresponding to the second abstract based on the sample semantic vectors corresponding to the sentences in the second abstract. Both the first abstract semantic vector and the second abstract semantic vector are average abstract semantic vectors, obtained by averaging the sample semantic vectors of each sentence in a sentence set; the sentence set can include the first abstract and the second abstract. At the same time, the computer device may also average the sample semantic vectors of the N sentences to obtain the text semantic vector corresponding to the original text 80D (e.g., the text semantic vector 80E shown in fig. 8). Further, the computer device may perform contrastive learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector 80E, to obtain the second pre-training model.
A fourth step of applying a pre-training model 81m to the downstream task, for example, by which an initial digest model for processing the extraction-type digest task is constructed to train it, so that a text digest model (i.e., a second text digest model) can be obtained. The second text summarization model may be applied in a content production scenario and a data recommendation scenario.
In the embodiment of the present application, since the second pre-training model is specially trained for the extractive abstract task, each sentence in the text data for which an abstract is to be generated (i.e., the business text) can be accurately represented by the second pre-training model when the extractive abstract task is processed. As a result, when the abstract is subsequently predicted based on the semantic vector of each sentence, the probability that each sentence is extracted into the abstract (i.e., the influence degree of each sentence) can be known more accurately, and the text abstract used to represent the business text (i.e., the text data for which the abstract is to be generated) can be generated more accurately, thereby improving the accuracy of abstract generation. In addition, when data recommendation is performed for the business object based on the text abstract in a multi-modal retrieval system, the accuracy and richness of data recommendation can be improved.
Further, referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 9, the data processing apparatus 1 may include: the system comprises a sentence dividing module 11, a model obtaining module 12, a first abstract extracting module 13, a second abstract extracting module 14, a contrast learning module 15, a sentence input module 16, an initial sample vector determining module 17, a sample semantic vector determining module 18, a first constructing module 19, a first sample obtaining module 20, a prediction abstract determining module 21, a first training module 22, a second constructing module 23, a second sample obtaining module 24, a clustering module 25, a second training module 26, a query text determining module 27, a text abstract obtaining module 28, a similarity determining module 29 and a business data determining module 30.
The sentence dividing module 11 is configured to, when an original text including N sentences is obtained, divide the N sentences based on an influence degree and an influence degree threshold of each sentence, so as to obtain a positive sample set and a negative sample set corresponding to the original text.
Wherein the sentence dividing module 11 includes: a model acquisition unit 111, a digest prediction unit 112, and a sentence dividing unit 113.
The model obtaining unit 111 is configured to obtain a first text abstract model when obtaining an original text including N sentences;
the abstract prediction unit 112 is configured to input N sentences into a first text abstract model, and perform abstract prediction on each sentence through the first text abstract model to obtain an influence degree of each sentence.
Wherein the digest prediction unit 112 includes: a first token acquisition subunit 1121, a second token acquisition subunit 1122, and a decoding processing subunit 1123.
The first representation obtaining subunit 1121 is configured to input N sentences to a first text abstract model, obtain first sentence representations corresponding to each sentence respectively, and obtain N first sentence representations; the first text summarization model includes a text encoder and a text decoder;
the second token obtaining subunit 1122 is configured to encode each sentence through the text encoder and the N first sentence tokens, to obtain a second sentence token corresponding to each sentence.
The N sentences include sentence S_i; i is a positive integer less than or equal to N; the N first sentence representations include the first sentence representation E_i corresponding to sentence S_i; the text encoder includes a first text encoder and a second text encoder.
Wherein the second token acquisition subunit 1122 is further specifically configured to:
perform a first encoding process on sentence S_i through the first text encoder and the first sentence representation E_i, to obtain a first encoding vector corresponding to sentence S_i;

when the first encoding vectors respectively corresponding to the N sentences are obtained, input the N first encoding vectors into the second text encoder, and perform a second encoding process on sentence S_i through the second text encoder and the N first encoding vectors, to obtain a second encoding vector corresponding to sentence S_i;

take the second encoding vector corresponding to sentence S_i as the second sentence representation D_i corresponding to sentence S_i.
The decoding processing subunit 1123 is configured to input N second sentence representations to the text decoder, and decode the N sentences respectively through the text decoder and the N second sentence representations to obtain the influence of each sentence.
The specific implementation manner of the first token acquiring subunit 1121, the second token acquiring subunit 1122, and the decoding processing subunit 1123 may be referred to the description of the influence degree of each sentence in the embodiment corresponding to fig. 4, and will not be further described herein.
The sentence dividing unit 113 is configured to divide the N sentences based on the influence degree and the influence degree threshold of each sentence, so as to obtain a positive sample set and a negative sample set corresponding to the original text.
Wherein the influence threshold comprises a first threshold;
the sentence dividing unit 113 includes: a first traversal subunit 1131, a first add subunit 1132, a second add subunit 1133, a second traversal subunit 1134, a third add subunit 1135, a filter subunit 1136, and a fourth add subunit 1137.
The first traversing subunit 1131 is configured to traverse the N sentences, and determine the traversed sentences as sentences to be divided;
the first adding subunit 1132 is configured to add the sentence to be divided to the positive sample set corresponding to the original text if the influence degree of the sentence to be divided is greater than or equal to a first threshold;
the second adding subunit 1133 is configured to add the sentence to be divided to the negative sample set corresponding to the original text if the influence degree of the sentence to be divided is less than the first threshold.
Wherein the influence threshold comprises a second threshold and a third threshold; the second threshold is greater than the third threshold;
the second traversing subunit 1134 is configured to traverse the N sentences, and determine the traversed sentences as sentences to be divided;
the third adding subunit 1135 is configured to add the sentence to be divided to the positive sample set corresponding to the original text if the influence degree of the sentence to be divided is greater than or equal to the second threshold;
The filtering subunit 1136 is configured to filter the sentences to be divided if the influence degree of the sentences to be divided is less than the second threshold and greater than the third threshold;
the fourth adding subunit 1137 is configured to add the sentence to be divided to the negative sample set corresponding to the original text if the influence degree of the sentence to be divided is less than or equal to the third threshold.
The specific implementation manner of the first traversing subunit 1131, the first adding subunit 1132, the second adding subunit 1133, the second traversing subunit 1134, the third adding subunit 1135, the filtering subunit 1136 and the fourth adding subunit 1137 may refer to the description of positive and negative sample division of N sentences in the embodiment corresponding to fig. 3, which will not be further described herein.
The specific implementation manner of the model obtaining unit 111, the abstract predicting unit 112 and the sentence dividing unit 113 may refer to the description of step S101 in the embodiment corresponding to fig. 3, and the detailed description will not be repeated here.
The model acquisition module 12 is configured to acquire a first pre-training model; the first pre-training model is used for determining sample semantic vectors corresponding to each sentence respectively; the text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence respectively;
The first abstract extracting module 13 is configured to select sentences corresponding to the abstract setting number from the positive sample set as a first abstract, and determine a first abstract semantic vector corresponding to the first abstract based on sample semantic vectors corresponding to the sentences in the first abstract;
the second abstract extracting module 14 is configured to select sentences corresponding to the abstract setting number from the negative sample set as a second abstract, and determine a second abstract semantic vector corresponding to the second abstract based on sample semantic vectors corresponding to sentences in the second abstract;
the contrast learning module 15 is configured to perform contrast learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector, and the text semantic vector, so as to obtain a second pre-training model; the second pre-training model is used for processing the extracted abstract task.
Wherein, this contrast learning module 15 includes: a loss function acquisition unit 151, a loss determination unit 152, a training result determination unit 153, a first model determination unit 154, a parameter adjustment unit 155, and a second model determination unit 156.
The loss function obtaining unit 151 is configured to obtain a model loss function for comparison learning;
the loss determining unit 152 is configured to determine a model loss corresponding to the model loss function based on the first abstract semantic vector, the second abstract semantic vector, and the text semantic vector;
The training result determining unit 153 is configured to train the first pre-training model based on model loss to obtain a model training result;
the first model determining unit 154 is configured to take the first pre-training model that satisfies the model convergence condition as the second pre-training model if the model training result indicates that the trained first pre-training model satisfies the model convergence condition.
The parameter adjustment unit 155 is configured to adjust model parameters of the first pre-training model based on a model loss function that does not meet the model convergence condition if the model training result indicates that the trained first pre-training model does not meet the model convergence condition;
the second model determining unit 156 is configured to train the transition model by using the first pre-trained model after the model parameters are adjusted as a transition model, and use the transition model satisfying the model convergence condition as a second pre-trained model until the trained transition model satisfies the model convergence condition.
The specific implementation manner of the loss function obtaining unit 151, the loss determining unit 152, the training result determining unit 153, the first model determining unit 154, the parameter adjusting unit 155 and the second model determining unit 156 may be referred to the description of step S105 in the embodiment corresponding to fig. 3, and the detailed description will not be repeated here.
Wherein, the N sentences include a sentence S_i; i is a positive integer less than or equal to N;
the sentence input module 16 is configured to input the sentence S_i into the first pre-training model;
the initial sample vector determination module 17 is configured to determine, based on the character representation of each character in the sentence S_i, an initial sample vector corresponding to the sentence S_i; the character representation of a character is jointly determined by the word representation, the segment representation and the character position representation corresponding to that character;
the sample semantic vector determination module 18 is configured to encode the sentence S_i through the first pre-training model and the initial sample vector corresponding to the sentence S_i, to obtain the sample semantic vector corresponding to the sentence S_i.
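A minimal sketch of modules 16 through 18 is given below, assuming a BERT-style encoder; the class name, vocabulary size and layer counts are illustrative assumptions and are not taken from the application.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, max_len=512, num_segments=2):
        super().__init__()
        # Word (token), segment and character-position representations.
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(num_segments, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        # Character representation = word + segment + position representation,
        # summed per character to form the initial sample vector of sentence S_i.
        initial_sample_vector = self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
        hidden_states = self.encoder(initial_sample_vector)
        # Take the first position as the sample semantic vector of the sentence.
        return hidden_states[:, 0]
```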
The first abstract semantic vector and the second abstract semantic vector are both average abstract semantic vectors; an average abstract semantic vector is obtained by averaging the sample semantic vectors of each sentence in the sentence set; the sentence set includes the first abstract and the second abstract.
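Assuming the sentence vectors are produced as sketched above, the averaging itself is a single pooling step, shown below with hypothetical names.

```python
import torch

def average_abstract_vector(selected_sentence_vectors):
    # Average the sample semantic vectors of the sentences chosen into a summary;
    # the same routine yields the first and the second abstract semantic vector.
    return torch.stack(selected_sentence_vectors, dim=0).mean(dim=0)
```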
The first building module 19 is configured to build an initial abstract model for processing the extracted abstract task based on the second pre-training model;
the first sample acquiring module 20 is configured to acquire a sample text for an initial abstract model and a sample label corresponding to the sample text; the sample tag is used for indicating the actual abstract of the sample text;
The prediction summary determining module 21 is configured to input a sample text into an initial summary model, and perform summary prediction on the sample text through the initial summary model to obtain a prediction summary corresponding to the sample text;
the first training module 22 is configured to perform fine-tuning training on the initial abstract model based on the actual abstract and the predicted abstract, so as to obtain a second text abstract model; the second text summarization model is used to predict a text summary of the business text.
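One plausible shape of the fine-tuning performed by modules 19 through 22 is sketched below, treating summary prediction as per-sentence selection against the actual abstract; the function names and the binary objective are assumptions added for illustration.

```python
import torch
import torch.nn as nn

def fine_tune_step(initial_abstract_model, optimizer, sentence_vectors, label_in_actual_abstract):
    # Score every sentence of the sample text and pull the predicted abstract
    # toward the actual abstract indicated by the sample label.
    scores = initial_abstract_model(sentence_vectors)  # shape [num_sentences]
    loss = nn.functional.binary_cross_entropy_with_logits(
        scores, label_in_actual_abstract.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```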
Wherein the summary setting number is M; m is a positive integer;
the second building module 23 is configured to build an initial abstract model for processing the extracted abstract task based on the second pre-training model;
the second sample obtaining module 24 is configured to obtain a sample text for an initial abstract model, and encode X sentences in the sample text by using the initial abstract model to obtain X semantic encoding vectors; x is a positive integer;
the clustering module 25 is configured to perform clustering processing on the X semantic coding vectors based on the M initial cluster centers to obtain M clusters;
the second training module 26 is configured to train the initial abstract model based on M clusters, to obtain a second text abstract model; the second text summarization model is used to predict a text summary of the business text.
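The clustering performed by modules 23 through 26 can be pictured as below, assuming a K-means style procedure over the X semantic encoding vectors; the rule of keeping the sentence nearest each cluster centre is an assumption added for the sketch, since the embodiment only states that the M clusters are used for training.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_semantic_vectors(semantic_vectors, m):
    # semantic_vectors: array of shape [X, hidden], one row per encoded sentence.
    km = KMeans(n_clusters=m, n_init=10).fit(semantic_vectors)
    # As one possible use of the M clusters, keep the sentence closest to each
    # cluster centre as a summary candidate for training.
    picked = []
    for c in range(m):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(semantic_vectors[members] - km.cluster_centers_[c], axis=1)
        picked.append(int(members[dists.argmin()]))
    return km.labels_, sorted(picked)
```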
The query text determining module 27 is configured to determine a query text in the service query request when the service query request is acquired; the service query request is sent by a service terminal device;
the text abstract obtaining module 28 is configured to obtain text abstracts respectively corresponding to H service texts; H is a positive integer; each text abstract is obtained by invoking the second text abstract model to perform abstract prediction on a service text;
the similarity determining module 29 is configured to determine text similarity between the query text and the H text summaries, and obtain a service text corresponding to the text summary with the highest text similarity;
the service data determining module 30 is configured to determine service data for returning to the service terminal device based on the acquired service text, so that the service terminal device displays the service data.
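The retrieval flow of modules 27 through 30 could look like the sketch below; cosine similarity and the variable names are assumptions, as the embodiment only requires some text-similarity score between the query text and the H text abstracts.

```python
import numpy as np

def answer_service_query(query_vec, abstract_vecs, service_texts):
    # abstract_vecs: [H, d] vectors of the H text abstracts; query_vec: [d].
    sims = abstract_vecs @ query_vec / (
        np.linalg.norm(abstract_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    best = int(np.argmax(sims))
    # The service text behind the most similar abstract is used to build the
    # service data returned to the service terminal device for display.
    return service_texts[best]
```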
The specific implementation manners of the sentence dividing module 11, the model obtaining module 12, the first abstract extracting module 13, the second abstract extracting module 14, the contrast learning module 15, the sentence input module 16, the initial sample vector determining module 17, the sample semantic vector determining module 18, the first constructing module 19, the first sample obtaining module 20, the prediction abstract determining module 21, the first training module 22, the second constructing module 23, the second sample obtaining module 24, the clustering module 25, the second training module 26, the query text determining module 27, the text abstract obtaining module 28, the similarity determining module 29 and the business data determining module 30 may be referred to the description of the steps S201-S207 in the embodiment corresponding to fig. 6, and will not be repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 10, fig. 10 is a schematic diagram of a computer device according to an embodiment of the application. As shown in fig. 10, the computer device 1000 may include: at least one processor 1001 (e.g., a CPU), at least one network interface 1004, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication among these components. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 10, the memory 1005, which is one type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application. In some embodiments, the computer device may further include the user interface 1003 shown in fig. 10; for example, if the computer device is a terminal device with a model training function (for example, the terminal device 100a shown in fig. 1), the user interface 1003 may include a display screen (Display), a keyboard (Keyboard), and so on.
In the computer device 1000 shown in fig. 10, the network interface 1004 is mainly used for network communication; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement:
when an original text comprising N sentences is obtained, dividing the N sentences respectively based on influence degree and influence degree threshold value of each sentence to obtain a positive sample set and a negative sample set corresponding to the original text;
acquiring a first pre-training model; the first pre-training model is used for determining sample semantic vectors corresponding to each sentence respectively; the text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence respectively;
selecting, from the positive sample set, sentences whose number is consistent with the abstract setting number as a first abstract, and determining a first abstract semantic vector corresponding to the first abstract based on sample semantic vectors corresponding to the sentences in the first abstract;
selecting, from the negative sample set, sentences whose number is consistent with the abstract setting number as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract based on sample semantic vectors corresponding to the sentences in the second abstract;
Based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector, performing contrast learning on the first pre-training model to obtain a second pre-training model; the second pre-training model is used for processing the extracted abstract task.
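The dividing step listed above (and spelled out in claim 5 below for the single-threshold case) amounts to a simple per-sentence comparison; the sketch below uses hypothetical names and assumes the single-threshold variant.

```python
def split_by_influence(sentences, influence_degrees, first_threshold):
    # Sentences whose influence degree reaches the threshold join the positive
    # sample set; the remaining sentences join the negative sample set.
    positive_set, negative_set = [], []
    for sentence, degree in zip(sentences, influence_degrees):
        (positive_set if degree >= first_threshold else negative_set).append(sentence)
    return positive_set, negative_set
```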
It should be understood that the computer device 1000 described in the embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 3 and 6, and may also perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 9, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
The embodiment of the present application further provides a computer readable storage medium in which a computer program is stored. The computer program includes program instructions which, when executed by a processor, implement the data processing method provided by each step in fig. 3 and fig. 6; for details, reference may be made to the implementation manners provided by each step in fig. 3 and fig. 6, which are not repeated herein.
The computer readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments, or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the computer readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer readable storage medium, and the processor executes the computer program, so that the computer device may perform the description of the data processing method or apparatus in the foregoing embodiments, which is not described herein. In addition, the description of the beneficial effects of the same method is omitted.
The terms first, second and the like in the description and in the claims and drawings of embodiments of the application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise," "include," and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (17)

1. A method of data processing, comprising:
when an original text comprising N sentences is obtained, dividing the N sentences respectively based on influence degree and influence degree threshold value of each sentence to obtain a positive sample set and a negative sample set corresponding to the original text;
Acquiring a first pre-training model; the first pre-training model is used for determining sample semantic vectors corresponding to each sentence respectively; the text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence;
selecting sentences which are consistent with the set quantity of the abstracts from the positive sample set as a first abstract, and determining a first abstract semantic vector corresponding to the first abstract based on sample semantic vectors corresponding to the sentences in the first abstract;
selecting sentences which are consistent with the set quantity of the abstracts from the negative sample set as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract based on sample semantic vectors corresponding to the sentences in the second abstract;
based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector, performing contrast learning on the first pre-training model to obtain a second pre-training model; the second pre-training model is used for processing the extracted abstract task.
2. The method according to claim 1, wherein when the original text including N sentences is obtained, dividing the N sentences based on the influence degree and the influence degree threshold of each sentence to obtain a positive sample set and a negative sample set corresponding to the original text, respectively, includes:
When an original text comprising N sentences is acquired, acquiring a first text abstract model;
inputting the N sentences into the first text abstract model, and respectively carrying out abstract prediction on each sentence through the first text abstract model to obtain the influence degree of each sentence;
and dividing the N sentences based on the influence degree of each sentence and the influence degree threshold value to obtain a positive sample set and a negative sample set corresponding to the original text.
3. The method according to claim 2, wherein the inputting the N sentences into the first text abstract model, and performing abstract prediction on each sentence through the first text abstract model, respectively, to obtain the influence degree of each sentence, includes:
inputting the N sentences into the first text abstract model, and obtaining first sentence representations corresponding to each sentence respectively to obtain N first sentence representations; the first text summarization model comprises a text encoder and a text decoder;
encoding each sentence through the text encoder and the N first sentence representations to obtain a second sentence representation corresponding to each sentence;
And inputting N second sentence representations into the text decoder, and respectively decoding the N sentences through the text decoder and the N second sentence representations to obtain the influence degree of each sentence.
4. The method according to claim 3, wherein the N sentences comprise a sentence S_i; i is a positive integer less than or equal to N; the N first sentence representations include a first sentence representation E_i corresponding to the sentence S_i; the text encoder comprises a first text encoder and a second text encoder;
the encoding each sentence through the text encoder and the N first sentence representations to obtain the second sentence representation corresponding to each sentence includes:
performing a first encoding process on the sentence S_i through the first text encoder and the first sentence representation E_i to obtain a first encoding vector corresponding to the sentence S_i;
when the first encoding vectors respectively corresponding to the N sentences are obtained, inputting the N first encoding vectors into the second text encoder, and performing a second encoding process on the sentence S_i through the second text encoder and the N first encoding vectors to obtain a second encoding vector corresponding to the sentence S_i;
taking the second encoding vector corresponding to the sentence S_i as a second sentence representation D_i corresponding to the sentence S_i.
5. The method of claim 2, wherein the influence threshold comprises a first threshold;
the dividing the N sentences based on the influence degree of each sentence and the influence degree threshold value to obtain a positive sample set and a negative sample set corresponding to the original text, includes:
traversing the N sentences, and determining the traversed sentences as sentences to be divided;
if the influence degree of the sentences to be divided is greater than or equal to the first threshold value, adding the sentences to be divided into positive sample sets corresponding to the original text;
and if the influence degree of the sentences to be divided is smaller than the first threshold value, adding the sentences to be divided into a negative sample set corresponding to the original text.
6. The method of claim 2, wherein the influence threshold comprises a second threshold and a third threshold; the second threshold is greater than the third threshold;
the dividing the N sentences based on the influence degree of each sentence and the influence degree threshold value to obtain a positive sample set and a negative sample set corresponding to the original text, includes:
Traversing the N sentences, and determining the traversed sentences as sentences to be divided;
if the influence degree of the sentences to be divided is greater than or equal to the second threshold value, adding the sentences to be divided into positive sample sets corresponding to the original text;
if the influence degree of the sentences to be divided is smaller than the second threshold value and larger than the third threshold value, filtering the sentences to be divided;
and if the influence degree of the sentences to be divided is smaller than or equal to the third threshold value, adding the sentences to be divided into a negative sample set corresponding to the original text.
7. The method of claim 1, wherein the N sentences comprise a sentence S_i; i is a positive integer less than or equal to N;
the method further comprises the steps of:
inputting the sentence S_i into the first pre-training model;
determining, based on the character representation of each character in the sentence S_i, an initial sample vector corresponding to the sentence S_i; the character representation of one character is jointly determined by the word representation, the segment representation and the character position representation corresponding to the one character;
encoding the sentence S_i through the first pre-training model and the initial sample vector corresponding to the sentence S_i, to obtain a sample semantic vector corresponding to the sentence S_i.
8. The method of claim 1, wherein the first abstract semantic vector and the second abstract semantic vector are both average abstract semantic vectors; the average abstract semantic vector is obtained by averaging the sample semantic vectors of each sentence in the sentence set; the sentence set includes the first abstract and the second abstract.
9. The method of claim 1, wherein the performing contrast learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector, and the text semantic vector to obtain a second pre-training model comprises:
obtaining a model loss function of contrast learning;
determining model loss corresponding to the model loss function based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector;
training the first pre-training model based on the model loss to obtain a model training result;
and if the model training result indicates that the trained first pre-training model meets the model convergence condition, taking the first pre-training model meeting the model convergence condition as a second pre-training model.
10. The method according to claim 9, wherein the method further comprises:
if the model training result indicates that the trained first pre-training model does not meet the model convergence condition, adjusting model parameters of the first pre-training model based on the model loss function which does not meet the model convergence condition;
and training the transition model by taking the first pre-training model with the model parameters adjusted as a transition model, and taking the transition model meeting the model convergence condition as a second pre-training model when the trained transition model meets the model convergence condition.
11. The method according to claim 1, wherein the method further comprises:
constructing an initial abstract model for processing an abstract task based on the second pre-training model;
acquiring a sample text aiming at the initial abstract model and a sample label corresponding to the sample text; the sample tag is used for indicating an actual abstract of the sample text;
inputting the sample text into the initial abstract model, and carrying out abstract prediction on the sample text through the initial abstract model to obtain a predicted abstract corresponding to the sample text;
Performing fine tuning training on the initial abstract model based on the actual abstract and the predicted abstract to obtain a second text abstract model; the second text summarization model is used for predicting the text summarization of the business text.
12. The method of claim 1, wherein the summary set number is M; m is a positive integer;
the method further comprises the steps of:
constructing an initial abstract model for processing an abstract task based on the second pre-training model;
acquiring a sample text aiming at the initial abstract model, and respectively carrying out coding processing on X sentences in the sample text through the initial abstract model to obtain X semantic coding vectors; x is a positive integer;
clustering the X semantic coding vectors based on M initial cluster centers to obtain M clustering clusters;
training the initial abstract model based on the M cluster clusters to obtain a second text abstract model; the second text summarization model is used for predicting the text summarization of the business text.
13. The method according to claim 11 or 12, characterized in that the method further comprises:
when a service query request is acquired, determining a query text in the service query request; the service inquiry request is sent by service terminal equipment;
Acquiring text abstracts corresponding to the H business texts respectively; h is a positive integer; the text abstract is obtained by invoking the second text abstract model and carrying out abstract prediction on a business text;
respectively determining the text similarity between the query text and the H text summaries, and acquiring a service text corresponding to the text summary with the highest text similarity;
and determining service data for returning to the service terminal equipment based on the acquired service text so as to enable the service terminal equipment to display the service data.
14. A data processing apparatus, comprising:
the sentence dividing module is used for dividing N sentences respectively based on influence degree and influence degree threshold value of each sentence when the original text comprising N sentences is obtained, so as to obtain a positive sample set and a negative sample set corresponding to the original text;
the model acquisition module is used for acquiring a first pre-training model; the first pre-training model is used for determining sample semantic vectors corresponding to each sentence respectively; the text semantic vector corresponding to the original text is determined by the sample semantic vector corresponding to each sentence;
The first abstract extraction module is used for selecting sentences which are consistent with the abstract setting quantity from the positive sample set as a first abstract, and determining a first abstract semantic vector corresponding to the first abstract based on sample semantic vectors corresponding to the sentences in the first abstract;
the second abstract extraction module is used for selecting sentences which are consistent with the abstract set quantity from the negative sample set as a second abstract, and determining a second abstract semantic vector corresponding to the second abstract based on sample semantic vectors corresponding to the sentences in the second abstract;
the contrast learning module is used for carrying out contrast learning on the first pre-training model based on the first abstract semantic vector, the second abstract semantic vector and the text semantic vector to obtain a second pre-training model; the second pre-training model is used for processing the extracted abstract task.
15. A computer device, comprising: a processor and a memory and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide a data communication function, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1 to 13.
16. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1 to 13.
17. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, which computer program is adapted to be read and executed by a processor to cause a computer device with the processor to perform the method of any one of claims 1 to 13.