CN111046672A - Multi-scene text abstract generation method

Info

Publication number: CN111046672A (granted as CN111046672B)
Application number: CN201911264821.5A
Authority: CN (China)
Prior art keywords: sentence, vector, character, model, characters
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 樊昭磊, 吴军, 张伯政, 张述睿, 张福鑫
Original Assignee: Shandong Msunhealth Technology Group Co Ltd
Current Assignee: Zhongyang Health Technology Group Co ltd
Application filed 2019-12-11 by Shandong Msunhealth Technology Group Co Ltd; priority to CN201911264821.5A
Publication of CN111046672A: 2020-04-21; grant and publication of CN111046672B: 2020-07-14


Classifications

    • G06F16/345: Summarisation for human users (G Physics > G06 Computing > G06F Electric digital data processing > G06F16/00 Information retrieval > G06F16/30 unstructured textual data > G06F16/34 Browsing; Visualisation)
    • G06N3/045: Combinations of networks (G06N Computing arrangements based on specific computational models > G06N3/00 based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08: Learning methods (G06N3/00 based on biological models > G06N3/02 Neural networks)

Abstract

A multi-scene text abstract generation method comprises model learning and model use. The method fully takes into account the different information preferences of different scenes, so that differentiated abstracts of the same document can be extracted under different scenes; moreover, when the text abstract generation system is trained, no data in which texts and abstracts correspond one to one is used, which reduces the data cost.

Description

Multi-scene text abstract generation method
Technical Field
The invention relates to the technical field of natural language processing and text data mining, and in particular to a multi-scene text abstract generation method.
Background
With the rapid development of informatization, people increasingly face the problem of information explosion; how to quickly and accurately extract the desired data from a large amount of data has become the key to improving the efficiency of information acquisition.
Existing text abstract generation systems, whether supervised or unsupervised, have a fixed information preference when generating abstracts and are difficult to adapt to fields in which the scene must be switched constantly. In the medical field, for example, doctors of different departments focus on clearly different points when reviewing a record, whereas a traditional abstract system trained with a supervised or unsupervised method has a fixed information-extraction preference and cannot meet the requirements of doctors of different departments.
Disclosure of Invention
In order to overcome the defects of the above technology, the invention provides a multi-scene text abstract generation method that extracts different abstracts of the same document under different scenes.
The technical scheme adopted by the invention for overcoming the technical problems is as follows:
a multi-scene text abstract generation method comprises model learning and model use, wherein the specific model learning comprises the following steps:
a-1) acquiring an unlabeled original corpus data set, the original corpus data set being a number of complete articles; after duplicates are removed, the characters appearing in the original corpus data set are coded without repetition, the codes being consecutive positive integers, and the one-to-one correspondence between characters and codes is stored as a dictionary;
a-2) training the smoothness discrimination model of the neural network on the obtained original corpus data set so as to minimize the error of the smoothness discrimination model;
a-3) carrying out high-dimensional semantic space training through the obtained original corpus data set;
a-4) acquiring the scene corpus data sets for which abstracts are required, expressed as {T_1, T_2, T_3, ..., T_m}, where T_i is the article set under the i-th scene, 1 ≤ i ≤ m, i is a positive integer and m is the number of scenes; writing the number of articles under each scene of the scene corpus data set as the vector {l_1, l_2, l_3, ..., l_m}, where l_i is the number of articles in T_i; and constructing for each scene a weight vector {λ_i0, λ_i1, λ_i2, ..., λ_in} over the dictionary, where λ_ij is the summary weight of the character coded j in the dictionary under the i-th scene, 0 ≤ j ≤ n, and n is the number of characters in the dictionary,
[formula image BDA0002312087630000021: definition of the summary weight λ_ij]
where N_ij is the number of articles of the i-th scene in which the character coded j appears, l_k is the number of articles in T_k, and T_k is the article set under the k-th scene;
a-5) initializing an encoding-decoding model of the neural network, extracting an article from the original corpus data set, extracting a plurality of sentences from the article, and forming a sentence set from these sentences;
a-6) inputting the sentence set into the encoder of the encoding-decoding model, decoding with the decoder using a decoding algorithm, and recording the decoding result and the character probability distribution of each position in the decoding process;
a-7) inputting the characters of the decoding result into the smoothness discrimination model in sequence, and recording the character probability distribution of each position output by the smoothness discrimination model;
a-8) calculating, position by position, the error between the character probability distribution of each position in the decoding process and the character probability distribution of the corresponding position output by the smoothness discrimination model;
a-9) adjusting the encoding-decoding model with a neural network optimization algorithm so as to optimize the error of step a-8); stopping training if the error is minimized, otherwise jumping to execute step a-5);
the model use comprises the following steps:
b-1) given an article to be summarized, breaking it into sentences in their order of appearance to form the set {S_1, S_2, S_3, ..., S_o}, where o is the number of sentences and the i-th sentence S_i has length L_i, 1 ≤ i ≤ o;
b-2) the set of characters of the first sentence S_1 of the article to be summarized is
[formula image BDA0002312087630000032: character set of the first sentence S_1]
for each character of the first sentence, the corresponding code is looked up in the dictionary, the corresponding vector is taken out of the trained high-dimensional semantic space, and the extracted vectors are arranged in the order of appearance of the corresponding characters to form a vector sequence;
b-3) repeating step b-2) to obtain the vector sequence of every sentence S_1 to S_o of the article to be summarized, with V_Sij denoting the j-th vector of the i-th sentence;
b-4) from the weight vectors {λ_i0, λ_i1, λ_i2, ..., λ_in} of step a-4), taking out the character weight vector of the k-th scene, written {λ_k0, λ_k1, λ_k2, ..., λ_kn};
b-5) defining a sentence selection vector of length o as {h_1, h_2, h_3, ..., h_o}; if the selection value h_i of the i-th sentence equals 0, the i-th sentence S_i of the set {S_1, S_2, S_3, ..., S_o} is not in the extracted key sentence set; if h_i equals 1, the i-th sentence S_i is in the extracted key sentence set;
b-6) using the formula
[formula image BDA0002312087630000031: objective function over the sentence selection vector]
to calculate the sentence selection vector that maximizes the objective function, where λ_ki is the summary weight of the character coded i under the k-th scene, h_j is the selection value of the j-th sentence, V_i is the vector in the high-dimensional semantic space corresponding to the character coded i, V_Sjt is the vector in the high-dimensional semantic space corresponding to the t-th character of the j-th sentence of the sentence set to be summarized, |V_i| is the modulus of the vector V_i, and |V_Sjt| is the modulus of V_Sjt;
b-7) extracting the sentences whose value equals 1 in the sentence selection vector calculated in b-6) to form the key sentence set;
b-8) returning to step a-6) with the key sentence set in place of the sentence set, inputting it into the trained encoding-decoding model, and decoding with the decoder to obtain the final document abstract.
Furthermore, the characters in step a) are Chinese characters, English words or numbers; continuous digits containing a decimal point and continuous English strings are extracted as whole units; tab and space characters are deleted; and the codes of all characters are stored in json format.
Further, the training method of the smoothness discrimination model in step a-2) comprises the following steps:
a-2.1) initializing the smoothness discrimination model of the neural network, extracting an article from the obtained original corpus data set, and inputting the start symbol of the article into the smoothness discrimination model;
a-2.2) the smoothness discrimination model outputs the probability distribution of the first character of the article; calculating and recording the error between the first character of the article and the probability distribution output by the model;
a-2.3) forming a sequence of length 2 from the start symbol and the first character of the article and inputting it into the smoothness discrimination model, which outputs the probability distribution of the 2nd character of the article; calculating and recording the error between the 2nd character of the article and the probability distribution output by the model;
a-2.4) repeating step a-2.3) until the probability distributions of all characters output by the smoothness discrimination model have been obtained; calculating and recording the error between the final probability distribution and the end symbol of the article;
a-2.5) using the errors recorded in step a-2.4) to optimize the parameters of the smoothness discrimination model of the neural network; stopping training if the error is minimized, otherwise jumping to execute step a-2.1).
Further, the high-dimensional semantic space training in the step a-3) comprises the following steps:
a-3.1) initializing the vector set {V_0, V_1, V_2, V_3, ..., V_n}, where V_i is the high-dimensional vector representation of the i-th character of the dictionary of a-1), 1 ≤ i ≤ n, V_0 is the high-dimensional vector representation of characters that do not exist in the dictionary, and n is the number of characters in the dictionary;
a-3.2) extracting k consecutive characters from the original corpus data set to form a character fragment;
a-3.3) converting each character of the fragment into its code using the dictionary to form the code sequence corresponding to the fragment;
a-3.4) extracting the corresponding vectors in order from the vector set {V_0, V_1, V_2, V_3, ..., V_n} according to the code sequence to form a vector sequence;
a-3.5) optimizing the vector set; training stops when the cosine similarity between any two vectors of the vector sequence is maximized and the cosine similarity with vectors outside the vector sequence is minimized, otherwise jumping to execute step a-3.2).
The invention has the beneficial effects that: the method fully takes into account the different information preferences of different scenes, so that differentiated abstracts of the same document can be extracted under different scenes; moreover, when the text abstract generation system is trained, no data in which texts and abstracts correspond one to one is used, which reduces the data cost.
Detailed Description
The present invention is further explained below.
A multi-scene text abstract generation method comprises model learning and model use, wherein the specific model learning comprises the following steps:
a-1) obtaining an unlabeled original corpus data set, the original corpus data set being a number of complete articles; after duplicates are removed, the characters appearing in the original corpus data set are coded without repetition, the codes being consecutive positive integers, and the one-to-one correspondence between characters and codes is stored as a dictionary. The original corpus data set can be obtained from existing data sets, news articles, encyclopedia articles, medical records and the like.
a-2) training the smoothness discrimination model of the neural network on the obtained original corpus data set so as to minimize the error of the smoothness discrimination model. The smoothness discrimination model can be trained with deep learning architectures such as an RNN, GRU, LSTM or Transformer.
a-3) performing high-dimensional semantic space training on the obtained original corpus data set. The training method can be skip-gram, CBOW, GloVe or the like.
a-4) acquiring the scene corpus data sets for which abstracts are required, expressed as {T_1, T_2, T_3, ..., T_m}, where T_i is the article set under the i-th scene, 1 ≤ i ≤ m, i is a positive integer and m is the number of scenes; writing the number of articles under each scene of the scene corpus data set as the vector {l_1, l_2, l_3, ..., l_m}, where l_i is the number of articles in T_i; and constructing for each scene a weight vector {λ_i0, λ_i1, λ_i2, ..., λ_in} over the dictionary, where λ_ij is the summary weight of the character coded j in the dictionary under the i-th scene, 0 ≤ j ≤ n, and n is the number of characters in the dictionary,
[formula image BDA0002312087630000061: definition of the summary weight λ_ij]
where N_ij is the number of articles of the i-th scene in which the character coded j appears, l_k is the number of articles in T_k, and T_k is the article set under the k-th scene;
a-5) initializing an encoding-decoding model of the neural network, extracting an article from the original corpus data set, extracting a plurality of sentences from the article, and forming a sentence set from these sentences;
a-6) inputting the sentence set into the encoder of the encoding-decoding model, decoding with the decoder using a decoding algorithm, and recording the decoding result and the character probability distribution of each position in the decoding process;
a-7) inputting the characters of the decoding result into the smoothness discrimination model in sequence, and recording the character probability distribution of each position output by the smoothness discrimination model;
a-8) calculating, position by position, the error between the character probability distribution of each position in the decoding process and the character probability distribution of the corresponding position output by the smoothness discrimination model;
a-9) adjusting the encoding-decoding model with a neural network optimization algorithm so as to optimize the error of step a-8); stopping training if the error is minimized, otherwise jumping to execute step a-5);
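A minimal sketch of the training loop of steps a-5) to a-9), assuming PyTorch, is given below. The architecture (a small GRU encoder-decoder), greedy decoding, the start-symbol code and the use of KL divergence as the per-position error are assumptions rather than the patent's exact choices; the `smoothness_model` argument is assumed to map a code sequence to per-position character distributions, as sketched after step a-2.5) further below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2Seq(nn.Module):
    """Tiny GRU encoder-decoder over character codes (illustrative, not the patent's exact architecture)."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src, max_len=50, start_code=0):
        """Greedy decoding (step a-6): returns the decoded codes and the character
        probability distribution of every decoding position."""
        _, h = self.enc(self.emb(src))
        tok = torch.full((src.size(0), 1), start_code, dtype=torch.long)
        codes, dists = [], []
        for _ in range(max_len):
            o, h = self.dec(self.emb(tok), h)
            p = F.softmax(self.out(o[:, -1]), dim=-1)
            tok = p.argmax(-1, keepdim=True)
            codes.append(tok)
            dists.append(p)
        return torch.cat(codes, dim=1), torch.stack(dists, dim=1)

def training_step(seq2seq, smoothness_model, sentence_codes, optimizer, start_code=0):
    """Steps a-6) to a-9): decode, re-score the decoded characters with the frozen
    smoothness model, and minimise the per-position divergence of the two distributions."""
    codes, dec_dist = seq2seq(sentence_codes, start_code=start_code)
    with torch.no_grad():                                     # step a-7): feed <start>, c1, ..., c(T-1)
        prefix = torch.cat([torch.full_like(codes[:, :1], start_code), codes[:, :-1]], dim=1)
        ref_dist = smoothness_model(prefix)
    loss = F.kl_div(dec_dist.clamp_min(1e-9).log(), ref_dist,
                    reduction="batchmean")                    # step a-8)
    optimizer.zero_grad(); loss.backward(); optimizer.step()  # step a-9)
    return loss.item()
```

In this reading, the smoothness model acts as a frozen teacher: the encoder-decoder is pushed towards producing character distributions that the fluency model itself would assign, which is what allows training without paired text-abstract data.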
the model use comprises the following steps:
b-1) given an article to be summarized, breaking it into sentences in their order of appearance to form the set {S_1, S_2, S_3, ..., S_o}, where o is the number of sentences and the i-th sentence S_i has length L_i, 1 ≤ i ≤ o;
b-2) the set of characters of the first sentence S_1 of the article to be summarized is
[formula image BDA0002312087630000062: character set of the first sentence S_1]
for each character of the first sentence, the corresponding code is looked up in the dictionary, the corresponding vector is taken out of the trained high-dimensional semantic space, and the extracted vectors are arranged in the order of appearance of the corresponding characters to form a vector sequence;
b-3) repeating step b-2) to obtain the vector sequence of every sentence S_1 to S_o of the article to be summarized, with V_Sij denoting the j-th vector of the i-th sentence;
b-4) from the weight vectors {λ_i0, λ_i1, λ_i2, ..., λ_in} of step a-4), taking out the character weight vector of the k-th scene, written {λ_k0, λ_k1, λ_k2, ..., λ_kn};
b-5) defining a sentence selection vector of length o as {h_1, h_2, h_3, ..., h_o}; if the selection value h_i of the i-th sentence equals 0, the i-th sentence S_i of the set {S_1, S_2, S_3, ..., S_o} is not in the extracted key sentence set; if h_i equals 1, the i-th sentence S_i is in the extracted key sentence set;
b-6) using the formula
[formula image BDA0002312087630000071: objective function over the sentence selection vector]
to calculate the sentence selection vector that maximizes the objective function, where λ_ki is the summary weight of the character coded i under the k-th scene, h_j is the selection value of the j-th sentence, V_i is the vector in the high-dimensional semantic space corresponding to the character coded i, V_Sjt is the vector in the high-dimensional semantic space corresponding to the t-th character of the j-th sentence of the sentence set to be summarized, |V_i| is the modulus of the vector V_i, and |V_Sjt| is the modulus of V_Sjt;
b-7) extracting the sentences whose value equals 1 in the sentence selection vector calculated in b-6) to form the key sentence set;
b-8) returning to step a-6) with the key sentence set in place of the sentence set, inputting it into the trained encoding-decoding model, and decoding with the decoder to obtain the final document abstract.
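Because the objective of step b-6) is only given as a figure, the sketch below is a hypothetical instantiation of steps b-2) to b-7), assuming Python/NumPy: each sentence is converted into its vector sequence, scored by the scene-weighted cosine similarity between its character vectors V_Sjt and the dictionary vectors V_i, and the selection vector h keeps the top-scoring sentences. The scoring formula, the length normalisation and the cap on the number of selected sentences are assumptions, not the patent's exact objective.

```python
import numpy as np

def sentence_to_vectors(sentence, dictionary, V):
    """Steps b-2)/b-3): map each character to its code and stack the corresponding
    rows of the trained semantic space V (row 0 is used for unknown characters)."""
    return np.stack([V[dictionary.get(ch, 0)] for ch in sentence])   # shape (L_j, dim)

def sentence_score(sent_vecs, V, lam_k):
    """Hypothetical score: scene-weighted sum of cosine similarities between the
    character vectors of one sentence (V_Sjt) and the dictionary vectors (V_i)."""
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-9)
    Sn = sent_vecs / (np.linalg.norm(sent_vecs, axis=1, keepdims=True) + 1e-9)
    return float(((Sn @ Vn.T) * lam_k).sum())

def select_key_sentences(sentences, dictionary, V, lam_k, max_sentences=3):
    """Steps b-5) to b-7): build the 0/1 selection vector h; the sentence-count cap is
    an assumed constraint, since an unconstrained maximum would keep every sentence."""
    seqs = [sentence_to_vectors(s, dictionary, V) for s in sentences]
    scores = [sentence_score(vs, V, lam_k) / len(vs) for vs in seqs]  # length-normalised
    keep = set(np.argsort(scores)[::-1][:max_sentences].tolist())
    return [1 if i in keep else 0 for i in range(len(sentences))]
```

The sentences with h_i = 1 form the key sentence set, which is then fed to the trained encoding-decoding model as in step b-8).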
By the method, the different information preferences of different scenes are fully taken into account, so that differentiated abstracts of the same document can be extracted under different scenes; moreover, when the text abstract generation system is trained, no data in which texts and abstracts correspond one to one is used, which reduces the data cost.
Further, the characters in step a) are Chinese characters, English words or numbers; continuous digits containing a decimal point and continuous English strings are extracted as whole units; tab and space characters are deleted; split units of the same type are merged, the whole data set is traversed to obtain a set of non-repeated split units, and the final set is coded and stored in json format (for example: { '的': 1, ... }).
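A minimal sketch of the dictionary construction of step a-1) with the character handling described above is given below, in Python; the function and file names are illustrative assumptions, and whole-unit extraction of decimal numbers and English words is omitted for brevity.

```python
import json

def build_char_dictionary(articles):
    """Assign consecutive positive-integer codes to the distinct characters of the corpus.
    Code 0 is left free for characters that are not in the dictionary (used later as V_0)."""
    dictionary = {}
    for article in articles:
        for ch in article:
            if ch in ('\t', ' '):          # tab and space characters are deleted
                continue
            if ch not in dictionary:       # repeated characters are coded only once
                dictionary[ch] = len(dictionary) + 1
    return dictionary

if __name__ == "__main__":
    corpus = ["患者主诉头痛三天。", "血压 140/90 mmHg。"]
    with open("char_dict.json", "w", encoding="utf-8") as f:
        json.dump(build_char_dictionary(corpus), f, ensure_ascii=False)
```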
Further, the training method of the smoothness discrimination model in step a-2) comprises the following steps:
a-2.1) initializing the smoothness discrimination model of the neural network, extracting an article from the obtained original corpus data set, and inputting the start symbol of the article into the smoothness discrimination model;
a-2.2) the smoothness discrimination model outputs the probability distribution of the first character of the article; calculating and recording the error between the first character of the article and the probability distribution output by the model;
a-2.3) forming a sequence of length 2 from the start symbol and the first character of the article and inputting it into the smoothness discrimination model, which outputs the probability distribution of the 2nd character of the article; calculating and recording the error between the 2nd character of the article and the probability distribution output by the model;
a-2.4) repeating step a-2.3) until the probability distributions of all characters output by the smoothness discrimination model have been obtained; calculating and recording the error between the final probability distribution and the end symbol of the article;
a-2.5) using the errors recorded in step a-2.4) to optimize the parameters of the smoothness discrimination model of the neural network; stopping training if the error is minimized, otherwise jumping to execute step a-2.1).
The finally trained model can infer the probability of the next character from the given information; the output is a vector whose length equals the size of the dictionary stored in json format, and each position of the vector holds the probability of the character with the corresponding index in the dictionary.
Given a character sequence, the start symbol is input first and the probability distribution of the first character is output; then the first character is input and the probability distribution of the second character is output; this continues in turn until all characters have been input, at which point the probability distribution corresponding to each position is obtained.
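Below is a minimal sketch of such a smoothness discrimination model and of the per-article training of steps a-2.1) to a-2.5), assuming PyTorch and a GRU (the patent equally allows an RNN, LSTM or Transformer); the class and function names and the codes chosen for the start and end symbols are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothnessModel(nn.Module):
    """Character-level language model used as the smoothness discrimination model."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, codes):
        """codes: (batch, length) -> per-position probability distribution of the next character."""
        h, _ = self.rnn(self.emb(codes))
        return F.softmax(self.out(h), dim=-1)

def train_on_article(model, article_codes, optimizer, start_code, end_code):
    """Steps a-2.1) to a-2.5) for one article: feed <start>, c1, ..., cT and accumulate the
    per-position errors against c1, ..., cT, <end> (the symbol codes are assumed extra entries)."""
    seq = torch.tensor([[start_code] + article_codes])
    target = torch.tensor([article_codes + [end_code]])
    probs = model(seq)                                         # distribution at every position
    loss = F.nll_loss(probs.clamp_min(1e-9).log().squeeze(0), target.squeeze(0))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Running the model once over the whole prefixed sequence yields the same per-position distributions as feeding the prefixes one by one as in steps a-2.2) to a-2.4), since the GRU is autoregressive; the one-pass version is simply more efficient.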
Further, the high-dimensional semantic space training in the step a-3) comprises the following steps:
a-3.1) initializing the vector set {V_0, V_1, V_2, V_3, ..., V_n}, where V_i is the high-dimensional vector representation of the i-th character of the dictionary of a-1), 1 ≤ i ≤ n, V_0 is the high-dimensional vector representation of characters that do not exist in the dictionary, and n is the number of characters in the dictionary;
a-3.2) extracting k consecutive characters from the original corpus data set to form a character fragment;
a-3.3) converting each character of the fragment into its code using the dictionary to form the code sequence corresponding to the fragment;
a-3.4) extracting the corresponding vectors in order from the vector set {V_0, V_1, V_2, V_3, ..., V_n} according to the code sequence to form a vector sequence;
a-3.5) optimizing the vector set; training stops when the cosine similarity between any two vectors of the vector sequence is maximized and the cosine similarity with vectors outside the vector sequence is minimized, otherwise jumping to execute step a-3.2).
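A rough sketch of steps a-3.1) to a-3.5), assuming PyTorch: the vectors of characters occurring in the same k-character fragment are pulled together in cosine similarity and pushed away from randomly drawn characters outside the fragment. The negative-sampling style objective and the hyper-parameters are assumptions; skip-gram, CBOW or GloVe, as mentioned for step a-3), would be drop-in alternatives.

```python
import random
import torch
import torch.nn.functional as F

def train_semantic_space(code_corpus, n, dim=100, k=5, steps=10000, lr=0.05):
    """code_corpus: list of articles, each already converted to a list of dictionary codes.
    Returns the vector set {V_0, ..., V_n} as an (n+1, dim) tensor (row 0 = unknown characters)."""
    V = torch.randn(n + 1, dim, requires_grad=True)            # step a-3.1)
    opt = torch.optim.SGD([V], lr=lr)
    for _ in range(steps):
        doc = random.choice(code_corpus)
        if len(doc) < k:
            continue
        start = random.randrange(len(doc) - k + 1)
        frag = doc[start:start + k]                            # steps a-3.2) to a-3.4)
        neg = [random.randrange(1, n + 1) for _ in range(k)]   # codes drawn outside the fragment
        pos_sim = F.cosine_similarity(V[frag[:-1]], V[frag[1:]]).mean()
        neg_sim = F.cosine_similarity(V[frag], V[neg]).mean()
        loss = neg_sim - pos_sim                               # step a-3.5) objective
        opt.zero_grad(); loss.backward(); opt.step()
    return V.detach()
```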

Claims (4)

1. A multi-scene text abstract generation method is characterized by comprising model learning and model use, wherein the specific model learning comprises the following steps:
a-1) acquiring an unlabeled original corpus data set, the original corpus data set being a number of complete articles; after duplicates are removed, the characters appearing in the original corpus data set are coded without repetition, the codes being consecutive positive integers, and the one-to-one correspondence between characters and codes is stored as a dictionary;
a-2) training the smoothness discrimination model of the neural network on the obtained original corpus data set so as to minimize the error of the smoothness discrimination model;
a-3) carrying out high-dimensional semantic space training through the obtained original corpus data set;
a-4) acquiring the scene corpus data sets for which abstracts are required, expressed as {T_1, T_2, T_3, ..., T_m}, where T_i is the article set under the i-th scene, 1 ≤ i ≤ m, i is a positive integer and m is the number of scenes; writing the number of articles under each scene of the scene corpus data set as the vector {l_1, l_2, l_3, ..., l_m}, where l_i is the number of articles in T_i; and constructing for each scene a weight vector {λ_i0, λ_i1, λ_i2, ..., λ_in} over the dictionary, where λ_ij is the summary weight of the character coded j in the dictionary under the i-th scene, 0 ≤ j ≤ n, and n is the number of characters in the dictionary,
[formula image FDA0002312087620000011: definition of the summary weight λ_ij]
where N_ij is the number of articles of the i-th scene in which the character coded j appears, l_k is the number of articles in T_k, and T_k is the article set under the k-th scene;
a-5) initializing an encoding-decoding model of the neural network, extracting an article from the original corpus data set, extracting a plurality of sentences from the article, and forming a sentence set from these sentences;
a-6) inputting the sentence set into the encoder of the encoding-decoding model, decoding with the decoder using a decoding algorithm, and recording the decoding result and the character probability distribution of each position in the decoding process;
a-7) inputting the characters of the decoding result into the smoothness discrimination model in sequence, and recording the character probability distribution of each position output by the smoothness discrimination model;
a-8) calculating, position by position, the error between the character probability distribution of each position in the decoding process and the character probability distribution of the corresponding position output by the smoothness discrimination model;
a-9) adjusting the encoding-decoding model with a neural network optimization algorithm so as to optimize the error of step a-8); stopping training if the error is minimized, otherwise jumping to execute step a-5);
the model use comprises the following steps:
b-1) given an article to be summarized, breaking it into sentences in their order of appearance to form the set {S_1, S_2, S_3, ..., S_o}, where o is the number of sentences and the i-th sentence S_i has length L_i, 1 ≤ i ≤ o;
b-2) the set of characters of the first sentence S_1 of the article to be summarized is
[formula image FDA0002312087620000021: character set of the first sentence S_1]
for each character of the first sentence, the corresponding code is looked up in the dictionary, the corresponding vector is taken out of the trained high-dimensional semantic space, and the extracted vectors are arranged in the order of appearance of the corresponding characters to form a vector sequence;
b-3) repeating step b-2) to obtain the vector sequence of every sentence S_1 to S_o of the article to be summarized, with V_Sij denoting the j-th vector of the i-th sentence;
b-4) from the weight vectors {λ_i0, λ_i1, λ_i2, ..., λ_in} of step a-4), taking out the character weight vector of the k-th scene, written {λ_k0, λ_k1, λ_k2, ..., λ_kn};
b-5) defining a sentence selection vector of length o as {h_1, h_2, h_3, ..., h_o}; if the selection value h_i of the i-th sentence equals 0, the i-th sentence S_i of the set {S_1, S_2, S_3, ..., S_o} is not in the extracted key sentence set; if h_i equals 1, the i-th sentence S_i is in the extracted key sentence set;
b-6) using the formula
[formula image FDA0002312087620000031: objective function over the sentence selection vector]
to calculate the sentence selection vector that maximizes the objective function, where λ_ki is the summary weight of the character coded i under the k-th scene, h_j is the selection value of the j-th sentence, V_i is the vector in the high-dimensional semantic space corresponding to the character coded i, V_Sjt is the vector in the high-dimensional semantic space corresponding to the t-th character of the j-th sentence of the sentence set to be summarized, |V_i| is the modulus of the vector V_i, and |V_Sjt| is the modulus of V_Sjt;
b-7) extracting the sentences whose value equals 1 in the sentence selection vector calculated in b-6) to form the key sentence set;
b-8) returning to step a-6) with the key sentence set in place of the sentence set, inputting it into the trained encoding-decoding model, and decoding with the decoder to obtain the final document abstract.
2. The multi-scene text abstract generation method according to claim 1, characterized in that: the characters in step a) are Chinese characters, English words or numbers; continuous digits containing a decimal point and continuous English strings are extracted as whole units; tab and space characters are deleted; and the codes of all characters are stored in json format.
3. The multi-scene text abstract generation method according to claim 1, wherein the training method of the smoothness discrimination model in step a-2) comprises the following steps:
a-2.1) initializing the smoothness discrimination model of the neural network, extracting an article from the obtained original corpus data set, and inputting the start symbol of the article into the smoothness discrimination model;
a-2.2) the smoothness discrimination model outputs the probability distribution of the first character of the article; calculating and recording the error between the first character of the article and the probability distribution output by the model;
a-2.3) forming a sequence of length 2 from the start symbol and the first character of the article and inputting it into the smoothness discrimination model, which outputs the probability distribution of the 2nd character of the article; calculating and recording the error between the 2nd character of the article and the probability distribution output by the model;
a-2.4) repeating step a-2.3) until the probability distributions of all characters output by the smoothness discrimination model have been obtained; calculating and recording the error between the final probability distribution and the end symbol of the article;
a-2.5) using the errors recorded in step a-2.4) to optimize the parameters of the smoothness discrimination model of the neural network; stopping training if the error is minimized, otherwise jumping to execute step a-2.1).
4. The multi-scene text abstract generation method according to claim 1, wherein the training of the high-dimensional semantic space in step a-3) comprises the following steps:
a-3.1) initializing the vector set {V_0, V_1, V_2, V_3, ..., V_n}, where V_i is the high-dimensional vector representation of the i-th character of the dictionary of a-1), 1 ≤ i ≤ n, V_0 is the high-dimensional vector representation of characters that do not exist in the dictionary, and n is the number of characters in the dictionary;
a-3.2) extracting k consecutive characters from the original corpus data set to form a character fragment;
a-3.3) converting each character of the fragment into its code using the dictionary to form the code sequence corresponding to the fragment;
a-3.4) extracting the corresponding vectors in order from the vector set {V_0, V_1, V_2, V_3, ..., V_n} according to the code sequence to form a vector sequence;
a-3.5) optimizing the vector set; training stops when the cosine similarity between any two vectors of the vector sequence is maximized and the cosine similarity with vectors outside the vector sequence is minimized, otherwise jumping to execute step a-3.2).
CN201911264821.5A 2019-12-11 2019-12-11 Multi-scene text abstract generation method Active CN111046672B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911264821.5A (CN111046672B) | 2019-12-11 | 2019-12-11 | Multi-scene text abstract generation method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911264821.5A (CN111046672B) | 2019-12-11 | 2019-12-11 | Multi-scene text abstract generation method

Publications (2)

Publication Number | Publication Date
CN111046672A | 2020-04-21
CN111046672B | 2020-07-14

Family

ID=70235588

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911264821.5A (Active, CN111046672B) | Multi-scene text abstract generation method | 2019-12-11 | 2019-12-11

Country Status (1)

Country Link
CN (1) CN111046672B (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN107092674A (en) * 2017-04-14 2017-08-25 福建工程学院 The automatic abstracting method and system of a kind of Chinese medicine acupuncture field event trigger word
US20180373986A1 (en) * 2017-06-26 2018-12-27 QbitLogic, Inc. Machine learning using dynamic multilayer perceptrons
CN108062351A (en) * 2017-11-14 2018-05-22 厦门市美亚柏科信息股份有限公司 Text snippet extracting method, readable storage medium storing program for executing on particular topic classification
CN109977981A (en) * 2017-12-27 2019-07-05 深圳市优必选科技有限公司 Scene analytic method, robot and storage device based on binocular vision
CN110362654A (en) * 2018-04-09 2019-10-22 谢碧青 Target group data acquires classification method
CN110162778A (en) * 2019-04-02 2019-08-23 阿里巴巴集团控股有限公司 The generation method and device of text snippet
CN110196903A (en) * 2019-05-06 2019-09-03 中国海洋大学 A kind of method and system for for article generation abstract
CN110134964A (en) * 2019-05-20 2019-08-16 中国科学技术大学 A kind of text matching technique based on stratification convolutional neural networks and attention mechanism
CN110210037A (en) * 2019-06-12 2019-09-06 四川大学 Category detection method towards evidence-based medicine EBM field
CN110287309A (en) * 2019-06-21 2019-09-27 深圳大学 The method of rapidly extracting text snippet
CN110347819A (en) * 2019-06-21 2019-10-18 同济大学 A kind of text snippet generation method based on positive negative sample dual training
CN110263257A (en) * 2019-06-24 2019-09-20 北京交通大学 Multi-source heterogeneous data mixing recommended models based on deep learning
CN110491465A (en) * 2019-08-20 2019-11-22 山东众阳健康科技集团有限公司 Classification of diseases coding method, system, equipment and medium based on deep learning
CN110473636A (en) * 2019-08-22 2019-11-19 山东众阳健康科技集团有限公司 Intelligent doctor's advice recommended method and system based on deep learning
CN110532328A (en) * 2019-08-26 2019-12-03 哈尔滨工程大学 A kind of text concept figure building method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG LIU et al.: "Text Summarization with Pretrained Encoders", https://arxiv.org/pdf/1908.08345.pdf *
张弛: "Text Summarization Algorithm Based on Semantic Reconstruction", China Masters' Theses Full-text Database, Information Science and Technology Series *
胡莺夕: "Research and Implementation of Multi-Entity Relation Recognition and Automatic Text Summarization Methods Based on Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116976290A (en) * 2023-06-19 2023-10-31 珠海盈米基金销售有限公司 Multi-scene information abstract generation method and device based on autoregressive model
CN116976290B (en) * 2023-06-19 2024-03-19 珠海盈米基金销售有限公司 Multi-scene information abstract generation method and device based on autoregressive model

Also Published As

Publication number | Publication date
CN111046672B (en) | 2020-07-14

Similar Documents

Publication Publication Date Title
CN111897949B (en) Guided text abstract generation method based on Transformer
CN110209801B (en) Text abstract automatic generation method based on self-attention network
CN110795556B (en) Abstract generation method based on fine-grained plug-in decoding
CN109190131B (en) Neural machine translation-based English word and case joint prediction method thereof
CN111694924B (en) Event extraction method and system
CN110275936B (en) Similar legal case retrieval method based on self-coding neural network
CN111178093B (en) Neural machine translation system training acceleration method based on stacking algorithm
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN112749253B (en) Multi-text abstract generation method based on text relation graph
CN115438154A (en) Chinese automatic speech recognition text restoration method and system based on representation learning
CN115034218A (en) Chinese grammar error diagnosis method based on multi-stage training and editing level voting
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN113065349A (en) Named entity recognition method based on conditional random field
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN113423004A (en) Video subtitle generating method and system based on decoupling decoding
CN113221542A (en) Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN111046672B (en) Multi-scene text abstract generation method
CN111444720A (en) Named entity recognition method for English text
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN112364647A (en) Duplicate checking method based on cosine similarity algorithm
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN110704664A (en) Hash retrieval method
CN116069924A (en) Text abstract generation method and system integrating global and local semantic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
    Address after: 12 / F, building 1, Aosheng building, 1166 Xinluo street, hi tech Zone, Jinan City, Shandong Province
    Patentee after: Zhongyang Health Technology Group Co.,Ltd.
    Address before: 12 / F, building 1, Aosheng building, 1166 Xinluo street, high tech Zone, Jinan City, Shandong Province
    Patentee before: SHANDONG MSUNHEALTH TECHNOLOGY GROUP Co.,Ltd.