CN111046672A - Multi-scene text abstract generation method

Info

Publication number: CN111046672A (granted as CN111046672B)
Application number: CN201911264821.5A
Authority: CN (China)
Prior art keywords: sentence, vector, character, model, characters
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 樊昭磊, 吴军, 张伯政, 张述睿, 张福鑫
Original Assignee: Shandong Msunhealth Technology Group Co Ltd
Current Assignee: Zhongyang Health Technology Group Co ltd
Application filed 2019-12-11 by Shandong Msunhealth Technology Group Co Ltd; priority to CN201911264821.5A
Publication of CN111046672A: 2020-04-21; grant and publication of CN111046672B: 2020-07-14


Classifications

    • G06F16/345: Summarisation for human users (G Physics > G06 Computing > G06F Electric digital data processing > G06F16/00 Information retrieval > G06F16/30 unstructured textual data > G06F16/34 Browsing; Visualisation)
    • G06N3/045: Combinations of networks (G06N Computing arrangements based on specific computational models > G06N3/00 based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/08: Learning methods (G06N3/00 based on biological models > G06N3/02 Neural networks)

Abstract

A multi-scene text abstract generation method comprises model learning and model use. The method fully takes into account the different information preferences of different scenes, so that differentiated abstracts of the same document can be extracted under different scenes; moreover, when the text abstract generation system is trained, no data in which texts and abstracts correspond one to one is used, which reduces the data cost.

Description

Multi-scene text abstract generation method
Technical Field
The invention relates to the technical field of natural language processing and text data mining, and in particular to a multi-scene text abstract generation method.
Background
With the rapid development of informatization, people increasingly face the problem of information explosion; how to quickly and accurately extract the desired data from a large amount of data has become the key to improving the efficiency of information acquisition.
Existing text abstract generation systems, whether supervised or unsupervised, have a fixed information preference when generating abstracts and are difficult to adapt to fields in which the scene must be switched constantly. In the medical field, for example, doctors of different departments focus on clearly different points when reviewing a record, whereas a traditional abstract system trained with a supervised or unsupervised method has a fixed information-extraction preference and cannot meet the requirements of doctors of different departments.
Disclosure of Invention
In order to overcome the defects of the above technology, the invention provides a multi-scene text abstract generation method that extracts different abstracts of the same document under different scenes.
The technical scheme adopted by the invention for overcoming the technical problems is as follows:
a multi-scene text abstract generation method comprises model learning and model use, wherein the specific model learning comprises the following steps:
a-1) acquiring an unlabeled original corpus data set, the original corpus data set being a number of complete articles; after duplicates are removed, the characters appearing in the original corpus data set are coded without repetition, the codes being consecutive positive integers, and the one-to-one correspondence between characters and codes is stored as a dictionary;
a-2) training the smoothness discrimination model of the neural network on the obtained original corpus data set so as to minimize the error of the smoothness discrimination model;
a-3) carrying out high-dimensional semantic space training through the obtained original corpus data set;
a-4) acquiring the scene corpus data sets for which abstracts are required, expressed as {T_1, T_2, T_3, ..., T_m}, where T_i is the article set under the i-th scene, 1 ≤ i ≤ m, i is a positive integer and m is the number of scenes; writing the number of articles under each scene of the scene corpus data set as the vector {l_1, l_2, l_3, ..., l_m}, where l_i is the number of articles in T_i; and constructing for each scene a weight vector {λ_i0, λ_i1, λ_i2, ..., λ_in} over the dictionary, where λ_ij is the summary weight of the character coded j in the dictionary under the i-th scene, 0 ≤ j ≤ n, and n is the number of characters in the dictionary,
[formula image BDA0002312087630000021: definition of the summary weight λ_ij]
where N_ij is the number of articles of the i-th scene in which the character coded j appears, l_k is the number of articles in T_k, and T_k is the article set under the k-th scene;
a-5) initializing an encoding-decoding model of the neural network, extracting an article from the original corpus data set, extracting a plurality of sentences from the article, and forming a sentence set from these sentences;
a-6) inputting the sentence set into the encoder of the encoding-decoding model, decoding with the decoder using a decoding algorithm, and recording the decoding result and the character probability distribution of each position in the decoding process;
a-7) inputting the characters of the decoding result into the smoothness discrimination model in sequence, and recording the character probability distribution of each position output by the smoothness discrimination model;
a-8) calculating, position by position, the error between the character probability distribution of each position in the decoding process and the character probability distribution of the corresponding position output by the smoothness discrimination model;
a-9) adjusting the encoding-decoding model with a neural network optimization algorithm so as to optimize the error of step a-8); stopping training if the error is minimized, otherwise jumping to execute step a-5);
the model use comprises the following steps:
b-1) given an article to be summarized, breaking it into sentences in their order of appearance to form the set {S_1, S_2, S_3, ..., S_o}, where o is the number of sentences and the i-th sentence S_i has length L_i, 1 ≤ i ≤ o;
b-2) the set of characters of the first sentence S_1 of the article to be summarized is
[formula image BDA0002312087630000032: character set of the first sentence S_1]
for each character of the first sentence, the corresponding code is looked up in the dictionary, the corresponding vector is taken out of the trained high-dimensional semantic space, and the extracted vectors are arranged in the order of appearance of the corresponding characters to form a vector sequence;
b-3) repeating step b-2) to obtain the vector sequence of every sentence S_1 to S_o of the article to be summarized, with V_Sij denoting the j-th vector of the i-th sentence;
b-4) from the weight vectors {λ_i0, λ_i1, λ_i2, ..., λ_in} of step a-4), taking out the character weight vector of the k-th scene, written {λ_k0, λ_k1, λ_k2, ..., λ_kn};
b-5) defining a sentence selection vector of length o as {h_1, h_2, h_3, ..., h_o}; if the selection value h_i of the i-th sentence equals 0, the i-th sentence S_i of the set {S_1, S_2, S_3, ..., S_o} is not in the extracted key sentence set; if h_i equals 1, the i-th sentence S_i is in the extracted key sentence set;
b-6) using the formula
[formula image BDA0002312087630000031: objective function over the sentence selection vector]
to calculate the sentence selection vector that maximizes the objective function, where λ_ki is the summary weight of the character coded i under the k-th scene, h_j is the selection value of the j-th sentence, V_i is the vector in the high-dimensional semantic space corresponding to the character coded i, V_Sjt is the vector in the high-dimensional semantic space corresponding to the t-th character of the j-th sentence of the sentence set to be summarized, |V_i| is the modulus of the vector V_i, and |V_Sjt| is the modulus of V_Sjt;
b-7) extracting the sentences whose value equals 1 in the sentence selection vector calculated in b-6) to form the key sentence set;
b-8) returning to step a-6) with the key sentence set in place of the sentence set, inputting it into the trained encoding-decoding model, and decoding with the decoder to obtain the final document abstract.
Furthermore, the characters in step a) are Chinese characters, English words or numbers; continuous digits containing a decimal point and continuous English strings are extracted as whole units; tab and space characters are deleted; and the codes of all characters are stored in json format.
Further, the training method of the smoothness discrimination model in step a-2) comprises the following steps:
a-2.1) initializing the smoothness discrimination model of the neural network, extracting an article from the obtained original corpus data set, and inputting the start symbol of the article into the smoothness discrimination model;
a-2.2) the smoothness discrimination model outputs the probability distribution of the first character of the article; calculating and recording the error between the first character of the article and the probability distribution output by the model;
a-2.3) forming a sequence of length 2 from the start symbol and the first character of the article and inputting it into the smoothness discrimination model, which outputs the probability distribution of the 2nd character of the article; calculating and recording the error between the 2nd character of the article and the probability distribution output by the model;
a-2.4) repeating step a-2.3) until the probability distributions of all characters output by the smoothness discrimination model have been obtained; calculating and recording the error between the final probability distribution and the end symbol of the article;
a-2.5) using the errors recorded in step a-2.4) to optimize the parameters of the smoothness discrimination model of the neural network; stopping training if the error is minimized, otherwise jumping to execute step a-2.1).
Further, the high-dimensional semantic space training in the step a-3) comprises the following steps:
a-3.1) initializing the vector set {V_0, V_1, V_2, V_3, ..., V_n}, where V_i is the high-dimensional vector representation of the i-th character of the dictionary of a-1), 1 ≤ i ≤ n, V_0 is the high-dimensional vector representation of characters that do not exist in the dictionary, and n is the number of characters in the dictionary;
a-3.2) extracting k consecutive characters from the original corpus data set to form a character fragment;
a-3.3) converting each character of the fragment into its code using the dictionary to form the code sequence corresponding to the fragment;
a-3.4) extracting the corresponding vectors in order from the vector set {V_0, V_1, V_2, V_3, ..., V_n} according to the code sequence to form a vector sequence;
a-3.5) optimizing the vector set; training stops when the cosine similarity between any two vectors of the vector sequence is maximized and the cosine similarity with vectors outside the vector sequence is minimized, otherwise jumping to execute step a-3.2).
The invention has the beneficial effects that: the method fully takes into account the different information preferences of different scenes, so that differentiated abstracts of the same document can be extracted under different scenes; moreover, when the text abstract generation system is trained, no data in which texts and abstracts correspond one to one is used, which reduces the data cost.
Detailed Description
The present invention is further explained below.
A multi-scene text abstract generation method comprises model learning and model use, wherein the specific model learning comprises the following steps:
a-1) obtaining an unlabeled original corpus data set, the original corpus data set being a number of complete articles; after duplicates are removed, the characters appearing in the original corpus data set are coded without repetition, the codes being consecutive positive integers, and the one-to-one correspondence between characters and codes is stored as a dictionary. The original corpus data set can be obtained from existing data sets, news articles, encyclopedia articles, medical records and the like.
a-2) training the smoothness discrimination model of the neural network on the obtained original corpus data set so as to minimize the error of the smoothness discrimination model. The smoothness discrimination model can be trained with deep learning architectures such as an RNN, GRU, LSTM or Transformer.
a-3) performing high-dimensional semantic space training on the obtained original corpus data set. The training method can be skip-gram, CBOW, GloVe or the like.
a-4) acquiring the scene corpus data sets for which abstracts are required, expressed as {T_1, T_2, T_3, ..., T_m}, where T_i is the article set under the i-th scene, 1 ≤ i ≤ m, i is a positive integer and m is the number of scenes; writing the number of articles under each scene of the scene corpus data set as the vector {l_1, l_2, l_3, ..., l_m}, where l_i is the number of articles in T_i; and constructing for each scene a weight vector {λ_i0, λ_i1, λ_i2, ..., λ_in} over the dictionary, where λ_ij is the summary weight of the character coded j in the dictionary under the i-th scene, 0 ≤ j ≤ n, and n is the number of characters in the dictionary,
[formula image BDA0002312087630000061: definition of the summary weight λ_ij]
where N_ij is the number of articles of the i-th scene in which the character coded j appears, l_k is the number of articles in T_k, and T_k is the article set under the k-th scene;
a-5) initializing an encoding-decoding model of the neural network, extracting an article from the original corpus data set, extracting a plurality of sentences from the article, and forming a sentence set from these sentences;
a-6) inputting the sentence set into the encoder of the encoding-decoding model, decoding with the decoder using a decoding algorithm, and recording the decoding result and the character probability distribution of each position in the decoding process;
a-7) inputting the characters of the decoding result into the smoothness discrimination model in sequence, and recording the character probability distribution of each position output by the smoothness discrimination model;
a-8) calculating, position by position, the error between the character probability distribution of each position in the decoding process and the character probability distribution of the corresponding position output by the smoothness discrimination model;
a-9) adjusting the encoding-decoding model with a neural network optimization algorithm so as to optimize the error of step a-8); stopping training if the error is minimized, otherwise jumping to execute step a-5);
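A minimal sketch of the training loop of steps a-5) to a-9), assuming PyTorch, is given below. The architecture (a small GRU encoder-decoder), greedy decoding, the start-symbol code and the use of KL divergence as the per-position error are assumptions rather than the patent's exact choices; the `smoothness_model` argument is assumed to map a code sequence to per-position character distributions, as sketched after step a-2.5) further below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2Seq(nn.Module):
    """Tiny GRU encoder-decoder over character codes (illustrative, not the patent's exact architecture)."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src, max_len=50, start_code=0):
        """Greedy decoding (step a-6): returns the decoded codes and the character
        probability distribution of every decoding position."""
        _, h = self.enc(self.emb(src))
        tok = torch.full((src.size(0), 1), start_code, dtype=torch.long)
        codes, dists = [], []
        for _ in range(max_len):
            o, h = self.dec(self.emb(tok), h)
            p = F.softmax(self.out(o[:, -1]), dim=-1)
            tok = p.argmax(-1, keepdim=True)
            codes.append(tok)
            dists.append(p)
        return torch.cat(codes, dim=1), torch.stack(dists, dim=1)

def training_step(seq2seq, smoothness_model, sentence_codes, optimizer, start_code=0):
    """Steps a-6) to a-9): decode, re-score the decoded characters with the frozen
    smoothness model, and minimise the per-position divergence of the two distributions."""
    codes, dec_dist = seq2seq(sentence_codes, start_code=start_code)
    with torch.no_grad():                                     # step a-7): feed <start>, c1, ..., c(T-1)
        prefix = torch.cat([torch.full_like(codes[:, :1], start_code), codes[:, :-1]], dim=1)
        ref_dist = smoothness_model(prefix)
    loss = F.kl_div(dec_dist.clamp_min(1e-9).log(), ref_dist,
                    reduction="batchmean")                    # step a-8)
    optimizer.zero_grad(); loss.backward(); optimizer.step()  # step a-9)
    return loss.item()
```

In this reading, the smoothness model acts as a frozen teacher: the encoder-decoder is pushed towards producing character distributions that the fluency model itself would assign, which is what allows training without paired text-abstract data.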
the model use comprises the following steps:
b-1) given an article to be summarized, breaking it into sentences in their order of appearance to form the set {S_1, S_2, S_3, ..., S_o}, where o is the number of sentences and the i-th sentence S_i has length L_i, 1 ≤ i ≤ o;
b-2) the set of characters of the first sentence S_1 of the article to be summarized is
[formula image BDA0002312087630000062: character set of the first sentence S_1]
for each character of the first sentence, the corresponding code is looked up in the dictionary, the corresponding vector is taken out of the trained high-dimensional semantic space, and the extracted vectors are arranged in the order of appearance of the corresponding characters to form a vector sequence;
b-3) repeating step b-2) to obtain the vector sequence of every sentence S_1 to S_o of the article to be summarized, with V_Sij denoting the j-th vector of the i-th sentence;
b-4) from the weight vectors {λ_i0, λ_i1, λ_i2, ..., λ_in} of step a-4), taking out the character weight vector of the k-th scene, written {λ_k0, λ_k1, λ_k2, ..., λ_kn};
b-5) defining a sentence selection vector of length o as {h_1, h_2, h_3, ..., h_o}; if the selection value h_i of the i-th sentence equals 0, the i-th sentence S_i of the set {S_1, S_2, S_3, ..., S_o} is not in the extracted key sentence set; if h_i equals 1, the i-th sentence S_i is in the extracted key sentence set;
b-6) using the formula
[formula image BDA0002312087630000071: objective function over the sentence selection vector]
to calculate the sentence selection vector that maximizes the objective function, where λ_ki is the summary weight of the character coded i under the k-th scene, h_j is the selection value of the j-th sentence, V_i is the vector in the high-dimensional semantic space corresponding to the character coded i, V_Sjt is the vector in the high-dimensional semantic space corresponding to the t-th character of the j-th sentence of the sentence set to be summarized, |V_i| is the modulus of the vector V_i, and |V_Sjt| is the modulus of V_Sjt;
b-7) extracting the sentences whose value equals 1 in the sentence selection vector calculated in b-6) to form the key sentence set;
b-8) returning to step a-6) with the key sentence set in place of the sentence set, inputting it into the trained encoding-decoding model, and decoding with the decoder to obtain the final document abstract.
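Because the objective of step b-6) is only given as a figure, the sketch below is a hypothetical instantiation of steps b-2) to b-7), assuming Python/NumPy: each sentence is converted into its vector sequence, scored by the scene-weighted cosine similarity between its character vectors V_Sjt and the dictionary vectors V_i, and the selection vector h keeps the top-scoring sentences. The scoring formula, the length normalisation and the cap on the number of selected sentences are assumptions, not the patent's exact objective.

```python
import numpy as np

def sentence_to_vectors(sentence, dictionary, V):
    """Steps b-2)/b-3): map each character to its code and stack the corresponding
    rows of the trained semantic space V (row 0 is used for unknown characters)."""
    return np.stack([V[dictionary.get(ch, 0)] for ch in sentence])   # shape (L_j, dim)

def sentence_score(sent_vecs, V, lam_k):
    """Hypothetical score: scene-weighted sum of cosine similarities between the
    character vectors of one sentence (V_Sjt) and the dictionary vectors (V_i)."""
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-9)
    Sn = sent_vecs / (np.linalg.norm(sent_vecs, axis=1, keepdims=True) + 1e-9)
    return float(((Sn @ Vn.T) * lam_k).sum())

def select_key_sentences(sentences, dictionary, V, lam_k, max_sentences=3):
    """Steps b-5) to b-7): build the 0/1 selection vector h; the sentence-count cap is
    an assumed constraint, since an unconstrained maximum would keep every sentence."""
    seqs = [sentence_to_vectors(s, dictionary, V) for s in sentences]
    scores = [sentence_score(vs, V, lam_k) / len(vs) for vs in seqs]  # length-normalised
    keep = set(np.argsort(scores)[::-1][:max_sentences].tolist())
    return [1 if i in keep else 0 for i in range(len(sentences))]
```

The sentences with h_i = 1 form the key sentence set, which is then fed to the trained encoding-decoding model as in step b-8).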
By the method, the different information preferences of different scenes are fully taken into account, so that differentiated abstracts of the same document can be extracted under different scenes; moreover, when the text abstract generation system is trained, no data in which texts and abstracts correspond one to one is used, which reduces the data cost.
Further, the characters in step a) are Chinese characters, English words or numbers; continuous digits containing a decimal point and continuous English strings are extracted as whole units; tab and space characters are deleted; split units of the same type are merged, the whole data set is traversed to obtain a set of non-repeated split units, and the final set is coded and stored in json format (for example: { '的': 1, ... }).
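A minimal sketch of the dictionary construction of step a-1) with the character handling described above is given below, in Python; the function and file names are illustrative assumptions, and whole-unit extraction of decimal numbers and English words is omitted for brevity.

```python
import json

def build_char_dictionary(articles):
    """Assign consecutive positive-integer codes to the distinct characters of the corpus.
    Code 0 is left free for characters that are not in the dictionary (used later as V_0)."""
    dictionary = {}
    for article in articles:
        for ch in article:
            if ch in ('\t', ' '):          # tab and space characters are deleted
                continue
            if ch not in dictionary:       # repeated characters are coded only once
                dictionary[ch] = len(dictionary) + 1
    return dictionary

if __name__ == "__main__":
    corpus = ["患者主诉头痛三天。", "血压 140/90 mmHg。"]
    with open("char_dict.json", "w", encoding="utf-8") as f:
        json.dump(build_char_dictionary(corpus), f, ensure_ascii=False)
```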
Further, the training method of the smoothness discrimination model in step a-2) comprises the following steps:
a-2.1) initializing the smoothness discrimination model of the neural network, extracting an article from the obtained original corpus data set, and inputting the start symbol of the article into the smoothness discrimination model;
a-2.2) the smoothness discrimination model outputs the probability distribution of the first character of the article; calculating and recording the error between the first character of the article and the probability distribution output by the model;
a-2.3) forming a sequence of length 2 from the start symbol and the first character of the article and inputting it into the smoothness discrimination model, which outputs the probability distribution of the 2nd character of the article; calculating and recording the error between the 2nd character of the article and the probability distribution output by the model;
a-2.4) repeating step a-2.3) until the probability distributions of all characters output by the smoothness discrimination model have been obtained; calculating and recording the error between the final probability distribution and the end symbol of the article;
a-2.5) using the errors recorded in step a-2.4) to optimize the parameters of the smoothness discrimination model of the neural network; stopping training if the error is minimized, otherwise jumping to execute step a-2.1).
The finally trained model can infer the probability of the next character from the given information; the output is a vector whose length equals the size of the dictionary stored in json format, and each position of the vector holds the probability of the character with the corresponding index in the dictionary.
Given a character sequence, the start symbol is input first and the probability distribution of the first character is output; then the first character is input and the probability distribution of the second character is output; this continues in turn until all characters have been input, at which point the probability distribution corresponding to each position is obtained.
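Below is a minimal sketch of such a smoothness discrimination model and of the per-article training of steps a-2.1) to a-2.5), assuming PyTorch and a GRU (the patent equally allows an RNN, LSTM or Transformer); the class and function names and the codes chosen for the start and end symbols are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothnessModel(nn.Module):
    """Character-level language model used as the smoothness discrimination model."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, codes):
        """codes: (batch, length) -> per-position probability distribution of the next character."""
        h, _ = self.rnn(self.emb(codes))
        return F.softmax(self.out(h), dim=-1)

def train_on_article(model, article_codes, optimizer, start_code, end_code):
    """Steps a-2.1) to a-2.5) for one article: feed <start>, c1, ..., cT and accumulate the
    per-position errors against c1, ..., cT, <end> (the symbol codes are assumed extra entries)."""
    seq = torch.tensor([[start_code] + article_codes])
    target = torch.tensor([article_codes + [end_code]])
    probs = model(seq)                                         # distribution at every position
    loss = F.nll_loss(probs.clamp_min(1e-9).log().squeeze(0), target.squeeze(0))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Running the model once over the whole prefixed sequence yields the same per-position distributions as feeding the prefixes one by one as in steps a-2.2) to a-2.4), since the GRU is autoregressive; the one-pass version is simply more efficient.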
Further, the high-dimensional semantic space training in the step a-3) comprises the following steps:
a-3.1) initializing the vector set {V_0, V_1, V_2, V_3, ..., V_n}, where V_i is the high-dimensional vector representation of the i-th character of the dictionary of a-1), 1 ≤ i ≤ n, V_0 is the high-dimensional vector representation of characters that do not exist in the dictionary, and n is the number of characters in the dictionary;
a-3.2) extracting k consecutive characters from the original corpus data set to form a character fragment;
a-3.3) converting each character of the fragment into its code using the dictionary to form the code sequence corresponding to the fragment;
a-3.4) extracting the corresponding vectors in order from the vector set {V_0, V_1, V_2, V_3, ..., V_n} according to the code sequence to form a vector sequence;
a-3.5) optimizing the vector set; training stops when the cosine similarity between any two vectors of the vector sequence is maximized and the cosine similarity with vectors outside the vector sequence is minimized, otherwise jumping to execute step a-3.2).
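A rough sketch of steps a-3.1) to a-3.5), assuming PyTorch: the vectors of characters occurring in the same k-character fragment are pulled together in cosine similarity and pushed away from randomly drawn characters outside the fragment. The negative-sampling style objective and the hyper-parameters are assumptions; skip-gram, CBOW or GloVe, as mentioned for step a-3), would be drop-in alternatives.

```python
import random
import torch
import torch.nn.functional as F

def train_semantic_space(code_corpus, n, dim=100, k=5, steps=10000, lr=0.05):
    """code_corpus: list of articles, each already converted to a list of dictionary codes.
    Returns the vector set {V_0, ..., V_n} as an (n+1, dim) tensor (row 0 = unknown characters)."""
    V = torch.randn(n + 1, dim, requires_grad=True)            # step a-3.1)
    opt = torch.optim.SGD([V], lr=lr)
    for _ in range(steps):
        doc = random.choice(code_corpus)
        if len(doc) < k:
            continue
        start = random.randrange(len(doc) - k + 1)
        frag = doc[start:start + k]                            # steps a-3.2) to a-3.4)
        neg = [random.randrange(1, n + 1) for _ in range(k)]   # codes drawn outside the fragment
        pos_sim = F.cosine_similarity(V[frag[:-1]], V[frag[1:]]).mean()
        neg_sim = F.cosine_similarity(V[frag], V[neg]).mean()
        loss = neg_sim - pos_sim                               # step a-3.5) objective
        opt.zero_grad(); loss.backward(); opt.step()
    return V.detach()
```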

Claims (4)

1. A multi-scene text abstract generation method is characterized by comprising model learning and model use, wherein the specific model learning comprises the following steps:
a-1) acquiring an unlabeled original corpus data set, the original corpus data set being a number of complete articles; after duplicates are removed, the characters appearing in the original corpus data set are coded without repetition, the codes being consecutive positive integers, and the one-to-one correspondence between characters and codes is stored as a dictionary;
a-2) training the smoothness discrimination model of the neural network on the obtained original corpus data set so as to minimize the error of the smoothness discrimination model;
a-3) carrying out high-dimensional semantic space training through the obtained original corpus data set;
a-4) acquiring the scene corpus data sets for which abstracts are required, expressed as {T_1, T_2, T_3, ..., T_m}, where T_i is the article set under the i-th scene, 1 ≤ i ≤ m, i is a positive integer and m is the number of scenes; writing the number of articles under each scene of the scene corpus data set as the vector {l_1, l_2, l_3, ..., l_m}, where l_i is the number of articles in T_i; and constructing for each scene a weight vector {λ_i0, λ_i1, λ_i2, ..., λ_in} over the dictionary, where λ_ij is the summary weight of the character coded j in the dictionary under the i-th scene, 0 ≤ j ≤ n, and n is the number of characters in the dictionary,
[formula image FDA0002312087620000011: definition of the summary weight λ_ij]
where N_ij is the number of articles of the i-th scene in which the character coded j appears, l_k is the number of articles in T_k, and T_k is the article set under the k-th scene;
a-5) initializing an encoding-decoding model of the neural network, extracting an article from the original corpus data set, extracting a plurality of sentences from the article, and forming a sentence set from these sentences;
a-6) inputting the sentence set into the encoder of the encoding-decoding model, decoding with the decoder using a decoding algorithm, and recording the decoding result and the character probability distribution of each position in the decoding process;
a-7) inputting the characters of the decoding result into the smoothness discrimination model in sequence, and recording the character probability distribution of each position output by the smoothness discrimination model;
a-8) calculating, position by position, the error between the character probability distribution of each position in the decoding process and the character probability distribution of the corresponding position output by the smoothness discrimination model;
a-9) adjusting the encoding-decoding model with a neural network optimization algorithm so as to optimize the error of step a-8); stopping training if the error is minimized, otherwise jumping to execute step a-5);
the model use comprises the following steps:
b-1) given an article to be summarized, breaking it into sentences in their order of appearance to form the set {S_1, S_2, S_3, ..., S_o}, where o is the number of sentences and the i-th sentence S_i has length L_i, 1 ≤ i ≤ o;
b-2) the set of characters of the first sentence S_1 of the article to be summarized is
[formula image FDA0002312087620000021: character set of the first sentence S_1]
for each character of the first sentence, the corresponding code is looked up in the dictionary, the corresponding vector is taken out of the trained high-dimensional semantic space, and the extracted vectors are arranged in the order of appearance of the corresponding characters to form a vector sequence;
b-3) repeating step b-2) to obtain the vector sequence of every sentence S_1 to S_o of the article to be summarized, with V_Sij denoting the j-th vector of the i-th sentence;
b-4) from the weight vectors {λ_i0, λ_i1, λ_i2, ..., λ_in} of step a-4), taking out the character weight vector of the k-th scene, written {λ_k0, λ_k1, λ_k2, ..., λ_kn};
b-5) defining a sentence selection vector of length o as {h_1, h_2, h_3, ..., h_o}; if the selection value h_i of the i-th sentence equals 0, the i-th sentence S_i of the set {S_1, S_2, S_3, ..., S_o} is not in the extracted key sentence set; if h_i equals 1, the i-th sentence S_i is in the extracted key sentence set;
b-6) using the formula
[formula image FDA0002312087620000031: objective function over the sentence selection vector]
to calculate the sentence selection vector that maximizes the objective function, where λ_ki is the summary weight of the character coded i under the k-th scene, h_j is the selection value of the j-th sentence, V_i is the vector in the high-dimensional semantic space corresponding to the character coded i, V_Sjt is the vector in the high-dimensional semantic space corresponding to the t-th character of the j-th sentence of the sentence set to be summarized, |V_i| is the modulus of the vector V_i, and |V_Sjt| is the modulus of V_Sjt;
b-7) extracting the sentences whose value equals 1 in the sentence selection vector calculated in b-6) to form the key sentence set;
b-8) returning to step a-6) with the key sentence set in place of the sentence set, inputting it into the trained encoding-decoding model, and decoding with the decoder to obtain the final document abstract.
2. The multi-scene text abstract generation method according to claim 1, characterized in that: the characters in step a) are Chinese characters, English words or numbers; continuous digits containing a decimal point and continuous English strings are extracted as whole units; tab and space characters are deleted; and the codes of all characters are stored in json format.
3. The multi-scene text abstract generation method according to claim 1, wherein the training method of the smoothness discrimination model in step a-2) comprises the following steps:
a-2.1) initializing the smoothness discrimination model of the neural network, extracting an article from the obtained original corpus data set, and inputting the start symbol of the article into the smoothness discrimination model;
a-2.2) the smoothness discrimination model outputs the probability distribution of the first character of the article; calculating and recording the error between the first character of the article and the probability distribution output by the model;
a-2.3) forming a sequence of length 2 from the start symbol and the first character of the article and inputting it into the smoothness discrimination model, which outputs the probability distribution of the 2nd character of the article; calculating and recording the error between the 2nd character of the article and the probability distribution output by the model;
a-2.4) repeating step a-2.3) until the probability distributions of all characters output by the smoothness discrimination model have been obtained; calculating and recording the error between the final probability distribution and the end symbol of the article;
a-2.5) using the errors recorded in step a-2.4) to optimize the parameters of the smoothness discrimination model of the neural network; stopping training if the error is minimized, otherwise jumping to execute step a-2.1).
4. The multi-scene text abstract generation method according to claim 1, wherein the training of the high-dimensional semantic space in step a-3) comprises the following steps:
a-3.1) initializing the vector set {V_0, V_1, V_2, V_3, ..., V_n}, where V_i is the high-dimensional vector representation of the i-th character of the dictionary of a-1), 1 ≤ i ≤ n, V_0 is the high-dimensional vector representation of characters that do not exist in the dictionary, and n is the number of characters in the dictionary;
a-3.2) extracting k consecutive characters from the original corpus data set to form a character fragment;
a-3.3) converting each character of the fragment into its code using the dictionary to form the code sequence corresponding to the fragment;
a-3.4) extracting the corresponding vectors in order from the vector set {V_0, V_1, V_2, V_3, ..., V_n} according to the code sequence to form a vector sequence;
a-3.5) optimizing the vector set; training stops when the cosine similarity between any two vectors of the vector sequence is maximized and the cosine similarity with vectors outside the vector sequence is minimized, otherwise jumping to execute step a-3.2).
CN201911264821.5A 2019-12-11 2019-12-11 Multi-scene text abstract generation method Active CN111046672B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911264821.5A (CN111046672B) | 2019-12-11 | 2019-12-11 | Multi-scene text abstract generation method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911264821.5A (CN111046672B) | 2019-12-11 | 2019-12-11 | Multi-scene text abstract generation method

Publications (2)

Publication Number | Publication Date
CN111046672A | 2020-04-21
CN111046672B | 2020-07-14

Family

ID=70235588

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911264821.5A (Active, CN111046672B) | Multi-scene text abstract generation method | 2019-12-11 | 2019-12-11

Country Status (1)

Country Link
CN (1) CN111046672B (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN107092674A (en) * 2017-04-14 2017-08-25 福建工程学院 The automatic abstracting method and system of a kind of Chinese medicine acupuncture field event trigger word
US20180373986A1 (en) * 2017-06-26 2018-12-27 QbitLogic, Inc. Machine learning using dynamic multilayer perceptrons
CN108062351A (en) * 2017-11-14 2018-05-22 厦门市美亚柏科信息股份有限公司 Text snippet extracting method, readable storage medium storing program for executing on particular topic classification
CN109977981A (en) * 2017-12-27 2019-07-05 深圳市优必选科技有限公司 Scene analytic method, robot and storage device based on binocular vision
CN110362654A (en) * 2018-04-09 2019-10-22 谢碧青 Target group data acquires classification method
CN110162778A (en) * 2019-04-02 2019-08-23 阿里巴巴集团控股有限公司 The generation method and device of text snippet
CN110196903A (en) * 2019-05-06 2019-09-03 中国海洋大学 A kind of method and system for for article generation abstract
CN110134964A (en) * 2019-05-20 2019-08-16 中国科学技术大学 A kind of text matching technique based on stratification convolutional neural networks and attention mechanism
CN110210037A (en) * 2019-06-12 2019-09-06 四川大学 Category detection method towards evidence-based medicine EBM field
CN110287309A (en) * 2019-06-21 2019-09-27 深圳大学 The method of rapidly extracting text snippet
CN110347819A (en) * 2019-06-21 2019-10-18 同济大学 A kind of text snippet generation method based on positive negative sample dual training
CN110263257A (en) * 2019-06-24 2019-09-20 北京交通大学 Multi-source heterogeneous data mixing recommended models based on deep learning
CN110491465A (en) * 2019-08-20 2019-11-22 山东众阳健康科技集团有限公司 Classification of diseases coding method, system, equipment and medium based on deep learning
CN110473636A (en) * 2019-08-22 2019-11-19 山东众阳健康科技集团有限公司 Intelligent doctor's advice recommended method and system based on deep learning
CN110532328A (en) * 2019-08-26 2019-12-03 哈尔滨工程大学 A kind of text concept figure building method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG LIU et al.: "Text Summarization with Pretrained Encoders", https://arxiv.org/pdf/1908.08345.pdf *
张弛: "Text Summarization Algorithm Based on Semantic Reconstruction", China Masters' Theses Full-text Database, Information Science and Technology Series *
胡莺夕: "Research and Implementation of Multi-Entity Relation Recognition and Automatic Text Summarization Methods Based on Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116976290A (en) * 2023-06-19 2023-10-31 珠海盈米基金销售有限公司 Multi-scene information abstract generation method and device based on autoregressive model
CN116976290B (en) * 2023-06-19 2024-03-19 珠海盈米基金销售有限公司 Multi-scene information abstract generation method and device based on autoregressive model

Also Published As

Publication number | Publication date
CN111046672B (en) | 2020-07-14

Similar Documents

Publication Publication Date Title
CN111897949B (en) Guided text abstract generation method based on Transformer
CN110209801B (en) Text abstract automatic generation method based on self-attention network
CN110795556B (en) Abstract generation method based on fine-grained plug-in decoding
CN109190131B (en) Neural machine translation-based English word and case joint prediction method thereof
CN111694924B (en) Event extraction method and system
CN110275936B (en) Similar legal case retrieval method based on self-coding neural network
CN111178093B (en) Neural machine translation system training acceleration method based on stacking algorithm
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN112749253B (en) Multi-text abstract generation method based on text relation graph
CN115438154A (en) Chinese automatic speech recognition text restoration method and system based on representation learning
CN115034218A (en) Chinese grammar error diagnosis method based on multi-stage training and editing level voting
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN113065349A (en) Named entity recognition method based on conditional random field
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN113423004A (en) Video subtitle generating method and system based on decoupling decoding
CN113221542A (en) Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN111046672B (en) Multi-scene text abstract generation method
CN111444720A (en) Named entity recognition method for English text
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN112364647A (en) Duplicate checking method based on cosine similarity algorithm
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN110704664A (en) Hash retrieval method
CN116069924A (en) Text abstract generation method and system integrating global and local semantic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
    Address after: 12 / F, building 1, Aosheng building, 1166 Xinluo street, hi tech Zone, Jinan City, Shandong Province
    Patentee after: Zhongyang Health Technology Group Co.,Ltd.
    Address before: 12 / F, building 1, Aosheng building, 1166 Xinluo street, high tech Zone, Jinan City, Shandong Province
    Patentee before: SHANDONG MSUNHEALTH TECHNOLOGY GROUP Co.,Ltd.