WO2019169719A1 - Automatic abstract extraction method, apparatus, computer device and storage medium - Google Patents

Automatic abstract extraction method, apparatus, computer device and storage medium (文摘自动提取方法、装置、计算机设备及存储介质) Download PDF

Info

Publication number
WO2019169719A1
WO2019169719A1 (PCT/CN2018/085249; CN2018085249W)
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
word
state
implicit
lstm
Prior art date
Application number
PCT/CN2018/085249
Other languages
English (en)
French (fr)
Inventor
林林
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Priority to JP2019557629A priority Critical patent/JP6955580B2/ja
Priority to US16/645,491 priority patent/US20200265192A1/en
Priority to SG11202001628VA priority patent/SG11202001628VA/en
Publication of WO2019169719A1 publication Critical patent/WO2019169719A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • The present application relates to the technical field of abstract extraction, and in particular, to an automatic abstract extraction method, apparatus, computer device, and storage medium.
  • At present, abstracts of articles are produced with an extraction-based approach. An extractive abstract selects the most representative key sentences of the article as its abstract, as follows: 1) the article is first segmented into words and stop words are removed, yielding the basic phrases that make up the article; 2) word frequencies are then computed to obtain high-frequency words, and the sentences containing the high-frequency words are taken as key sentences; 3) finally, a specified number of key sentences are combined into the abstract.
  • The above extractive method is better suited to styles such as news and argumentative essays, in which summarizing long sentences often appear in the text.
  • For example, in financial articles the high-frequency words are often "cash", "stock", "central bank", "interest", and so on, and the extraction result is often a long sentence such as "the central bank's rate hike causes stock prices to fall, and 'cash is king' has become common knowledge among investors".
  • The extractive method therefore has significant limitations: if a representative "key sentence" is missing from the processed text, the extraction result is likely to be meaningless, especially for conversational text.
  • The present application provides an automatic abstract extraction method, apparatus, computer device, and storage medium, which aim to solve the problem that the extractive approach of the prior art is only applicable to styles such as news and argumentative essays in which summarizing long sentences appear in the text, and produces inaccurate abstracts for texts that contain no key sentences.
  • In a first aspect, the present application provides an automatic abstract extraction method, which includes: sequentially acquiring the characters included in a target text, and inputting the characters in order into the first-layer LSTM structure of an LSTM model for encoding, to obtain a sequence of hidden states, where the LSTM model is a long short-term memory neural network; inputting the sequence of hidden states into the second-layer LSTM structure of the LSTM model for decoding, to obtain the word sequence of the abstract; inputting the word sequence of the abstract into the first-layer LSTM structure of the LSTM model for encoding, to obtain an updated sequence of hidden states; obtaining, according to the contribution values of the encoder hidden states in the updated sequence of hidden states, the context vector corresponding to those contribution values; and obtaining, according to the updated sequence of hidden states and the context vector, the probability distribution of the words in the updated sequence of hidden states, and outputting the word with the highest probability in that distribution as the abstract of the target text.
  • In a second aspect, the present application provides an automatic abstract extraction apparatus, which includes:
  • a first input unit configured to sequentially acquire the characters included in the target text and input them in order into the first-layer LSTM structure of the LSTM model for encoding, to obtain a sequence of hidden states, where the LSTM model is a long short-term memory neural network;
  • a second input unit configured to input the sequence of hidden states into the second-layer LSTM structure of the LSTM model for decoding, to obtain the word sequence of the abstract;
  • a third input unit configured to input the word sequence of the abstract into the first-layer LSTM structure of the LSTM model for encoding, to obtain an updated sequence of hidden states;
  • a context vector obtaining unit configured to obtain, according to the contribution values of the encoder hidden states in the updated sequence of hidden states, the context vector corresponding to those contribution values;
  • an abstract obtaining unit configured to obtain, according to the updated sequence of hidden states and the context vector, the probability distribution of the words in the updated sequence of hidden states, and to output the word with the highest probability in that distribution as the abstract of the target text.
  • In a third aspect, the present application further provides a computer device comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor, when executing the computer program, implements the automatic abstract extraction method of any embodiment provided in this application.
  • In a fourth aspect, the present application further provides a storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the automatic abstract extraction method of any embodiment provided in this application.
  • The present application provides an automatic abstract extraction method, apparatus, computer device, and storage medium. The method encodes and decodes the target text with an LSTM model and combines the result with context variables to obtain an abstract of the target text; the abstract is produced in a summarizing manner, which improves the accuracy of abstract extraction.
  • FIG. 1 is a schematic flowchart of an automatic abstract extraction method according to an embodiment of the present application;
  • FIG. 2 is another schematic flowchart of the automatic abstract extraction method according to an embodiment of the present application;
  • FIG. 3 is a schematic sub-flowchart of the automatic abstract extraction method according to an embodiment of the present application;
  • FIG. 4 is a schematic block diagram of an automatic abstract extraction apparatus according to an embodiment of the present application;
  • FIG. 5 is another schematic block diagram of the automatic abstract extraction apparatus according to an embodiment of the present application;
  • FIG. 6 is a schematic block diagram of subunits of the automatic abstract extraction apparatus according to an embodiment of the present application;
  • FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of an automatic extracting method according to an embodiment of the present application.
  • the method is applied to terminals such as desktop computers, laptop computers, and tablet computers.
  • the method includes steps S101 to S105.
  • S101: Sequentially acquire the characters included in the target text, and input the characters in order into the first-layer LSTM structure of the LSTM model for encoding, to obtain a sequence of hidden states, where the LSTM model is a long short-term memory neural network.
  • In this embodiment, the characters included in the target text are first obtained by word segmentation; the obtained characters are Chinese characters or English characters, and through this processing the target text is split into a plurality of characters. For example, when segmenting a Chinese article, the following steps are taken: 1) for a sub-string S to be segmented, take out all candidate words w1, w2, ..., wi, ..., wn in left-to-right order; 2) look up the probability value P(wi) of each candidate word in the dictionary, and record all left-neighbor words of each candidate word; 3) compute the cumulative probability of each candidate word and, by comparison, obtain the best left-neighbor word of each candidate word; 4) if the current word wn is the last word of the string S and its cumulative probability P(wn) is the largest, then wn is the end word of S; 5) starting from wn, output the best left-neighbor word of each word in right-to-left order, which is the word segmentation result of S.
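  • By way of a non-limiting editorial illustration (this sketch is not part of the original application), the maximum-probability segmentation described above can be written as a small dynamic program; the toy dictionary, its probability values, and the fallback probability for unknown single characters are assumptions:

```python
def segment(sentence, word_probs, max_word_len=4):
    """Maximum-probability segmentation: pick the best left-neighbor word for each
    position (steps 1-4 above), then read the result out right to left (step 5)."""
    n = len(sentence)
    best = [0.0] * (n + 1)   # best cumulative probability of a segmentation ending at i
    prev = [0] * (n + 1)     # start index of the word that achieves best[i]
    best[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            # Unknown single characters get a tiny fallback probability (an assumption).
            p = word_probs.get(word, 1e-8 if len(word) == 1 else 0.0)
            if p > 0 and best[j] * p > best[i]:
                best[i], prev[i] = best[j] * p, j
    words, i = [], n
    while i > 0:              # recover the segmentation from right to left
        words.append(sentence[prev[i]:i])
        i = prev[i]
    return list(reversed(words))

# Illustrative dictionary and sentence only.
word_probs = {"央行": 0.02, "加息": 0.01, "导致": 0.03, "股价": 0.02, "下跌": 0.02}
print(segment("央行加息导致股价下跌", word_probs))  # ['央行', '加息', '导致', '股价', '下跌']
```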
  • After the characters included in the target text have been acquired in order, they are input into the LSTM model, which has been trained on historical data, for processing.
  • The LSTM model is a long short-term memory neural network. LSTM stands for Long Short-Term Memory, a type of recurrent neural network over time; it is suitable for processing and predicting important events with very long intervals and delays in a time series.
  • Through the LSTM model, the characters included in the target text can be encoded, which serves as the pre-processing for extracting the abstract of the text.
  • the key to LSTM is the Cell State, which can be thought of as a horizontal line across the top of the entire cell.
  • the cell state is similar to a conveyor belt, which passes directly through the entire chain, with only a few small linear interactions.
  • the information carried on the cell state can easily flow without changing.
  • The LSTM has the ability to add or delete information in the cell state. This capability is controlled by gate structures, i.e., a gate can selectively let information through; a gate structure consists of a Sigmoid neural network layer and an element-wise multiplication operation.
  • the Sigmoid layer outputs values between 0 and 1, each value indicating whether the corresponding partial information should pass. A value of 0 means that information is not allowed to pass, and a value of 1 means that all information is passed.
  • An LSTM has three gates to protect and control the state of the cell.
  • The LSTM includes at least three gates, as follows:
  • 1) the forget gate, which determines how much of the unit state from the previous moment is retained at the current moment;
  • 2) the input gate, which determines how much of the network input at the current moment is saved into the unit state;
  • 3) the output gate, which determines how much of the unit state is output as the current output value of the LSTM.
  • In an embodiment, the LSTM model is a gated recurrent unit (GRU), and the model of the gated recurrent unit is as follows: z_t = σ(W_z·[h_{t-1}, x_t]); r_t = σ(W_r·[h_{t-1}, x_t]); h̃_t = tanh(W·[r_t ⊙ h_{t-1}, x_t]); h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t.
  • Here W_z, W_r, and W are weight parameter values obtained by training, x_t is the input, h_{t-1} is the hidden state, z_t is the update state, r_t is the reset signal, h̃_t is the new memory corresponding to the hidden state h_{t-1}, h_t is the output, σ() is the sigmoid function, and tanh() is the hyperbolic tangent function.
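  • As a non-limiting editorial sketch (not part of the original application), one step of the gated recurrent unit above can be written as follows; the weight shapes and the convention of concatenating [h_{t-1}, x_t] are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step following the update/reset/new-memory equations above."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                   # update state z_t
    r_t = sigmoid(W_r @ concat)                                   # reset signal r_t
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))    # new memory h̃_t
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # output h_t
```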
  • Once the characters included in the target text have been encoded by the first-layer LSTM structure, they are converted into a sequence of hidden states; continuing to decode this sequence yields the initially processed sequence, which realizes precise extraction of the candidate word segments.
  • In an embodiment, as shown in FIG. 2, before step S101 the method further includes:
  • S101a: Place a plurality of historical texts from a corpus into the first-layer LSTM structure, place the abstracts corresponding to the historical texts into the second-layer LSTM structure, and perform training to obtain the LSTM model.
  • The overall framework of the LSTM model is fixed; the model is obtained simply by setting the parameters of each of its layers, such as the input layer, hidden layer, and output layer. These layer parameters can be determined experimentally over multiple trials to obtain the optimal parameter values. For example, if the hidden layer has 10 nodes and the value of each node can be taken from 1 to 10, then 100 combinations are tried to obtain 100 training models; these 100 models are then trained with a large amount of data, and an optimal training model is selected according to accuracy and similar criteria.
  • The node values and other parameters corresponding to this optimal training model are the optimal parameters (it can be understood that W_z, W_r, and W in the GRU model above are the optimal parameters here). Applying the optimal training model to this scheme as the LSTM model ensures that the extracted abstracts are more accurate.
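  • A non-limiting editorial sketch of this try-and-select procedure is given below; the training routine `build_and_train` and the data sets are placeholders for the actual training code and are not part of the application:

```python
def select_best_model(hidden_sizes, build_and_train, train_data, val_data):
    """Train one candidate model per hidden-layer configuration and keep the most accurate."""
    best_model, best_acc = None, -1.0
    for size in hidden_sizes:
        model, acc = build_and_train(size, train_data, val_data)  # returns (model, accuracy)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model
```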
  • the step S102 includes the following sub-steps:
  • S1021: Acquire the word with the highest probability in the sequence of hidden states, and use that word as the initial word of the abstract word sequence;
  • S1022: Input each character of the initial word into the second-layer LSTM structure and combine it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and take the word with the highest probability in the combined sequence as the sequence of hidden states;
  • S1023: Repeat the step of inputting each character of the sequence of hidden states into the second-layer LSTM structure, combining it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and taking the word with the highest probability in the combined sequence as the sequence of hidden states, until it is detected that each character in the sequence of hidden states is combined with the terminator in the vocabulary; then stop, and use the sequence of hidden states as the word sequence of the abstract.
  • The above process, namely the Beam Search (beam search) algorithm, is one of the methods for decoding the sequence of hidden states. Its specific process is as follows: 1) take the word with the highest probability in the sequence of hidden states as the initial word of the abstract word sequence; 2) combine each character of the initial word with the characters in the vocabulary to obtain a first combined sequence, and take the word with the highest probability in it as the first updated sequence; repeat this process until each character in the sequence of hidden states is combined with the terminator in the vocabulary, and finally output the word sequence of the abstract.
  • The Beam Search algorithm is only needed during actual use (i.e., at test time) and is not needed during training: during training the correct answer is known, so this search is unnecessary.
  • In actual use, suppose the vocabulary size is 3 with content a, b, c, and the number of sequences finally output by the beam search algorithm (the beam size, which indicates the number of finally output sequences) is 2. When the decoder (the second-layer LSTM structure can be regarded as the decoder) decodes, the 2 most probable words are selected when generating the first word, say a and c, so the current sequences are a and c; when generating the second word, the current sequences a and c are each combined with all the words in the vocabulary to obtain 6 new sequences aa, ab, ac, ca, cb, cc, from which the 2 highest-scoring ones are selected as the current sequences, say aa and cb; this process is repeated until each character in the sequence of hidden states is combined with the terminator in the vocabulary, and finally the 2 highest-scoring sequences are output.
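  • The following is a non-limiting editorial sketch of such a beam search; the `step_fn` callable, which returns a next-word probability table for a given prefix, stands in for the second-layer LSTM decoder and is an assumption:

```python
import math

def beam_search(step_fn, vocab, beam_size=2, max_len=10, eos="</s>"):
    """Keep the beam_size highest-scoring sequences, expanding each with every vocabulary word."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:          # finished sequences are kept as-is
                candidates.append((seq, score))
                continue
            probs = step_fn(seq)                # probability of each next word given the prefix
            for word in vocab + [eos]:
                p = max(probs.get(word, 0.0), 1e-12)
                candidates.append((seq + [word], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams
```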
  • In an embodiment, when the sequence of hidden states is input into the second-layer LSTM structure of the LSTM model for decoding to obtain the word sequence of the abstract, the word sequence of the abstract is a multinomial distribution layer of the same size as the vocabulary, and a vector y_t ∈ R^K is output, where the k-th dimension of y_t represents the probability of generating the k-th word, t is a positive integer, and K is the size of the vocabulary corresponding to the historical texts.
  • Specifically, an end flag is set for the target text x_t (such as the period at the end of the text), and one word of the target text is input into the first-layer LSTM structure at a time; when the end of the target text x_t is reached, the sequence of hidden states obtained by encoding the target text x_t (i.e., the hidden state vector) is used as the input of the second-layer LSTM structure for decoding.
  • The second-layer LSTM structure outputs a softmax layer of the same size as the vocabulary (the softmax layer is the multinomial distribution layer); each component of the softmax layer represents the probability of a word. When the output layer of the LSTM is a softmax, the output at each time step produces a vector y_t ∈ R^K, where K is the size of the vocabulary and the k-th dimension of the y_t vector represents the probability of generating the k-th word.
  • Representing the probability of each word in the word sequence of the abstract by a vector makes it easier to use as a reference input for the next round of data processing.
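  • As a non-limiting editorial illustration of the softmax output described above (the vocabulary and scores below are made up):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

vocab = ["央行", "加息", "股价", "下跌", "</s>"]   # toy vocabulary of size K = 5
logits = np.array([2.1, 0.3, 1.5, 0.8, -1.0])       # decoder scores at one time step
y_t = softmax(logits)                                # y_t ∈ R^K; k-th entry = P(k-th word)
print(vocab[int(np.argmax(y_t))])                    # word with the highest probability
```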
  • S103: Input the word sequence of the abstract into the first-layer LSTM structure of the LSTM model for encoding, to obtain an updated sequence of hidden states.
  • In this embodiment, the word sequence of the abstract is input into the first-layer LSTM structure of the LSTM model for encoding so that it is processed a second time, in order to select the most likely words from the abstract word sequence as the constituent words of the abstract.
  • S104: According to the contribution values of the encoder hidden states in the updated sequence of hidden states, obtain the context vector corresponding to those contribution values.
  • In this embodiment, the contribution value of the encoder hidden states represents a weighted sum over all of the encoder's hidden states, where the highest weight corresponds to the hidden state that contributes the most and matters the most to the decoder when deciding the next word. In this way, a context vector that represents the abstract can be obtained more accurately.
  • For example, the updated sequence of hidden states is converted into feature vectors a = {a_1, a_2, ..., a_L}; the context vector is then Z_t = Σ_{i=1}^{L} a_{t,i}·a_i, where a_{t,i} is the weight of the feature vector at the i-th position when the t-th word is generated, and L is the number of characters in the updated sequence of hidden states.
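  • A non-limiting editorial sketch of such a context vector is given below; the dot-product scoring between the decoder state and each encoder hidden state is an assumption, since the application only specifies that Z_t is a weighted sum of the encoder hidden states:

```python
import numpy as np

def context_vector(decoder_state, encoder_states):
    """Compute Z_t = sum_i a_{t,i} * a_i with softmax-normalized weights a_{t,i}."""
    scores = encoder_states @ decoder_state        # one relevance score per position i
    a_t = np.exp(scores - scores.max())
    a_t /= a_t.sum()                               # weights a_{t,i}, summing to 1
    return a_t @ encoder_states                    # weighted sum of the hidden states
```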
  • S105: According to the updated sequence of hidden states and the context vector, obtain the probability distribution of the words in the updated sequence of hidden states, and output the word with the highest probability in that distribution as the abstract of the target text.
  • In this embodiment, each paragraph of the target text is processed, each paragraph is summarized through the above steps, and the results are finally combined into a complete abstract (a sketch of this paragraph-wise processing is given below).
  • It can be seen that this method encodes and decodes the target text with an LSTM and combines the result with context variables to obtain the abstract of the target text; the abstract is obtained in a summarizing manner, which improves the accuracy of abstract extraction.
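  • As a non-limiting editorial sketch of the paragraph-by-paragraph processing just described (the `summarize_paragraph` function stands in for steps S101 to S105 and is an assumption):

```python
def summarize_text(text, summarize_paragraph):
    """Split the target text into natural paragraphs, summarize each one, and join the results."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return "".join(summarize_paragraph(p) for p in paragraphs)
```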
  • An embodiment of the present application further provides an automatic abstract extraction apparatus for performing any of the foregoing automatic abstract extraction methods.
  • FIG. 4 is a schematic block diagram of an automatic extracting apparatus according to an embodiment of the present application.
  • the abstract automatic extraction device 100 can be installed in a desktop computer, a tablet computer, a laptop computer, or the like.
  • the abstract automatic extracting apparatus 100 includes a first input unit 101, a second input unit 102, a third input unit 103, a context vector obtaining unit 104, and a digest obtaining unit 105.
  • the first input unit 101 is configured to sequentially acquire characters included in the target text, and sequentially input the characters into the first layer LSTM structure in the LSTM model to obtain a sequence consisting of an implicit state; wherein the LSTM model is a long and short memory. Neural Networks.
  • the characters included in the target text are first obtained by word segmentation, and the obtained characters are Chinese characters or English characters.
  • the target text is split into a plurality of characters. For example, when segmenting a Chinese article, the following steps are taken:
  • After the characters included in the target text have been acquired in order and input into the LSTM model trained on historical data, the words that can constitute the abstract can be distilled from the multiple word segments to form the final abstract.
  • the above-mentioned word segmentation processing may be performed in units of natural segments, the key sentences of the current natural segment are extracted, and finally the key sentences of each segment are combined to form a digest (this word segmentation processing is preferred in the present application).
  • the above word segmentation process may be directly performed on a whole article, and multiple keywords may be extracted and combined into a summary.
  • the LSTM model is input for processing.
  • the LSTM model is a long and short memory neural network.
  • the full name of LSTM is Long Short-Term Memory, which is a time recurrent neural network.
  • LSTM is suitable for processing and predicting important events with very long intervals and delays in time series.
  • the LSTM model can encode the characters included in the target text, and perform pre-processing of the abstract extraction of the text.
  • the key to LSTM is the Cell State, which can be thought of as a horizontal line across the top of the entire cell.
  • the cell state is similar to a conveyor belt, which passes directly through the entire chain, with only a few small linear interactions.
  • the information carried on the cell state can easily flow without changing.
  • the LSTM has the ability to add or delete information to the cell state.
  • This capability is controlled by gate structures, i.e., a gate can selectively let information through; a gate structure consists of a Sigmoid neural network layer and an element-wise multiplication operation.
  • the Sigmoid layer outputs values between 0 and 1, each value indicating whether the corresponding partial information should pass. A value of 0 means that information is not allowed to pass, and a value of 1 means that all information is passed.
  • An LSTM has three gates to protect and control the state of the cell.
  • The LSTM includes at least three gates, as follows:
  • 1) the forget gate, which determines how much of the unit state from the previous moment is retained at the current moment; 2) the input gate, which determines how much of the network input at the current moment is saved into the unit state; 3) the output gate, which determines how much of the unit state is output as the current output value of the LSTM.
  • In an embodiment, the LSTM model is a gated recurrent unit (GRU), and the model of the gated recurrent unit is as follows: z_t = σ(W_z·[h_{t-1}, x_t]); r_t = σ(W_r·[h_{t-1}, x_t]); h̃_t = tanh(W·[r_t ⊙ h_{t-1}, x_t]); h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t.
  • Here W_z, W_r, and W are weight parameter values obtained by training, x_t is the input, h_{t-1} is the hidden state, z_t is the update state, r_t is the reset signal, h̃_t is the new memory corresponding to the hidden state h_{t-1}, h_t is the output, σ() is the sigmoid function, and tanh() is the hyperbolic tangent function.
  • the characters included in the target text are encoded by the first layer LSTM structure, and converted into a sequence consisting of hidden states. After continuing decoding, the sequence after the initial processing can be obtained, and the precise extraction of the word segments to be selected is realized.
  • the automatic digest device 100 further includes:
  • the historical data training unit 101a puts a plurality of historical texts in the corpus into the first layer LSTM structure, and puts the abstracts corresponding to the historical text into the second layer LSTM structure, and performs training to obtain the LSTM model.
  • the overall framework of the LSTM model is fixed. You only need to set the parameters of each layer such as input layer, hidden layer and output layer to get the model. The parameters of each layer such as input layer, hidden layer and output layer can be tested. Get the optimal parameter values multiple times. For example, if there are 10 nodes in the hidden layer node, and the value of each node can be taken from 1 to 10, then 100 combinations will be tried to get 100 training models, and then the 100 models will be trained with a large amount of data, according to the accuracy. Rate to obtain an optimal training model.
  • the parameters such as the node value corresponding to the optimal training model are the optimal parameters (it can be understood that W z , W r , W in the above GRU model is the optimal here). parameter). Applying the optimal training model to the scheme as the LSTM model ensures that the extracted abstracts are more accurate.
  • the second input unit 102 is configured to input a sequence consisting of an implicit state into a second layer LSTM structure in the LSTM model for decoding, to obtain a sequence of words of the digest.
  • the second input unit 102 includes the following subunits:
  • the initializing unit 1021 is configured to obtain a word with the highest probability among the sequences composed of the hidden states, and use the word with the highest probability among the sequences composed of the hidden states as the initial word in the word sequence of the digest;
  • the updating unit 1022 is configured to input each character of the initial word into the second-layer LSTM structure and combine it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and to take the word with the highest probability in the combined sequence as the sequence of hidden states;
  • the repeated execution unit 1023 is configured to repeat the step of inputting each character of the sequence of hidden states into the second-layer LSTM structure, combining it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and taking the word with the highest probability in the combined sequence as the sequence of hidden states, until it is detected that each character in the sequence of hidden states is combined with the terminator in the vocabulary; it then stops, and the sequence of hidden states is used as the word sequence of the abstract.
  • the above process that is, the Beam Search algorithm (Beam Search algorithm, that is, the cluster search algorithm), is one of methods for decoding a sequence consisting of an implicit state, and the specific process is as follows:
  • the Beam Search algorithm is only needed during actual use (ie during the test process) and is not needed during training. When you are training, you don't need to do this search because you know the correct answer.
  • the vocabulary size is 3, and the content is a, b, c.
  • the beam search algorithm finally outputs the number of sequences (the available size indicates the final output sequence number) is 2, and the decoder (the second layer LSTM structure can be regarded as the decoder decoder) is decoded:
  • the sequence of words of the summary is output, and a complete summary text is not yet formed. In order to form a complete summary of the word sequence of the abstract, further processing is required.
  • the sequence consisting of the implicit state is input to the second layer LSTM structure in the LSTM model for decoding, and the word sequence of the digest is a polynomial of the same size as the vocabulary.
  • the target text x t is set to an end flag (such as the period at the end of the text), and one word in the target text is input to the first layer LSTM structure each time, and when the end of the target text x t is reached, the target text x is represented.
  • the sequence consisting of the implicit state obtained by t coding ie, the hidden state vector
  • the second layer LSTM structure outputs the softmax layer (softmax layer or polynomial distribution layer) having the same size as the vocabulary.
  • the component in the softmax layer represents the probability of each word; when the output layer of the LSTM is softmax, the output of each moment produces a vector y t ⁇ R K , where K is the size of the vocabulary, and the kth in the y t vector The dimension represents the probability of generating the kth word.
  • the probability of each word in the word sequence of the abstract is represented by a vector, which is more conducive to its reference as the input of the next data processing.
  • the third input unit 103 is configured to input the word sequence of the digest into the first layer LSTM structure in the LSTM model to obtain a sequence consisting of the updated implied state.
  • In this embodiment, the word sequence of the abstract is input into the first-layer LSTM structure of the LSTM model for encoding so that it is processed a second time, in order to select the most likely words from the abstract word sequence as the constituent words of the abstract.
  • the context vector obtaining unit 104 is configured to obtain a context vector corresponding to the contribution value of the hidden state of the encoder according to the contribution value of the encoder hidden state in the sequence composed of the updated implicit state.
  • In this embodiment, the contribution value of the encoder hidden states represents a weighted sum over all of the encoder's hidden states, where the highest weight corresponds to the hidden state that contributes the most and matters the most to the decoder when deciding the next word. In this way, a context vector that represents the abstract can be obtained more accurately.
  • For example, the updated sequence of hidden states is converted into feature vectors a = {a_1, a_2, ..., a_L}; the context vector is then Z_t = Σ_{i=1}^{L} a_{t,i}·a_i, where a_{t,i} is the weight of the feature vector at the i-th position when the t-th word is generated, and L is the number of characters in the updated sequence of hidden states.
  • the summary obtaining unit 105 is configured to obtain a probability distribution of the words in the sequence consisting of the updated implicit state according to the sequence and the context vector composed of the updated implied state, and output the word with the highest probability in the probability distribution of the word as A summary of the target text.
  • each piece of text of the target text is processed, and each paragraph is summarized by the above steps, and finally combined into a completed summary.
  • the device uses LSTM to encode and decode the target text, and combines the context variables to obtain a summary of the target text, and obtains a summary in a general manner to improve the accuracy of the acquisition.
  • the above abstract automatic extraction device can be implemented in the form of a computer program that can be run on a computer device as shown in FIG.
  • FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 device can be a terminal.
  • the terminal can be an electronic device such as a tablet computer, a notebook computer, a desktop computer, or a personal digital assistant.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected by a system bus 501, wherein the memory can include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform an automatic digest extraction method.
  • the processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • the computer program 5032 can cause the processor 502 to perform an automatic digest extraction method.
  • The network interface 505 is used for network communication, such as sending assigned tasks. It will be understood by those skilled in the art that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
  • The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: sequentially acquiring the characters included in the target text, and inputting the characters in order into the first-layer LSTM structure of the LSTM model for encoding, to obtain a sequence of hidden states, where the LSTM model is a long short-term memory neural network; inputting the sequence of hidden states into the second-layer LSTM structure of the LSTM model for decoding, to obtain the word sequence of the abstract; inputting the word sequence of the abstract into the first-layer LSTM structure of the LSTM model for encoding, to obtain an updated sequence of hidden states; obtaining, according to the contribution values of the encoder hidden states in the updated sequence of hidden states, the context vector corresponding to those contribution values; and obtaining, according to the updated sequence of hidden states and the context vector, the probability distribution of the words in the updated sequence of hidden states, and outputting the word with the highest probability in that distribution as the abstract of the target text.
  • the processor 502 further performs the following operations: placing a plurality of historical texts in the corpus into the first layer LSTM structure, and placing the abstracts corresponding to the historical text into the second layer LSTM structure, and training to obtain the LSTM model. .
  • In an embodiment, the LSTM model is a gated recurrent unit (GRU), and the model of the gated recurrent unit is as follows: z_t = σ(W_z·[h_{t-1}, x_t]); r_t = σ(W_r·[h_{t-1}, x_t]); h̃_t = tanh(W·[r_t ⊙ h_{t-1}, x_t]); h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t.
  • Here W_z, W_r, and W are weight parameter values obtained by training, x_t is the input, h_{t-1} is the hidden state, z_t is the update state, r_t is the reset signal, h̃_t is the new memory corresponding to the hidden state h_{t-1}, h_t is the output, σ() is the sigmoid function, and tanh() is the hyperbolic tangent function.
  • the word sequence of the digest is a polynomial distribution layer of the same size as the vocabulary, and the vector y t ⁇ R K is output; wherein the kth dimension in y t represents the probability of generating the kth word, The value of t is a positive integer, and K is the size of the vocabulary corresponding to the historical text.
  • In an embodiment, the processor 502 further performs the following operations: acquiring the word with the highest probability in the sequence of hidden states, and using that word as the initial word of the abstract word sequence; inputting each character of the initial word into the second-layer LSTM structure and combining it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and taking the word with the highest probability in the combined sequence as the sequence of hidden states; and repeating the step of inputting each character of the sequence of hidden states into the second-layer LSTM structure, combining it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and taking the word with the highest probability in the combined sequence as the sequence of hidden states, until it is detected that each character in the sequence of hidden states is combined with the terminator in the vocabulary; then stopping, and using the sequence of hidden states as the word sequence of the abstract.
  • It will be understood by those skilled in the art that the embodiment of the computer device shown in FIG. 7 does not constitute a limitation on the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than illustrated, combine certain components, or have a different arrangement of components.
  • For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 7 and are not repeated here.
  • the processor 502 may be a central processing unit (CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc.
  • the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • In another embodiment of the present application, a storage medium is provided. The storage medium may be a non-transitory computer-readable storage medium.
  • The storage medium stores a computer program, where the computer program includes program instructions; when the program instructions are executed by a processor, the automatic abstract extraction method of the embodiments of the present application is implemented.
  • the storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device.
  • the storage medium may also be an external storage device of the device, such as a plug-in hard disk equipped on the device, a smart memory card (SMC), a secure digital (SD) card, and a flash memory card. (Flash Card), etc.
  • the storage medium may also include both an internal storage unit of the device and an external storage device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses an automatic abstract extraction method, apparatus, computer device, and storage medium. The method includes: sequentially acquiring the characters of a target text and inputting them in order into the first-layer LSTM structure of an LSTM model for encoding, to obtain a sequence of hidden states; inputting the sequence of hidden states into the second-layer LSTM structure of the LSTM model for decoding, to obtain the word sequence of the abstract; inputting the word sequence of the abstract into the first-layer LSTM structure for encoding, to obtain an updated sequence of hidden states; obtaining a context vector according to the contribution values of the encoder hidden states in the updated sequence of hidden states, obtaining the probability distribution of the corresponding words, and taking the word with the highest probability as the abstract of the target text. After encoding and decoding the target text with an LSTM, the method combines the result with context variables to obtain the abstract of the target text; the abstract is obtained in a summarizing manner, which improves extraction accuracy.

Description

文摘自动提取方法、装置、计算机设备及存储介质
本申请要求于2018年3月8日提交中国专利局、申请号为201810191506.3、申请名称为“文摘自动提取方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及文摘提取技术领域,尤其涉及一种文摘自动提取方法、装置、计算机设备及存储介质。
背景技术
目前,对文章概括文摘时,采用的是基于抽取式的方法。抽取式文摘是提取文章中最有代表性的关键句作为该文章的文摘。具体如下:
1)首先,对文章进行分词,去停用词,获得的组成文章的基本词组。
2)然后,根据计算词频获取高频词,并把高频词所在的句子作为关键句。
3)最后,指定若干数量的关键句即可组合成文摘。
上述抽取式方法比较适用于新闻、议论文等在文中往往出现总结性长句子的文体。例如财经文章,高频词往往是“现金”、“股票”、“央行”、“利息”等,抽取结果就往往是“央行加息导致股价下跌,现金为上已成股民众识”之类的长句子。抽取式方法有很大的局限性,如果处理的文本中缺失代表性的“关键句”,那抽取结果很可能毫无意义,尤其是对话类的文本。
发明内容
本申请提供了一种文摘自动提取方法、装置、计算机设备及存储介质,旨在解决现有技术中采用抽取式方法提取文章中的文摘仅适用于新闻、议论文等在文中出现总结性长句子的文体,对无关键句的文本提取摘要提取结果不准确的问题。
第一方面,本申请提供了一种文摘自动提取方法,其包括:依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进 行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络;将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列;将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列;根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量;根据更新后隐含状态组成的序列及上下文向量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
第二方面,本申请提供了一种文摘自动提取装置,其包括:
第一输入单元,用于依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络;
第二输入单元,用于将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列;
第三输入单元,用于将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列;
上下文向量获取单元,用于根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量;
摘要获取单元,用于根据更新后隐含状态组成的序列及上下文向量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
第三方面,本申请又提供了一种计算机设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现本申请提供的任一项所述的文摘自动提取方法。
第四方面,本申请还提供了一种存储介质,其中所述存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行本申请提供的任一项所述的文摘自动提取方法。
本申请提供一种文摘自动提取方法、装置、计算机设备及存储介质。该方法采用LSTM模型对目标文本进行编码和解码后,并结合上下文变量,得到目标文本的摘要,采取了概括的方式来总结获取目标文本的摘要,提高了文摘获 取的准确性。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种文摘自动提取方法的示意流程图;
图2为本申请实施例提供的一种文摘自动提取方法的另一示意流程图;
图3是本申请实施例提供的一种文摘自动提取方法的子流程示意图;
图4为本申请实施例提供的一种文摘自动提取装置的示意性框图;
图5为本申请实施例提供的一种文摘自动提取装置的另一示意性框图;
图6为本申请实施例提供的一种文摘自动提取装置的子单元示意性框图;
图7为本申请实施例提供的一种计算机设备的示意性框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
应当理解,当在本说明书和所附权利要求书中使用时,术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本申请说明书中所使用的术语仅仅是出于描述特定实施例的目的而并不意在限制本申请。如在本申请说明书和所附权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。
还应当进一步理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
请参阅图1,图1是本申请实施例提供的一种文摘自动提取方法的示意流程图。该方法应用于台式电脑、手提电脑、平板电脑等终端中。如图1所示,该方法包括步骤S101~S105。
S101、依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络。
在本实施例中,先是通过分词来获取目标文本所包括的字符,所获取的字符为中文字符或英文字符,经过上述处理后将目标文本拆分成了多个字符。例如,对一篇中文文章进行分词时,采用如下步骤:
1)、对一个待分词的子串S,按照从左到右的顺序取出全部候选词w1,w2,…,wi,…,wn;
2)、到词典中查出每个候选词的概率值P(wi),并记录每个候选词的全部左邻词;
3)、计算每个候选词的累计概率,同时比较得到每个候选词的最佳左邻词;
4)、如果当前词wn是字串S的尾词,且累计概率P(wn)最大,则wn就是S的终点词;
5)、从wn开始,按照从右到左顺序,依次将每个词的最佳左邻词输出,即S的分词结果。
依序获取了目标文本所包括的字符后,将其按顺序输入至已根据历史数据训练得到的LSTM模型,就能从多个分词中提炼出能构成摘要的词语组成最终的文摘。具体处理时,可以是以自然段为单位进行上述分词处理,提取当前自然段的关键句,最后将每段的关键句组合形成摘要(本申请中优选这一分词处理方式)。也可以是直接以一整篇文章为单位进行上述分词处理,提取多个关键词后组合成摘要。
在获取了目标文本所包括的字符后,输入LSTM模型进行处理。LSTM模型即长短记忆神经网络,其中LSTM的全称是Long Short-Term Memory,是一种时间递归神经网络,LSTM适合于处理和预测时间序列中间隔和延迟非常长的重要事件。通过LSTM模型能目标文本所包括的字符进行编码,进行文本的摘要提取的前序处理。
为了更清楚的理解LSTM模型,下面对LSTM模型进行介绍。
LSTM的关键是元胞状态(Cell State),其可以视为横穿整个元胞顶部的水平线。元胞状态类似于传送带,它直接穿过整个链,同时只有一些较小的线性交互。元胞状态上承载的信息可以很容易地流过而不改变,LSTM有能力对元胞状态添加或者删除信息,上述能力通过门的结构来控制,即门可以选择性让信息通过,其中门结构是由一个Sigmoid神经网络层和一个元素级相乘操作组成。Sigmoid层输出0~1之间的值,每个值表示对应的部分信息是否应该通过。0值表示不允许信息通过,1值表示让所有信息通过。一个LSTM有3个门,来保护和控制元胞状态。
LSTM中至少包括三个门,分别如下:
1)遗忘门,其决定了上一时刻的单元状态有多少保留到当前时刻;
2)输入门,其决定了当前时刻网络的输入有多少保存到单元状态;
3)输入门,其决定了单元状态有多少输出到LSTM的当前输出值。
在一实施例中,所述LSTM模型为门限循环单元,所述门限循环单元的模型如下:
z t=σ(W z·[h t-1,x t])
r t=σ(W r·[h t-1,x t])
h̃ t=tanh(W·[r t⊙h t-1,x t])
h t=(1-z t)⊙h t-1+z t⊙h̃ t
其中，W z、W r、W是训练得到的权值参数值，x t是输入，h t-1是隐含状态，z t是更新状态，r t是重置信号，h̃ t是与隐含状态h t-1对应的新记忆，h t是输出，σ()是sigmoid函数，tanh()是双曲正切函数。
目标文本所包括的字符通过了第一层LSTM结构进行编码,就转化成隐含状态组成的序列,对其继续进行解码就能获取初次处理后的序列,实现了对待选分词的精准提取。
在一实施例中,如图2所示,所述步骤S101之前还包括:
S101a、将语料库中的多篇历史文本置入第一层LSTM结构,并将历史文本对应的文摘置入第二层LSTM结构,进行训练得到LSTM模型。
LSTM模型的整体框架是固定的,只需要设置其输入层、隐藏层、输出层等各层的参数,就可以得到模型,其中设置输入层、隐藏层、输出层等各层的参 数可以通过实验多次来得到最优的参数值。譬如,隐藏层节点有10个节点,那每个节点的数值可以从1取到10,那么就会尝试100种组合来得到100个训练模型,然后用大量数据去训练这100个模型,根据准确率等来得到一个最优的训练模型,这个最优的训练模型对应的节点值等参数就是最优参数(可以理解为上述GRU模型中的W z、W r、W就为此处的最优参数)。用最优的训练模型来应用到本方案中作为LSTM模型,这样能确保所提取的文摘更为准确。
S102、将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列。
如图3所示,该步骤S102包括以下子步骤:
S1021、获取隐含状态组成的序列中概率最大的词,将隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;
S1022、将初始位词语中的每个字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列;
S1023、重复执行隐含状态组成的序列中每一字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列的步骤,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,并将隐含状态组成的序列作为摘要的字词序列。
在本实施例中,上述过程也即Beam Search算法(Beam Search算法即集束搜索算法),是用于解码隐含状态组成的序列的方法之一,其具体过程如下:
1)获取隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;2)将初始位词语中的每个字与词表中的字进行组合得到第一次组合后序列,获取第一次组合后序列中概率最大的词作第一次更新后序列;重复上述过程直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,最终输出摘要的字词序列。
Beam Search算法只在实际使用过程中(即test过程中)的时候需要,在训练过程中并不需要。训练的时候由于知道正确答案,并不需要再进行这个搜索。而在实际使用的时候,假设词表大小为3,内容为a,b,c。beam search算法最终输出序列个数(可用size表示最终输出序列个数)是2,decoder(第二层LSTM 结构可以视为解码器decoder)解码的时候:
生成第1个词的时候,选择概率最大的2个词,假设为a,c,那么当前序列就是a c;生成第2个词的时候,我们将当前序列a和c,分别与词表中的所有词进行组合,得到新的6个序列aa、ab、ac、ca、cb、cc,然后从其中选择2个得分最高的作为当前序列,假如为aa cb;后面会不断重复这个过程,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,最终输出2个得分最高的序列。将目标文本经过编码和解码后输出摘要的字词序列,此时还未组成一段完整的摘要文字。为了将摘要的字词序列组成一段完整的摘要,需要进行进一步的处理。
在一实施例中,将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列中,所述摘要的字词序列为与词表大小相同的多项式分布层,并输出向量y t∈R K;其中y t中的第k维代表生成第k个词语的概率,t的取值为正整数,K为历史文本所对应词表的大小。
其中,将目标文本x t设置结束标志(如文本末尾的句号),每次将目标文本中的一个词输入到第一层LSTM结构,当到达目标文本x t的末尾时,则表示目标文本x t编码得到的隐含状态组成的序列(即hidden state vector)将作为第二层LSTM结构的输入进行解码,第二层LSTM结构输出与词表大小相同的softmax层(softmax层即多项式分布层),softmax层中的分量代表每个词语的概率;当LSTM的输出层为softmax时,每个时刻输出会产生向量y t∈R K,K即为词表的大小,y t向量中的第k维代表生成第k个词语的概率。通过向量来表示摘要的字词序列中每一词语的概率,更利于其作为下一次数据处理的输入的参考。
S103、将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列。
在本实施例中,将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,是为了二次进行处理,以从摘要的字词序列选取最有可能的字词作为摘要的组成词。
S104、根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量。
在本实施例中,编码器隐藏状态的贡献值代表了他的所有隐藏状态的加权 和,其中最高的权重对应了解码器在决定下一个词是考虑的增强隐藏状态的最大贡献以及最重要的隐藏状态。通过这一方式,能更准确的获取能代表文摘的上下文向量。
例如,将更新后隐含状态组成的序列转化为特征向量a,其中a={a 1,a 2,……,a L},则上下文向量Z t用下式表示:
Z_t=∑_{i=1}^{L} a_{t,i}·a_i
其中,a t,i就是衡量生成第t个词语时,第i个位置的特征向量所占的权重,L为更新后隐含状态组成的序列中字符的个数。
S105、根据更新后隐含状态组成的序列及上下文向量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
在本实施例中,对目标文本的每一段文字进行处理,每一段都通过上述步骤来概括摘要,最后组合成一个完成的摘要。
可见,该方法采用LSTM对目标文本进行编码解码后,结合上下文变量得到目标文本的摘要,采取概括方式获取摘要,提高获取准确性。
本申请实施例还提供一种文摘自动提取装置,该文摘自动提取装置用于执行前述任一项文摘自动提取方法。具体地,请参阅图4,图4是本申请实施例提供的一种文摘自动提取装置的示意性框图。文摘自动提取装置100可以安装于台式电脑、平板电脑、手提电脑、等终端中。
如图4所示,文摘自动提取装置100包括第一输入单元101、第二输入单元102、第三输入单元103、上下文向量获取单元104、摘要获取单元105。
第一输入单元101,用于依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络。
在本实施例中,先是通过分词来获取目标文本所包括的字符,所获取的字符为中文字符或英文字符,经过上述处理后将目标文本拆分成了多个字符。例如,对一篇中文文章进行分词时,采用如下步骤:
1)、对一个待分词的子串S,按照从左到右的顺序取出全部候选词w1,w2,…,wi,…·,wn;2)、到词典中查出每个候选词的概率值P(wi),并记录每个 候选词的全部左邻词;3)、计算每个候选词的累计概率,同时比较得到每个候选词的最佳左邻词;4)、如果当前词wn是字串S的尾词,且累计概率P(wn)最大,则wn就是S的终点词;5)、从wn开始,按照从右到左顺序,依次将每个词的最佳左邻词输出,即S的分词结果。
依序获取了目标文本所包括的字符后,将其按顺序输入至已根据历史数据训练得到的LSTM模型,就能从多个分词中提炼出能构成摘要的词语组成最终的文摘。具体处理时,可以是以自然段为单位进行上述分词处理,提取当前自然段的关键句,最后将每段的关键句组合形成摘要(本申请中优选这一分词处理方式)。也可以是直接以一整篇文章为单位进行上述分词处理,提取多个关键词后组合成摘要。
在获取了目标文本所包括的字符后,输入LSTM模型进行处理。LSTM模型即长短记忆神经网络,其中LSTM的全称是Long Short-Term Memory,是一种时间递归神经网络,LSTM适合于处理和预测时间序列中间隔和延迟非常长的重要事件。通过LSTM模型能目标文本所包括的字符进行编码,进行文本的摘要提取的前序处理。
为了更清楚的理解LSTM模型,下面对LSTM模型进行介绍。
LSTM的关键是元胞状态(Cell State),其可以视为横穿整个元胞顶部的水平线。元胞状态类似于传送带,它直接穿过整个链,同时只有一些较小的线性交互。元胞状态上承载的信息可以很容易地流过而不改变,LSTM有能力对元胞状态添加或者删除信息,上述能力通过门的结构来控制,即门可以选择性让信息通过,其中门结构是由一个Sigmoid神经网络层和一个元素级相乘操作组成。Sigmoid层输出0~1之间的值,每个值表示对应的部分信息是否应该通过。0值表示不允许信息通过,1值表示让所有信息通过。一个LSTM有3个门,来保护和控制元胞状态。
LSTM中至少包括三个门,分别如下:
1)遗忘门,其决定了上一时刻的单元状态有多少保留到当前时刻;2)输入门,其决定了当前时刻网络的输入有多少保存到单元状态;3)输入门,其决定了单元状态有多少输出到LSTM的当前输出值。
在一实施例中,所述LSTM模型为门限循环单元,所述门限循环单元的模型如下:
z t=σ(W z·[h t-1,x t])
r t=σ(W r·[h t-1,x t])
h̃ t=tanh(W·[r t⊙h t-1,x t])
h t=(1-z t)⊙h t-1+z t⊙h̃ t
其中，W z、W r、W是训练得到的权值参数值，x t是输入，h t-1是隐含状态，z t是更新状态，r t是重置信号，h̃ t是与隐含状态h t-1对应的新记忆，h t是输出，σ()是sigmoid函数，tanh()是双曲正切函数。
目标文本所包括的字符通过了第一层LSTM结构进行编码,就转化成隐含状态组成的序列,对其继续进行解码就能获取初次处理后的序列,实现了对待选分词的精准提取。
在一实施例中,如图5所示,所述文摘自动提取装置100还包括:
历史数据训练单元101a、将语料库中的多篇历史文本置入第一层LSTM结构,并将历史文本对应的文摘置入第二层LSTM结构,进行训练得到LSTM模型。
LSTM模型的整体框架是固定的,只需要设置其输入层、隐藏层、输出层等各层的参数,就可以得到模型,其中设置输入层、隐藏层、输出层等各层的参数可以通过实验多次来得到最优的参数值。譬如,隐藏层节点有10个节点,那每个节点的数值可以从1取到10,那么就会尝试100种组合来得到100个训练模型,然后用大量数据去训练这100个模型,根据准确率等来得到一个最优的训练模型,这个最优的训练模型对应的节点值等参数就是最优参数(可以理解为上述GRU模型中的W z、W r、W就为此处的最优参数)。用最优的训练模型来应用到本方案中作为LSTM模型,这样能确保所提取的文摘更为准确。
第二输入单元102,用于将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列。
如图6所示,所述第二输入单元102包括以下子单元:
初始化单元1021,用于获取隐含状态组成的序列中概率最大的词,将隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;
更新单元1022,用于将初始位词语中的每个字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序 列中概率最大的词作为隐含状态组成的序列;
重复执行单元1023,用于重复执行隐含状态组成的序列中每一字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列的步骤,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,并将隐含状态组成的序列作为摘要的字词序列。
在本实施例中,上述过程也即Beam Search算法(Beam Search算法即集束搜索算法),是用于解码隐含状态组成的序列的方法之一,其具体过程如下:
1)获取隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;2)将初始位词语中的每个字与词表中的字进行组合得到第一次组合后序列,获取第一次组合后序列中概率最大的词作第一次更新后序列;重复上述过程直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,最终输出摘要的字词序列。
Beam Search算法只在实际使用过程中(即test过程中)的时候需要,在训练过程中并不需要。训练的时候由于知道正确答案,并不需要再进行这个搜索。而在实际使用的时候,假设词表大小为3,内容为a,b,c。beam search算法最终输出序列个数(可用size表示最终输出序列个数)是2,decoder(第二层LSTM结构可以视为解码器decoder)解码的时候:
生成第1个词的时候,选择概率最大的2个词,假设为a,c,那么当前序列就是a c;生成第2个词的时候,我们将当前序列a和c,分别与词表中的所有词进行组合,得到新的6个序列aa、ab、ac、ca、cb、cc,然后从其中选择2个得分最高的作为当前序列,假如为aa cb;后面会不断重复这个过程,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,最终输出2个得分最高的序列。
将目标文本经过编码和解码后输出摘要的字词序列,此时还未组成一段完整的摘要文字。为了将摘要的字词序列组成一段完整的摘要,需要进行进一步的处理。
在一实施例中,将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列中,所述摘要的字词序列为与词表大小相同的多项式分布层,并输出向量y t∈R K;其中y t中的第k维代表生成第k个词 语的概率,t的取值为正整数,K为历史文本所对应词表的大小。
其中,将目标文本x t设置结束标志(如文本末尾的句号),每次将目标文本中的一个词输入到第一层LSTM结构,当到达目标文本x t的末尾时,则表示目标文本x t编码得到的隐含状态组成的序列(即hidden state vector)将作为第二层LSTM结构的输入进行解码,第二层LSTM结构输出与词表大小相同的softmax层(softmax层即多项式分布层),softmax层中的分量代表每个词语的概率;当LSTM的输出层为softmax时,每个时刻输出会产生向量y t∈R K,K即为词表的大小,y t向量中的第k维代表生成第k个词语的概率。通过向量来表示摘要的字词序列中每一词语的概率,更利于其作为下一次数据处理的输入的参考。
第三输入单元103,用于将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列。
在本实施例中,将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,是为了二次进行处理,以从摘要的字词序列选取最有可能的字词作为摘要的组成词。
上下文向量获取单元104,用于根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量。
在本实施例中,编码器隐藏状态的贡献值代表了他的所有隐藏状态的加权和,其中最高的权重对应了解码器在决定下一个词是考虑的增强隐藏状态的最大贡献以及最重要的隐藏状态。通过这一方式,能更准确的获取能代表文摘的上下文向量。
例如,将更新后隐含状态组成的序列转化为特征向量a,其中a={a 1,a 2,……,a L},则上下文向量Z t用下式表示:
Z_t=∑_{i=1}^{L} a_{t,i}·a_i
其中,a t,i就是衡量生成第t个词语时,第i个位置的特征向量所占的权重,L为更新后隐含状态组成的序列中字符的个数。
摘要获取单元105,用于根据更新后隐含状态组成的序列及上下文向量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
在本实施例中,对目标文本的每一段文字进行处理,每一段都通过上述步骤来概括摘要,最后组合成一个完成的摘要。
可见,该装置采用LSTM对目标文本进行编码解码后,结合上下文变量得到目标文本的摘要,采取概括方式获取摘要,提高获取准确性。
上述文摘自动提取装置可以实现为一种计算机程序的形式,该计算机程序可以在如图7所示的计算机设备上运行。
请参阅图7,图7是本申请实施例提供的一种计算机设备的示意性框图。该计算机设备500设备可以是终端。该终端可以是平板电脑、笔记本电脑、台式电脑、个人数字助理等电子设备。
参阅图7,该计算机设备500包括通过系统总线501连接的处理器502、存储器和网络接口505,其中,存储器可以包括非易失性存储介质503和内存储器504。
该非易失性存储介质503可存储操作系统5031和计算机程序5032。该计算机程序5032包括程序指令,该程序指令被执行时,可使得处理器502执行一种文摘自动提取方法。该处理器502用于提供计算和控制能力,支撑整个计算机设备500的运行。该内存储器504为非易失性存储介质503中的计算机程序5032的运行提供环境,该计算机程序5032被处理器502执行时,可使得处理器502执行一种文摘自动提取方法。该网络接口505用于进行网络通信,如发送分配的任务等。本领域技术人员可以理解,图7中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备500的限定,具体的计算机设备500可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
其中,所述处理器502用于运行存储在存储器中的计算机程序5032,以实现如下功能:依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络;将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列;将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列;根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量;根据更新后隐含状态组成的序列及上下文向 量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
在一实施例中,处理器502还执行如下操作:将语料库中的多篇历史文本置入第一层LSTM结构,并将历史文本对应的文摘置入第二层LSTM结构,进行训练得到LSTM模型。
在一实施例中,所述LSTM模型为门限循环单元,所述门限循环单元的模型如下:
z t=σ(W z·[h t-1,x t])
r t=σ(W r·[h t-1,x t])
h̃ t=tanh(W·[r t⊙h t-1,x t])
h t=(1-z t)⊙h t-1+z t⊙h̃ t
其中，W z、W r、W是训练得到的权值参数值，x t是输入，h t-1是隐含状态，z t是更新状态，r t是重置信号，h̃ t是与隐含状态h t-1对应的新记忆，h t是输出，σ()是sigmoid函数，tanh()是双曲正切函数。
在一实施例中,所述摘要的字词序列为与词表大小相同的多项式分布层,并输出向量y t∈R K;其中y t中的第k维代表生成第k个词语的概率,t的取值为正整数,K为历史文本所对应词表的大小。
在一实施例中,处理器502还执行如下操作:获取隐含状态组成的序列中概率最大的词,将隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;将初始位词语中的每个字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列;重复执行隐含状态组成的序列中每一字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列的步骤,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,并将隐含状态组成的序列作为摘要的字词序列。
本领域技术人员可以理解,图7中示出的计算机设备的实施例并不构成对计算机设备具体构成的限定,在其他实施例中,计算机设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。例如,在一些实 施例中,计算机设备可以仅包括存储器及处理器,在这样的实施例中,存储器及处理器的结构及功能与图7所示实施例一致,在此不再赘述。
应当理解,在本申请实施例中,处理器502可以是中央处理单元(Central Processing Unit,CPU),该处理器502还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
在本申请的另一实施例中提供一种存储介质。该存储介质可以为非易失性的计算机可读存储介质。该存储介质存储有计算机程序,其中计算机程序包括程序指令。该程序指令被处理器执行时实现本申请实施例的文摘自动提取方法。
所述存储介质可以是前述设备的内部存储单元,例如设备的硬盘或内存。所述存储介质也可以是所述设备的外部存储设备,例如所述设备上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储介质还可以既包括所述设备的内部存储单元也包括外部存储设备。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的设备、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种文摘自动提取方法,其特征在于,包括:
    依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络;
    将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列;
    将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列;
    根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量;
    根据更新后隐含状态组成的序列及上下文向量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
  2. 根据权利要求1所述的文摘自动提取方法,其特征在于,所述依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列之前,还包括:
    将语料库中的多篇历史文本置入第一层LSTM结构,并将历史文本对应的文摘置入第二层LSTM结构,进行训练得到LSTM模型。
  3. 根据权利要求1所述的文摘自动提取方法,其特征在于,所述LSTM模型为门限循环单元,所述门限循环单元的模型如下:
    z t=σ(W z·[h t-1,x t])
    r t=σ(W r·[h t-1,x t])
    h̃ t=tanh(W·[r t⊙h t-1,x t])
    h t=(1-z t)⊙h t-1+z t⊙h̃ t
    其中，W z、W r、W是训练得到的权值参数值，x t是输入，h t-1是隐含状态，z t是更新状态，r t是重置信号，h̃ t是与隐含状态h t-1对应的新记忆，h t是输出，σ()是sigmoid函数，tanh()是双曲正切函数。
  4. 根据权利要求3所述的文摘自动提取方法,其特征在于,所述将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列中,所述摘要的字词序列为与词表大小相同的多项式分布层,并输出向量y t∈R K;其中y t中的第k维代表生成第k个词语的概率,t的取值为正整数,K为历史文本所对应词表的大小。
  5. 根据权利要求2所述的文摘自动提取方法,其特征在于,所述将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列,包括:
    获取隐含状态组成的序列中概率最大的词,将隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;
    将初始位词语中的每个字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列;
    重复执行隐含状态组成的序列中每一字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列的步骤,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,并将隐含状态组成的序列作为摘要的字词序列。
  6. 一种文摘自动提取装置,其特征在于,包括:
    第一输入单元,用于依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络;
    第二输入单元,用于将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列;
    第三输入单元,用于将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列;
    上下文向量获取单元,用于根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量;
    摘要获取单元,用于根据更新后隐含状态组成的序列及上下文向量,获取 更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
  7. 根据权利要求6所述的文摘自动提取装置,其特征在于,还包括:
    历史数据训练单元,用于将语料库中的多篇历史文本置入第一层LSTM结构,并将历史文本对应的文摘置入第二层LSTM结构,进行训练得到LSTM模型。
  8. 根据权利要求7所述的文摘自动提取装置,其特征在于,所述第二输入单元,包括:
    初始化单元,用于获取隐含状态组成的序列中概率最大的词,将隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;
    更新单元,用于将初始位词语中的每个字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列;
    重复执行单元,用于重复执行隐含状态组成的序列中每一字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列的步骤,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,并将隐含状态组成的序列作为摘要的字词序列。
  9. 根据权利要求6所述的文摘自动提取装置,其特征在于,所述LSTM模型为门限循环单元,所述门限循环单元的模型如下:
    z t=σ(W z·[h t-1,x t])
    r t=σ(W r·[h t-1,x t])
    h̃ t=tanh(W·[r t⊙h t-1,x t])
    h t=(1-z t)⊙h t-1+z t⊙h̃ t
    其中，W z、W r、W是训练得到的权值参数值，x t是输入，h t-1是隐含状态，z t是更新状态，r t是重置信号，h̃ t是与隐含状态h t-1对应的新记忆，h t是输出，σ()是sigmoid函数，tanh()是双曲正切函数。
  10. 根据权利要求9所述的文摘自动提取装置,其特征在于,所述将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的 字词序列中,所述摘要的字词序列为与词表大小相同的多项式分布层,并输出向量y t∈R K;其中y t中的第k维代表生成第k个词语的概率,t的取值为正整数,K为历史文本所对应词表的大小。
  11. 一种计算机设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现以下步骤:
    依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络;
    将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列;
    将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得到更新后隐含状态组成的序列;
    根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量;
    根据更新后隐含状态组成的序列及上下文向量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
  12. 根据权利要求11所述的计算机设备,其特征在于,所述依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列之前,还包括:
    将语料库中的多篇历史文本置入第一层LSTM结构,并将历史文本对应的文摘置入第二层LSTM结构,进行训练得到LSTM模型。
  13. 根据权利要求11所述的计算机设备,其特征在于,所述LSTM模型为门限循环单元,所述门限循环单元的模型如下:
    z t=σ(W z·[h t-1,x t])
    r t=σ(W r·[h t-1,x t])
    h̃ t=tanh(W·[r t⊙h t-1,x t])
    h t=(1-z t)⊙h t-1+z t⊙h̃ t
    其中，W z、W r、W是训练得到的权值参数值，x t是输入，h t-1是隐含状态，z t是更新状态，r t是重置信号，h̃ t是与隐含状态h t-1对应的新记忆，h t是输出，σ()是sigmoid函数，tanh()是双曲正切函数。
  14. 根据权利要求13所述的计算机设备,其特征在于,所述将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列中,所述摘要的字词序列为与词表大小相同的多项式分布层,并输出向量y t∈R K;其中y t中的第k维代表生成第k个词语的概率,t的取值为正整数,K为历史文本所对应词表的大小。
  15. 根据权利要求12所述的计算机设备,其特征在于,所述将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列,包括:
    获取隐含状态组成的序列中概率最大的词,将隐含状态组成的序列中概率最大的词作为摘要的字词序列中的初始位词语;
    将初始位词语中的每个字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列;
    重复执行隐含状态组成的序列中每一字输入至第二层LSTM结构,与第二层LSTM结构的词表中每一字进行组合得到组合后序列,获取组合后序列中概率最大的词作为隐含状态组成的序列的步骤,直至检测到隐含状态组成的序列中的每一字与词表中的终止符组合时停止,并将隐含状态组成的序列作为摘要的字词序列。
  16. 一种存储介质,其特征在于,所述存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行以下操作:
    依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列;其中LSTM模型为长短记忆神经网络;
    将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列;
    将摘要的字词序列输入至LSTM模型中的第一层LSTM结构进行编码,得 到更新后隐含状态组成的序列;
    根据更新后隐含状态组成的序列中编码器隐藏状态的贡献值,获取与编码器隐藏状态的贡献值相对应的上下文向量;
    根据更新后隐含状态组成的序列及上下文向量,获取更新后隐含状态组成的序列中字词的概率分布,将字词的概率分布中概率最大的字词输出作为目标文本的摘要。
  17. 根据权利要求16所述的存储介质,其特征在于,所述依序获取目标文本所包括的字符,将字符按顺序输入至LSTM模型中的第一层LSTM结构进行编码,得到隐含状态组成的序列之前,还包括:
    将语料库中的多篇历史文本置入第一层LSTM结构,并将历史文本对应的文摘置入第二层LSTM结构,进行训练得到LSTM模型。
  18. 根据权利要求16所述的存储介质,其特征在于,所述LSTM模型为门限循环单元,所述门限循环单元的模型如下:
    z t=σ(W z·[h t-1,x t])
    r t=σ(W r·[h t-1,x t])
    h̃ t=tanh(W·[r t⊙h t-1,x t])
    h t=(1-z t)⊙h t-1+z t⊙h̃ t
    其中，W z、W r、W是训练得到的权值参数值，x t是输入，h t-1是隐含状态，z t是更新状态，r t是重置信号，h̃ t是与隐含状态h t-1对应的新记忆，h t是输出，σ()是sigmoid函数，tanh()是双曲正切函数。
  19. 根据权利要求18所述的存储介质,其特征在于,所述将隐含状态组成的序列输入至LSTM模型中的第二层LSTM结构进行解码,得到摘要的字词序列中,所述摘要的字词序列为与词表大小相同的多项式分布层,并输出向量y t∈R K;其中y t中的第k维代表生成第k个词语的概率,t的取值为正整数,K为历史文本所对应词表的大小。
  20. The storage medium according to claim 17, wherein the inputting of the sequence composed of hidden states into the second-layer LSTM structure of the LSTM model for decoding to obtain the word sequence of the abstract comprises:
    obtaining the word with the highest probability in the sequence composed of hidden states, and taking the word with the highest probability in the sequence composed of hidden states as the initial word of the word sequence of the abstract;
    inputting each character of the initial word into the second-layer LSTM structure, combining it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and taking the word with the highest probability in the combined sequence as the sequence composed of hidden states; and
    repeating the step of inputting each character of the sequence composed of hidden states into the second-layer LSTM structure, combining it with each character in the vocabulary of the second-layer LSTM structure to obtain a combined sequence, and taking the word with the highest probability in the combined sequence as the sequence composed of hidden states, until it is detected that each character of the sequence composed of hidden states is combined with the terminator in the vocabulary, and taking the sequence composed of hidden states as the word sequence of the abstract.
PCT/CN2018/085249 2018-03-08 2018-05-02 Automatic text summarization method, apparatus, computer device, and storage medium WO2019169719A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2019557629A JP6955580B2 (ja) 2018-03-08 2018-05-02 Automatic document summary extraction method, apparatus, computer device, and storage medium
US16/645,491 US20200265192A1 (en) 2018-03-08 2018-05-02 Automatic text summarization method, apparatus, computer device, and storage medium
SG11202001628VA SG11202001628VA (en) 2018-03-08 2018-05-02 Automatic text summarization method, apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810191506.3 2018-03-08
CN201810191506.3A CN108509413A (zh) 2018-03-08 2018-03-08 Automatic text summarization method, apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2019169719A1 true WO2019169719A1 (zh) 2019-09-12

Family

ID=63377345

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/085249 WO2019169719A1 (zh) 2018-03-08 2018-05-02 Automatic text summarization method, apparatus, computer device, and storage medium

Country Status (5)

Country Link
US (1) US20200265192A1 (zh)
JP (1) JP6955580B2 (zh)
CN (1) CN108509413A (zh)
SG (1) SG11202001628VA (zh)
WO (1) WO2019169719A1 (zh)


Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6842167B2 (ja) * 2017-05-08 2021-03-17 国立研究開発法人情報通信研究機構 Summary generation device, summary generation method, and computer program
US11334612B2 (en) * 2018-02-06 2022-05-17 Microsoft Technology Licensing, Llc Multilevel representation learning for computer content quality
CN111507087B (zh) * 2018-05-31 2022-08-26 腾讯科技(深圳)有限公司 Message summary generation method and apparatus
CN111428516B (zh) 2018-11-19 2022-08-19 腾讯科技(深圳)有限公司 Information processing method and apparatus
CN109635302B (zh) * 2018-12-17 2022-06-10 北京百度网讯科技有限公司 Method and apparatus for training a text summary generation model
CN110032729A (zh) * 2019-02-13 2019-07-19 北京航空航天大学 Automatic summary generation method based on a neural Turing machine
WO2020227970A1 (en) * 2019-05-15 2020-11-19 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for generating abstractive text summarization
CN110210024B (zh) * 2019-05-28 2024-04-02 腾讯科技(深圳)有限公司 Information processing method, apparatus, and storage medium
WO2021042517A1 (zh) * 2019-09-02 2021-03-11 平安科技(深圳)有限公司 Artificial-intelligence-based article gist extraction method, apparatus, and storage medium
CN111460131A (zh) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Official document summary extraction method, apparatus, device, and computer-readable storage medium
US11593556B2 (en) * 2020-05-26 2023-02-28 Mastercard International Incorporated Methods and systems for generating domain-specific text summarizations
CN111797225B (zh) * 2020-06-16 2023-08-22 北京北大软件工程股份有限公司 Text summary generation method and apparatus
KR102539601B1 (ko) * 2020-12-03 2023-06-02 주식회사 포티투마루 Method and system for improving text summarization performance
KR102462758B1 (ko) * 2020-12-16 2022-11-02 숭실대학교 산학협력단 Document summarization method using noise-addition-based coverage and word association, and recording medium and apparatus for performing the same
CN113010666B (zh) * 2021-03-18 2023-12-08 京东科技控股股份有限公司 Summary generation method, apparatus, computer system, and readable storage medium
CN113268586A (zh) * 2021-05-21 2021-08-17 平安科技(深圳)有限公司 Text summary generation method, apparatus, device, and storage medium
CN113379032A (zh) * 2021-06-08 2021-09-10 全球能源互联网研究院有限公司 Hierarchical bidirectional LSTM-based sequence model training method and system


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102363369B1 (ko) * 2014-01-31 2022-02-15 구글 엘엘씨 Generating vector representations of documents
US10181098B2 (en) * 2014-06-06 2019-01-15 Google Llc Generating representations of input sequences using neural networks
JP6842167B2 (ja) * 2017-05-08 2021-03-17 国立研究開発法人情報通信研究機構 Summary generation device, summary generation method, and computer program
CN107526725B (zh) * 2017-09-04 2021-08-24 北京百度网讯科技有限公司 Artificial-intelligence-based method and apparatus for generating text
CN107783960B (zh) * 2017-10-23 2021-07-23 百度在线网络技术(北京)有限公司 Method, apparatus, and device for extracting information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106383817A (zh) * 2016-09-29 2017-02-08 北京理工大学 Paper title generation method using distributed semantic information
CN106598921A (zh) * 2016-12-12 2017-04-26 清华大学 Method and apparatus for converting modern text into classical poetry based on an LSTM model
CN106980683A (zh) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 Deep-learning-based blog text summary generation method
CN107484017A (zh) * 2017-07-25 2017-12-15 天津大学 Supervised video summary generation method based on an attention model

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737769A (zh) * 2019-10-21 2020-01-31 南京信息工程大学 Pre-trained text summary generation method based on neural topic memory
CN110737769B (zh) * 2019-10-21 2023-07-25 南京信息工程大学 Pre-trained text summary generation method based on neural topic memory
CN111178053A (zh) * 2019-12-30 2020-05-19 电子科技大学 Text generation method for abstractive summary extraction combining semantics and text structure
CN111199727A (zh) * 2020-01-09 2020-05-26 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal, and storage medium
CN113449096A (zh) * 2020-03-24 2021-09-28 北京沃东天骏信息技术有限公司 Method and apparatus for generating a text summary
EP3896595A1 (en) * 2020-04-17 2021-10-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Text key information extracting method, apparatus, electronic device, storage medium, and computer program product
KR20210129605A (ko) * 2020-04-17 2021-10-28 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Text key information extraction method, apparatus, electronic device, and recording medium
JP2021174540A (ja) 2020-04-17 2021-11-01 ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド Method, apparatus, electronic device, storage medium, and computer program for extracting core information from text
KR102521586B1 (ko) 2020-04-17 2023-04-12 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Text key information extraction method, apparatus, electronic device, and recording medium
JP7344926B2 (ja) 2020-04-17 2023-09-14 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method, apparatus, electronic device, storage medium, and computer program for extracting a text summary
CN112507188A (zh) * 2020-11-30 2021-03-16 北京百度网讯科技有限公司 Candidate search term generation method, apparatus, device, and medium
CN112507188B (zh) * 2020-11-30 2024-02-23 北京百度网讯科技有限公司 Candidate search term generation method, apparatus, device, and medium

Also Published As

Publication number Publication date
US20200265192A1 (en) 2020-08-20
SG11202001628VA (en) 2020-03-30
JP6955580B2 (ja) 2021-10-27
JP2020520492A (ja) 2020-07-09
CN108509413A (zh) 2018-09-07


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18909256

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019557629

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.12.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18909256

Country of ref document: EP

Kind code of ref document: A1