WO2021164231A1 - Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract - Google Patents

Info

Publication number
WO2021164231A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
abstract
extraction layer
official document
candidate
Application number
PCT/CN2020/112348
Other languages
English (en)
French (fr)
Inventor
郑立颖
徐亮
阮晓雯
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021164231A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34: Browsing; Visualisation therefor
    • G06F16/345: Summarisation for human users
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Definitions

  • This application relates to the technical field of data processing, and in particular to a method, apparatus, device, and computer-readable storage medium for extracting an official document abstract.
  • Abstract extraction technology can be used to extract the abstract of an official document.
  • Abstract extraction techniques fall into two main categories: extractive and generative. Extractive methods directly extract important sentences from the text, then sort and combine those sentences and output them as the final summary; generative methods refine and summarize the original content and may generate new words or sentences to form the summary.
  • The commonly used extractive method is TextRank, but the original TextRank method determines the importance of a sentence based only on sentence similarity and then extracts the sentences with high importance.
  • An official document differs from general text: sentence similarity alone cannot accurately represent the importance of a sentence in an official document, which leads to inaccurate abstracts. Therefore, how to improve the accuracy of extracting official document abstracts is a problem that urgently needs to be solved.
  • The present application provides a method for extracting an official document abstract.
  • The method for extracting an official document abstract includes the following steps:
  • obtaining a sentence set and a preset official document abstract extraction model, wherein the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer;
  • determining the abstract result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
  • This application also provides an official document abstract extraction device, the official document abstract extraction device including:
  • an acquisition module, used to acquire a sentence set and a preset official document abstract extraction model, wherein the sentence set includes several sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer;
  • a first extraction module, configured to call a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and to use the title sentences and key sentences as the first candidate abstract set;
  • a second extraction module, configured to concurrently call a preset second thread to calculate the importance value of each sentence in the sentence set based on the second abstract extraction layer, and to determine the second candidate abstract set according to the importance value of each sentence; and
  • an abstract determination module, configured to determine, based on the abstract fusion extraction layer, the abstract result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
  • The present application also provides a computer device that includes a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein, when the computer program is executed by the processor, the following steps are implemented:
  • obtaining a sentence set and a preset official document abstract extraction model, wherein the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer;
  • determining the abstract result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
  • This application also provides a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the following steps are implemented:
  • obtaining a sentence set and a preset official document abstract extraction model, wherein the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer;
  • determining the abstract result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
  • FIG. 1 is a schematic flowchart of a method for extracting an official document abstract according to an embodiment of the application;
  • FIG. 2 is a schematic diagram of a scenario in which the method for extracting an official document abstract provided by this embodiment is implemented;
  • FIG. 3 is a schematic flowchart of another method for extracting an official document abstract according to an embodiment of the application;
  • FIG. 4 is a schematic block diagram of an apparatus for extracting official document abstracts according to an embodiment of the application.
  • FIG. 5 is a schematic block diagram of another apparatus for extracting official document abstracts according to an embodiment of the application.
  • FIG. 6 is a schematic block diagram of the structure of a computer device related to an embodiment of the application.
  • The embodiments of the present application provide a method, apparatus, computer device, and computer-readable storage medium for extracting an official document abstract.
  • The method for extracting official document abstracts can be applied to a terminal device or a server.
  • The terminal device can be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, a wearable device, or another electronic device.
  • The server can be a single server or a server cluster composed of multiple servers. The following takes the application of the method for extracting official document abstracts to a server as an example for explanation.
  • FIG. 1 is a schematic flowchart of a method for extracting an official document abstract according to an embodiment of the application.
  • The method for extracting an official document abstract includes steps S101 to S104.
  • Step S101: Obtain a sentence set and a preset official document abstract extraction model, where the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer.
  • The server may obtain the sentence set from a database in which the sentence set is pre-stored, or from an external storage device that stores the sentence set.
  • The sentence set includes several sentences of the official document text.
  • The database may be a local database or a cloud database.
  • The external storage devices include plug-in hard disks, Secure Digital cards, flash memory cards, and the like, fitted to the computer device.
  • The server first obtains the official document text and then splits the official document text to obtain the sentence set of the official document text.
  • The official document text may be an electronic document that the server can read directly, such as a Word document, a txt document, or a WPS document.
  • When the official document text is an electronic document that cannot be read directly, the sentence set is obtained as follows: the official document text is obtained; the official document text is converted into a text image, and text recognition is performed on the converted text image; the recognized text is extracted from the text image and split to obtain the sentence set of the official document text.
  • Electronic documents that cannot be read directly include pdf documents and tif documents.
  • The format of the text image can be set according to the actual situation; for example, the JPEG or PNG image format can be selected.
  • When the official document is on paper, the sentence set is obtained as follows: the user's instruction to photograph the paper official document is obtained, the paper official document is photographed, and the photographed text image is sent to the server; text recognition is performed on the image, the recognized official document text is extracted, and the official document text is split to obtain the sentence set of the official document text.
  • The server can extract the sentences in a text image in real time, or it can store text images first and then extract their sentences in a batch.
  • The official document text is split into the sentence set as follows: text recognition is performed on the text image to obtain the recognized official document text; the sentence identifiers in the recognized text are located; and each sentence is extracted according to those identifiers to obtain the sentence set of the official document text.
  • A sentence identifier is a symbol that grammatically marks the end of a sentence, including the period, semicolon, question mark, exclamation mark, and line break.
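The splitting step can be illustrated with a short sketch. It assumes the recognized text is already available as a string; the function name `split_sentences` and the exact character class are hypothetical, and treating a line break as a sentence identifier is an assumption made for this example.

```python
import re

# Sentence identifiers: period, semicolon, question mark, exclamation mark
# (full-width and ASCII forms) and the line break. An ASCII period only ends
# a sentence when followed by whitespace or the end of the text, so that
# numbers such as "3.2" are not split.
SENTENCE_END = re.compile(r"[。；？！;?!\n]|\.(?=\s|$)")


def split_sentences(text: str) -> list[str]:
    """Split recognized document text into a list of sentences.

    The identifiers themselves are dropped and empty fragments are skipped.
    """
    parts = SENTENCE_END.split(text)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    sample = (
        "Notice on inspection\n"
        "All units must report by May 1. Late reports will be noted; "
        "see section 3.2 for details!"
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

The list order is the original sentence order, which the later position-based scoring relies on.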
  • After the sentence set is obtained, the preset official document abstract extraction model is obtained, where the sentence set includes a number of sentences determined according to the official document text to be extracted, and the preset official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer.
  • The first abstract extraction layer includes a title extraction sublayer, which is used to extract title sentences from the sentence set.
  • The second abstract extraction layer includes an importance calculation sublayer and a summary extraction sublayer.
  • The importance calculation sublayer is used to calculate the importance value of each sentence in the sentence set.
  • The summary extraction sublayer is used to determine, based on the importance value of each sentence, a candidate abstract set containing a preset number of sentences.
  • The abstract fusion extraction layer is used to extract the summary result set of the official document text.
  • Step S102: Invoke a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and use the title sentences and key sentences as the first candidate abstract set.
  • After obtaining the sentence set and the preset official document abstract extraction model, the server or terminal device calls the preset first thread, extracts the title sentences and key sentences from the sentence set based on the first abstract extraction layer of the model, and uses the extracted title sentences and key sentences as the first candidate abstract set.
  • The preset first thread is one execution flow within the calling process and is set according to the specific situation.
  • The first abstract extraction layer includes a title extraction sublayer and a key sentence extraction sublayer. The title extraction sublayer is used to extract title sentences from the sentence set, and the key sentence extraction sublayer is used to extract key sentences from the sentence set.
  • The title sentences and key sentences are extracted from the sentence set as follows: the preset first thread is called to extract the title sentences from the sentence set based on the regular expressions in the first abstract extraction layer; the keyword set corresponding to the official document type label of the sentence set is obtained from the first abstract extraction layer; and the key sentences that contain keywords from the keyword set are extracted from the sentence set.
  • The title sentences include the main title, the primary titles, and the secondary titles.
  • The official document type label is used to identify different official document categories.
  • The official document categories include "decision", "opinion", "notification", among others. It should be noted that the regular expressions can be set based on actual conditions, and this application does not specifically limit them.
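A minimal sketch of this first extraction layer follows. The text leaves the regular expressions and keyword sets open, so everything concrete here is hypothetical: the heading patterns, the keyword lists, the function name `extract_first_candidates`, and the choice to treat the first sentence as the main title.

```python
import re

# Hypothetical title patterns: numbered headings such as "一、..." (primary
# title) and "(一)..." (secondary title). Real patterns would be configured.
TITLE_PATTERNS = [
    re.compile(r"^[一二三四五六七八九十]+、"),
    re.compile(r"^[（(][一二三四五六七八九十]+[)）]"),
]

# Hypothetical keyword sets, keyed by official document type label.
KEYWORDS_BY_TYPE = {
    "notification": ["must", "shall", "deadline"],
    "decision": ["decided", "approved"],
}


def extract_first_candidates(sentences, doc_type):
    """Return title sentences and key sentences, in document order."""
    keywords = KEYWORDS_BY_TYPE.get(doc_type, [])
    candidates = []
    for idx, sentence in enumerate(sentences):
        # Assumption: the first sentence of the document is the main title.
        is_title = idx == 0 or any(p.search(sentence) for p in TITLE_PATTERNS)
        is_key = any(k in sentence for k in keywords)
        if is_title or is_key:
            candidates.append(sentence)
    return candidates
```

For example, with the type label `"notification"`, a sentence containing "shall" is kept as a key sentence, while a sentence matching neither a title pattern nor a keyword is dropped.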
  • Step S103: Concurrently invoke a preset second thread to calculate the importance value of each sentence in the sentence set based on the second abstract extraction layer, and determine a second candidate abstract set according to the importance value of each sentence.
  • The second abstract extraction layer includes an importance calculation sublayer and a summary extraction sublayer.
  • The importance calculation sublayer calculates the importance value of each sentence in the sentence set, and the summary extraction sublayer determines the second candidate abstract set based on the importance value of each sentence.
  • The preset second thread is called concurrently with the first thread; it is another execution flow within the calling process and is set according to specific conditions.
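The concurrent invocation of the two threads might look like the following sketch, which uses Python's standard `concurrent.futures.ThreadPoolExecutor`. The function `run_extraction_layers` and its two callable parameters are hypothetical stand-ins for the first and second abstract extraction layers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_extraction_layers(sentences, first_layer, second_layer):
    """Run both extraction layers concurrently on the same sentence set.

    first_layer and second_layer are callables that take the sentence list
    and return a candidate abstract set; both are placeholders here.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(first_layer, sentences)    # "first thread"
        second_future = pool.submit(second_layer, sentences)  # "second thread"
        # result() blocks until the corresponding layer has finished.
        return first_future.result(), second_future.result()
```

In CPython, threads mainly help when the work is I/O-bound; a process pool is a possible alternative for CPU-heavy scoring, which is a design choice outside what the text specifies.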
  • The second candidate abstract set is determined from the importance values as follows: the sentence with the highest importance value in the sentence set is written into an initially empty second candidate abstract set, and the sentences already written into the second candidate abstract set are deleted from the sentence set to obtain the updated sentence set; the importance value of each sentence in the updated sentence set is calculated, and the sentence with the highest importance value is written into the second candidate abstract set; this process is repeated until the number of sentences in the second candidate abstract set reaches the preset number of sentences. It should be noted that the preset number of sentences can be set based on actual conditions, and this solution does not specifically limit it.
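The iterative selection described above can be sketched as a simple greedy loop. The scoring callable `importance` is a placeholder that receives the sentence and the current (updated) sentence set, so the value can be recomputed each round; its signature is an assumption of this example.

```python
def select_second_candidates(sentences, importance, preset_count):
    """Greedily pick the highest-scoring sentence until preset_count is reached.

    importance(sentence, remaining) -> float is a hypothetical scoring
    function evaluated against the updated sentence set on every round.
    """
    remaining = list(sentences)
    selected = []
    while remaining and len(selected) < preset_count:
        # Recompute importance values over the updated sentence set.
        scores = [importance(s, remaining) for s in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        # Move the best sentence from the sentence set to the candidate set.
        selected.append(remaining.pop(best))
    return selected
```

If the score of a sentence does not depend on the other remaining sentences, this reduces to taking the top `preset_count` sentences by importance value.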
  • Step S104: Based on the abstract fusion extraction layer, determine the abstract result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
  • The summary result set of the official document text is determined as follows: the intersection of the first candidate abstract set and the second candidate abstract set is written into an initially empty summary result set to update the summary result set; the intersection is removed from the second candidate abstract set to update the second candidate abstract set; the sentences in the updated second candidate abstract set are sorted according to their importance values; and, following that order, the sentences in the updated second candidate abstract set are written into the summary result set one by one until the number of sentences in the summary result set reaches the preset number of sentences.
  • Here the intersection of the first candidate abstract set and the second candidate abstract set is not empty, that is, the intersection contains at least one sentence of the official document text, so the updated summary result set is no longer empty.
  • The number of sentences in the intersection does not reach the preset number of sentences, so the summary result set is still short of the preset number and sentences from the updated second candidate abstract set need to be written into it.
  • For example, the first candidate abstract set is {A, B, C, D} and the second candidate abstract set is {A, C, E, F, G}, so the intersection of the two sets is {A, C}. Writing the intersection into an empty summary result set gives the summary result set {A, C}, and removing the intersection from the second candidate abstract set gives the updated second candidate abstract set {E, F, G}. In the updated second candidate abstract set, the importance value of sentence "F" is greater than that of sentence "E", and the importance value of sentence "E" is greater than that of sentence "G", so the sorted second candidate abstract set is {F, E, G}. Following this order, the sentences "F", "E", and "G" are written in turn into the summary result set {A, C} until the number of sentences in the summary result set reaches the preset number of sentences.
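The fusion step and the worked example can be reproduced with the sketch below. The function name `fuse_candidates`, the numeric importance values, and the preset count of 4 are hypothetical; only the ordering F > E > G comes from the example.

```python
def fuse_candidates(first, second, importance, preset_count):
    """Merge two candidate abstract sets into a summary result set.

    first, second: lists of sentences; importance: dict sentence -> value.
    Assumes, as in the example, that the intersection is smaller than
    preset_count; a larger intersection is returned unchanged.
    """
    first_set = set(first)
    # Intersection, kept in the order of the second candidate set.
    result = [s for s in second if s in first_set]
    # Second-set sentences outside the intersection, most important first.
    rest = sorted(
        (s for s in second if s not in first_set),
        key=lambda s: importance[s],
        reverse=True,
    )
    for sentence in rest:
        if len(result) >= preset_count:
            break
        result.append(sentence)
    return result


if __name__ == "__main__":
    # Illustrative values only; they satisfy F > E > G.
    scores = {"A": 0.9, "C": 0.8, "E": 0.5, "F": 0.7, "G": 0.2}
    print(fuse_candidates(list("ABCD"), list("ACEFG"), scores, 4))
```

With a preset count of 4 the result is A, C, F, E; sentence G is only added if the preset count is 5 or more.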
  • FIG. 2 is a schematic diagram of a scenario for implementing the method for extracting an official document abstract provided by this embodiment.
  • A user can obtain the summary result set in any of the following ways: the terminal device directly reads the sentence set of the official document text; the terminal device obtains a text image of the official document text and extracts the sentence set from that image; the terminal device sends a directly readable official document text to the server, which obtains the summary result set of the official document text and sends it back to the terminal device; or the terminal device sends a text image of the official document text to the server for recognition, and the server extracts the sentence set of the official document text, obtains the summary result set, and sends it back to the terminal device.
  • The method for extracting official document abstracts provided by the above embodiment obtains a sentence set and a preset official document abstract extraction model, and calls the preset first thread and second thread to extract the title sentences and key sentences from the sentence set and to calculate the importance value of each sentence, which improves the accuracy and speed of these operations. The first candidate abstract set is obtained from the title sentences and key sentences, the second candidate abstract set is determined from the importance value of each sentence, and the abstract result set of the official document text is then determined jointly from the two candidate sets. Because the abstract result set draws on both, the extracted abstract is more accurate, so this application can improve the accuracy of extracting official document abstracts.
  • FIG. 3 is a schematic flowchart of another method for extracting an official document abstract according to an embodiment of the application.
  • The method for extracting an official document abstract includes steps S201 to S207.
  • Step S201: Obtain a sentence set and a preset official document abstract extraction model, where the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer.
  • The first abstract extraction layer includes a title extraction sublayer and a key sentence extraction sublayer; the second abstract extraction layer includes an importance calculation sublayer and a summary extraction sublayer; and the abstract fusion extraction layer is used to extract the summary result set of the official document text.
  • Step S202: Invoke a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and use the title sentences and key sentences as the first candidate abstract set.
  • The preset first thread is called to extract the title sentences and key sentences from the sentence set based on the first abstract extraction layer of the abstract extraction model, and the extracted title sentences and key sentences serve as the first candidate abstract set.
  • The first abstract extraction layer includes a title extraction sublayer and a key sentence extraction sublayer.
  • The title extraction sublayer is used to extract title sentences from the sentence set, and the key sentence extraction sublayer is used to extract key sentences from the sentence set.
  • Step S203: Concurrently call the preset second thread to calculate the position representation index of each sentence according to the position number of each sentence in the sentence set.
  • The position number indicates the order of each sentence in the sentence set and can be represented by numbers such as 1, 2, 3; the position representation index indicates the importance of different sentences.
  • The importance calculation sublayer numbers the sentences in the sentence set sequentially from beginning to end to obtain the position number of each sentence. The position numbers are consecutive: the earlier a sentence appears, the smaller its position number, and the later it appears, the larger its position number.
  • For example, the position number of the first sentence is 1 and the position number of the last sentence is 100. The importance calculation sublayer can also number the sentences from end to beginning, in which case the position numbers run in the reverse order; this is not described further here.
  • The position representation index of each sentence is calculated as follows: the maximum position number is determined from the position numbers of the sentences in the sentence set, and the absolute value of the difference between each sentence's position number and the maximum position number is calculated; the weight coefficient of each sentence is determined from that absolute difference and the maximum position number; and the position representation index of each sentence is determined from the absolute difference and the weight coefficient.
  • The weight coefficient of each sentence is the absolute value of the difference between its position number and the maximum position number, divided by the maximum position number: $\omega_i = |S_i - N| / N$.
  • The position representation index of each sentence is that absolute difference multiplied by the weight coefficient of the sentence: $a_i = |S_i - N| \cdot \omega_i$. The weight coefficient ranges from 0 to 1 and the maximum position number is a fixed value, so the smaller a sentence's position number, the larger its weight coefficient and the larger its position representation index.
  • Here $N$ is the maximum position number, $S_i$ is the position number of each sentence, $\omega_i$ is the weight coefficient of each sentence, and $a_i$ is the position representation index of each sentence.
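As a worked illustration of the calculation just described (weight coefficient = absolute difference from the maximum position number divided by the maximum position number; index = absolute difference times weight coefficient), here is a small sketch. It assumes sentences are numbered 1 to the sentence count from beginning to end; the function name `position_indices` is hypothetical.

```python
def position_indices(num_sentences):
    """Return the position representation index for positions 1..num_sentences."""
    n_max = num_sentences  # maximum position number
    indices = []
    for position in range(1, num_sentences + 1):
        diff = abs(position - n_max)    # |position number - maximum|
        weight = diff / n_max           # weight coefficient, between 0 and 1
        indices.append(diff * weight)   # position representation index
    return indices


if __name__ == "__main__":
    # Four sentences: earlier sentences score higher, the last scores 0.
    print(position_indices(4))  # [2.25, 1.0, 0.25, 0.0]
```

Under this calculation the last sentence always receives an index of 0, and the index grows quadratically toward the start of the document.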
  • Step S204: Obtain the main title sentence from the sentence set, and calculate the similarity between each sentence in the sentence set and the main title sentence.
  • The main title sentence is the main title of the official document text.
  • Calculating the similarity between each sentence in the sentence set and the main title sentence makes it possible to analyze the importance of each sentence and improves the accuracy of extracting official document abstracts.
  • The similarity between each sentence in the sentence set and the main title sentence is calculated as follows: the number of words in each sentence of the sentence set is determined, and the number of words in the main title sentence is determined; the number of identical words shared by each sentence and the main title sentence is counted; and the similarity between each sentence and the main title sentence is calculated from the number of title words, the number of words in the sentence, and the number of identical words.
  • Specifically, the number of identical words shared by the sentence and the main title sentence is multiplied by 2 and then divided by the sum of the number of words in the main title sentence and the number of words in the sentence, which gives the similarity between the sentence and the main title sentence.
  • The expression for calculating the similarity between each sentence in the sentence set and the main title sentence is: $B_i = \dfrac{2 N_j}{n + N_i}$
  • where $n$ is the number of words in the main title sentence, $N_j$ is the number of identical words shared by each sentence and the main title sentence, $N_i$ is the number of words in each sentence, and $B_i$ is the similarity between each sentence and the main title sentence.
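The similarity calculation (twice the number of identical words, divided by the sum of the two word counts) can be sketched as follows. How "identical words" are counted is not spelled out, so this example assumes pre-tokenized input and counts shared words with multiplicity; the function name `title_similarity` is hypothetical.

```python
from collections import Counter


def title_similarity(sentence_words, title_words):
    """Similarity between a sentence and the main title sentence.

    Both arguments are lists of words (tokenization is assumed to have been
    done elsewhere). Shared words are counted with multiplicity.
    """
    common = sum((Counter(sentence_words) & Counter(title_words)).values())
    total = len(sentence_words) + len(title_words)
    return 2 * common / total if total else 0.0


if __name__ == "__main__":
    title = "notice on safety inspection".split()
    sentence = "safety inspection starts monday".split()
    print(title_similarity(sentence, title))  # 2 shared words of 4 + 4 -> 0.5
```

A sentence identical to the title scores 1.0 and a sentence sharing no words with it scores 0.0.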
  • Step S205: Determine the importance value of each sentence in the sentence set according to the similarity between each sentence and the main title sentence and the position representation index of each sentence.
  • Both the similarity between a sentence and the main title sentence and the position representation index of the sentence indicate the importance of the sentence, so determining the importance value from both makes the importance value of each sentence more accurate.
  • The importance value of each sentence in the sentence set is determined as follows: a preset first weighting coefficient and a preset second weighting coefficient are obtained; the first importance value of each sentence is determined from the first weighting coefficient and the position representation index of the sentence; the second importance value of each sentence is determined from the second weighting coefficient and the similarity between the sentence and the main title sentence; and the importance value of each sentence is determined from its first and second importance values.
  • Specifically, the importance value of each sentence is the product of the first weighting coefficient and the position representation index plus the product of the second weighting coefficient and the similarity between the sentence and the main title sentence.
  • The preset first and second weighting coefficients sum to 1 and can be set based on actual conditions, which is not specifically limited in this application.
  • That is, the importance value of each sentence in the sentence set is the first importance value plus the second importance value.
  • The calculation expression for the importance value of each sentence in the sentence set is: $C_i = \alpha \cdot a_i + (1 - \alpha) \cdot B_i$
  • where $\alpha$ is the first weighting coefficient, $1 - \alpha$ is the second weighting coefficient, $a_i$ is the position representation index, $B_i$ is the similarity between each sentence and the main title sentence, and $C_i$ is the importance value of each sentence in the sentence set.
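The weighted combination can be sketched in a few lines. The function name `importance_values` and the default first weighting coefficient of 0.5 are hypothetical; the text only requires that the two weighting coefficients sum to 1.

```python
def importance_values(position_indices, similarities, first_weight=0.5):
    """Combine position indices and title similarities into importance values.

    first_weight is the first weighting coefficient; the second weighting
    coefficient is 1 - first_weight so that the two sum to 1.
    """
    if not 0.0 <= first_weight <= 1.0:
        raise ValueError("first_weight must lie between 0 and 1")
    second_weight = 1.0 - first_weight
    return [
        first_weight * a + second_weight * b
        for a, b in zip(position_indices, similarities)
    ]


if __name__ == "__main__":
    print(importance_values([2.25, 1.0, 0.0], [1.0, 0.5, 0.25]))
    # [1.625, 0.75, 0.125]
```

Raising `first_weight` favors sentences near the start of the document; lowering it favors sentences that share words with the main title.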
  • Step S206: Extract a second candidate abstract set according to the importance value of each sentence.
  • Step S207: Based on the abstract fusion extraction layer, determine the summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
  • Based on the abstract fusion extraction layer, the summary result set of the sentence set is determined from the first candidate abstract set and the second candidate abstract set, that is, the union of the first candidate abstract set and the second candidate abstract set is used as the summary result set of the sentence set, wherein the abstract fusion extraction layer is used to extract the summary result set of the official document text.
  • The summary result set of the official document text is determined as follows: the intersection of the first candidate abstract set and the second candidate abstract set is obtained, and the number of sentences in the intersection is determined; if the number of sentences in the intersection is greater than the preset number of sentences, the intersection is used as the summary result set; if the number of sentences in the intersection is zero, that is, the intersection is an empty set, the sentences in the second candidate abstract set are sorted and, following that order, written into the intersection one by one until the number of sentences in the intersection reaches the preset number of sentences.
  • The preset number of sentences can be set according to the actual situation and is not specifically limited here; for example, 10 sentences can be chosen.
  • the method for extracting official document abstracts provided by the above embodiment determines the importance of each sentence by calculating the position representation index of each sentence and the similarity between each sentence and the main headline sentence.
  • from the similarity between each sentence and the main headline sentence and the position representation index of each sentence, the importance value of each sentence in the sentence set can be determined; this quantifies the importance of each sentence, allows the sentences to be compared directly, and effectively improves the accuracy of extracting official document abstracts.
  • FIG. 4 is a schematic block diagram of an apparatus for extracting an official document abstract according to an embodiment of the application.
  • the apparatus 300 for extracting an official document abstract includes: an acquisition module 301, a first extraction module 302, a second extraction module 303, and an abstract determination module 304.
  • the acquisition module 301 is configured to obtain a sentence set and a preset official document abstract extraction model, wherein the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer;
  • the first extraction module 302 is configured to call a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and use the title sentences and key sentences as the first candidate abstract set; and
  • the second extraction module 303 is configured to concurrently call a preset second thread to calculate the importance value of each sentence in the sentence set based on the second abstract extraction layer, and determine the second candidate abstract set according to the importance value of each sentence;
  • the abstract determination module 304 is configured to determine a summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set based on the abstract fusion extraction layer.
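  • in one embodiment, the first extraction module 302 is further used for:
  • calling the preset first thread to extract title sentences from the sentence set based on the regular expression in the first abstract extraction layer; and
  • obtaining the keyword set corresponding to the official document type label of the sentence set from the first abstract extraction layer, and extracting from the sentence set the key sentences that contain keywords in the keyword set.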
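  • in one embodiment, the abstract determination module 304 is further used for:
  • writing the intersection of the first candidate abstract set and the second candidate abstract set into an empty summary result set to update the summary result set;
  • removing the intersection from the second candidate abstract set to update the second candidate abstract set;
  • sorting the sentences in the updated second candidate abstract set according to their importance values;
  • following that order, writing the sentences in the updated second candidate abstract set into the summary result set one by one until the number of sentences in the summary result set reaches the preset number of sentences.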
  • FIG. 5 is a schematic block diagram of another apparatus for extracting official document abstracts according to an embodiment of the application.
  • the official document abstract extraction device 400 includes: an acquisition module 401, a first extraction module 402, a first calculation module 403, a second calculation module 404, a third calculation module 405, a second extraction module 406, and an abstract determination module 407.
  • the acquisition module 401 is configured to obtain a sentence set and a preset official document abstract extraction model, where the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer;
  • the first extraction module 402 is configured to call a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and use the title sentences and key sentences as a first candidate abstract set; and
  • the first calculation module 403 is configured to concurrently call the preset second thread to calculate the position representation index of each sentence according to the position number of each sentence in the sentence set; and
  • the second calculation module 404 is configured to obtain the main headline sentence from the sentence set, and calculate the similarity between each sentence in the sentence set and the main headline sentence;
  • the third calculation module 405 is configured to determine the importance value of each sentence in the sentence set according to the similarity between each sentence and the main headline sentence and the position representation index of each sentence;
  • the second extraction module 406 is configured to extract a second candidate abstract set according to the importance value of each sentence;
  • the abstract determination module 407 is configured to determine a summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set based on the abstract fusion extraction layer.
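  • in one embodiment, the first calculation module 403 is also used to:
  • determine the maximum position number according to the position number of each sentence in the sentence set, and calculate the absolute value of the difference between the position number of each sentence and the maximum position number;
  • determine the weight coefficient of each sentence in the sentence set according to each absolute difference and the maximum position number;
  • determine the position representation index of each sentence according to the absolute difference between its position number and the maximum position number and its weight coefficient.
  • in one embodiment, the second calculation module 404 is further used for:
  • determining the number of characters in each sentence in the sentence set, and determining the number of characters in the main headline sentence;
  • counting the characters that each sentence shares with the main headline sentence to obtain the number of identical characters for each sentence;
  • calculating the similarity between each sentence in the sentence set and the main headline sentence according to the title character count, the character count of each sentence, and its number of identical characters.
  • in one embodiment, the third calculation module 405 is further configured to:
  • obtain the preset first weight coefficient and second weight coefficient;
  • determine the first importance value of each sentence according to the first weight coefficient and the position representation index of each sentence;
  • determine the second importance value of each sentence according to the second weight coefficient and the similarity between each sentence and the main headline sentence;
  • determine the importance value of each sentence in the sentence set according to the first importance value and the second importance value of each sentence.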
  • the apparatus provided in the foregoing embodiment may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 6.
  • FIG. 6 is a schematic block diagram of a structure of a computer device provided by an embodiment of the application.
  • the computer device can be a server or a terminal device.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile or volatile storage medium and an internal memory.
  • Non-volatile or volatile storage media can store operating systems and computer programs.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any method for extracting official document abstracts.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program stored in the non-volatile or volatile storage medium; when the computer program is executed by the processor, the processor can execute any of the methods for extracting an official document abstract.
  • the network interface is used for network communication, such as sending assigned tasks.
  • those skilled in the art can understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • it should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the processor is used to run a computer program stored in a memory to implement the following steps:
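  • obtaining a sentence set and a preset official document abstract extraction model, wherein the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer;
  • calling a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and using the title sentences and key sentences as the first candidate abstract set; and
  • concurrently calling a preset second thread to calculate the importance value of each sentence in the sentence set based on the second abstract extraction layer, and determining the second candidate abstract set according to the importance value of each sentence;
  • based on the abstract fusion extraction layer, determining the summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set.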
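  • in one embodiment, when implementing the calling of the preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, the processor is used to implement:
  • calling the preset first thread to extract title sentences from the sentence set based on the regular expression in the first abstract extraction layer; and
  • obtaining the keyword set corresponding to the official document type label of the sentence set from the first abstract extraction layer, and extracting from the sentence set the key sentences that contain keywords in the keyword set.
  • in one embodiment, when implementing the concurrent calling of the preset second thread to calculate the importance value of each sentence in the sentence set based on the second abstract extraction layer, the processor is used to implement:
  • concurrently calling the preset second thread to calculate the position representation index of each sentence according to the position number of each sentence in the sentence set; and
  • obtaining the main headline sentence from the sentence set, and calculating the similarity between each sentence in the sentence set and the main headline sentence;
  • determining the importance value of each sentence in the sentence set according to the similarity between each sentence and the main headline sentence and the position representation index of each sentence.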
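  • in one embodiment, when implementing the calculation of the position representation index of each sentence according to the position number of each sentence in the sentence set, the processor is used to implement:
  • determining the maximum position number according to the position number of each sentence in the sentence set, and calculating the absolute value of the difference between the position number of each sentence and the maximum position number;
  • determining the weight coefficient of each sentence in the sentence set according to each absolute difference and the maximum position number;
  • determining the position representation index of each sentence according to the absolute difference between its position number and the maximum position number and its weight coefficient.
  • in one embodiment, when implementing the calculation of the similarity between each sentence in the sentence set and the main headline sentence, the processor is used to implement:
  • determining the number of characters in each sentence in the sentence set, and determining the number of characters in the main headline sentence;
  • counting the characters that each sentence shares with the main headline sentence to obtain the number of identical characters for each sentence;
  • calculating the similarity between each sentence in the sentence set and the main headline sentence according to the title character count, the character count of each sentence, and its number of identical characters.
  • in one embodiment, when implementing the determination of the importance value of each sentence in the sentence set according to the similarity between each sentence and the main headline sentence and the position representation index of each sentence, the processor is used to implement:
  • obtaining the preset first weight coefficient and second weight coefficient;
  • determining the first importance value of each sentence according to the first weight coefficient and the position representation index of each sentence;
  • determining the second importance value of each sentence according to the second weight coefficient and the similarity between each sentence and the main headline sentence;
  • determining the importance value of each sentence in the sentence set according to the first importance value and the second importance value of each sentence.
  • in one embodiment, when implementing the determination of the summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set based on the abstract fusion extraction layer, the processor is used to implement:
  • writing the intersection of the first candidate abstract set and the second candidate abstract set into an empty summary result set to update the summary result set;
  • removing the intersection from the second candidate abstract set to update the second candidate abstract set;
  • sorting the sentences in the updated second candidate abstract set according to their importance values;
  • following that order, writing the sentences in the updated second candidate abstract set into the summary result set one by one until the number of sentences in the summary result set reaches the preset number of sentences.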
  • the embodiments of the present application also provide a computer-readable storage medium, which may be non-volatile or volatile; a computer program is stored on the computer-readable storage medium, the computer program includes program instructions, and for the method implemented when the program instructions are executed, reference may be made to the various embodiments of the method for extracting an official document abstract of this application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, for example, the hard disk or memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.

Abstract

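A method, apparatus, device, and computer-readable storage medium for extracting an official document abstract. The method includes: obtaining a sentence set and a preset official document abstract extraction model, where the model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer; calling a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and using the title sentences and key sentences as a first candidate abstract set; concurrently calling a preset second thread to calculate the importance value of each sentence in the sentence set based on the second abstract extraction layer, and determining a second candidate abstract set according to the importance value of each sentence; and, based on the abstract fusion extraction layer, determining the summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set. The application relates to the field of data processing and can improve the accuracy of official document abstract extraction.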

Description

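Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract
This application claims priority to the Chinese patent application No. CN202010100140.1, entitled "Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract", filed with the China Patent Office on February 18, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of data processing, and in particular to a method, apparatus, device, and computer-readable storage medium for extracting an official document abstract.
Background
At present, abstract extraction technology can be used to extract abstracts from official documents. The main techniques fall into two categories, extractive and generative. Extractive methods take important sentences directly from the text, then sort and combine them and output the result as the final abstract; generative methods refine and summarize the original content and allow new words or sentences to be generated to form the abstract.
The inventors realized that generative summarization requires a large amount of labeled data, while abstract labeling has no unified standard and is time-consuming, so it cannot extract the abstract of an official document accurately. The commonly used extractive method is TextRank, but the original TextRank method determines the importance of a sentence only from sentence similarity and then extracts the most important sentences. Official documents differ from ordinary text, and sentence similarity alone cannot accurately represent the importance of a sentence in an official document, which makes the extracted abstract inaccurate. How to improve the accuracy of official document abstract extraction is therefore an urgent problem.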
发明内容
第一方面,本申请提供一种公文摘要提取方法,所述公文摘要提取方法包括以下步骤:
获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
第二方面,本申请还提供一种公文摘要提取装置,所述公文摘要提取装置包括:
获取模块,用于获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
第一提取模块,用于调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
第二提取模块,用于并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
摘要确定模块,用于基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
第三方面,本申请还提供一种计算机设备,所述计算机设备包括处理器、存储器、以及存储在所述存储器上并可被所述处理器执行的计算机程序,其中所述计算机程序被所述处理器执行时,实现如下步骤:
获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
第四方面,本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,其中所述计算机程序被处理器执行时,实现如下步骤:
获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种公文摘要提取方法的流程示意图;
图2为实施本实施例提供的公文摘要提取方法的一场景示意图;
图3为本申请实施例提供的另一种公文摘要提取方法的流程示意图;
图4为本申请实施例提供的一种公文摘要提取装置的示意性框图;
图5为本申请实施例提供的另一种公文摘要提取装置的示意性框图;
图6为本申请一实施例涉及的计算机设备的结构示意框图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
附图中所示的流程图仅是示例说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解、组合或部分合并,因此实际执行的顺序有可能根据实际情况改变。
本申请实施例提供一种公文摘要提取方法、装置、计算机设备及计算机可读存储介质。其中,该公文摘要提取方法可应用于终端设备或服务器中,该终端设备可以手机、平板电脑、笔记本电脑、台式电脑、个人数字助理和穿戴式设备等电子设备,该服务器可以为单台的服务器,也可以为由多台服务器组成的服务器集群。以下以该公文摘要提取方法应用于服务器为例进行解释说明。
下面结合附图,对本申请的一些实施方式作详细说明。在不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
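Please refer to FIG. 1, which is a schematic flowchart of a method for extracting an official document abstract according to an embodiment of this application.
As shown in FIG. 1, the method for extracting an official document abstract includes steps S101 to S104.
Step S101: Obtain a sentence set and a preset official document abstract extraction model, where the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer.
When a user needs the sentence set of an official document text, the server can obtain it from a database in which the sentence set is pre-stored, or from an external storage device that stores the sentence set. The sentence set includes a number of sentences from the official document text; the database may be a local database or a cloud database; and the external storage device may be a plug-in hard disk, a Secure Digital card, a flash card, or the like equipped on the computer device. Alternatively, the server first obtains the official document text and then splits it to obtain the sentence set. In this case the official document text is an electronic document that the server can read directly, such as a Word document, a txt document, or a WPS document.
In one embodiment, the sentence set of the official document text is obtained as follows: obtain the official document text, which is an electronic document that cannot be read directly; convert the official document text into a text image and perform character recognition on the converted text image; extract the text from the recognized text image and split it to obtain the sentence set of the official document text. Electronic documents that cannot be read directly include pdf documents and tif documents. It should be noted that the format of the text image can be set according to the actual situation, for example the JPEG or PNG image format.
In one embodiment, the sentence set of the official document text is obtained as follows: obtain the user's instruction to photograph a paper official document, photograph the paper document, and send the resulting text image to the server; the server performs character recognition on the text image, extracts the recognized official document text, and splits it to obtain the sentence set of the official document text. It should be noted that after character recognition the server can extract the sentences from the text image in real time, or store the text image first and extract the sentences from the stored images together later.
The official document text is split into the sentence set as follows: perform character recognition on the text image to obtain the recognized official document text; then, according to the sentence-break identifiers in the recognized text, extract each sentence to obtain the sentence set of the official document text. A sentence-break identifier is a symbol that grammatically marks the end of a sentence, including the full stop, semicolon, question mark, exclamation mark, and line break.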
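The splitting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_sentences` is hypothetical, the input is taken to be text that has already been recognized, and the ASCII full stop is deliberately left out of the terminator set so that decimals and numbered headings are not broken apart (the patent does not address this).

```python
import re
from typing import List

# Sentence-break identifiers named in the text: full stop, semicolon, question
# mark, exclamation mark (full-width forms plus ASCII ; ? !) and line breaks.
# Assumption: the ASCII "." is excluded to avoid splitting "3.5" or "1. Scope".
_TERMINATORS = re.compile(r"[。;?!;?!\n]+")


def split_sentences(text: str) -> List[str]:
    """Split recognized document text into a list of non-empty sentences."""
    parts = _TERMINATORS.split(text)
    return [part.strip() for part in parts if part.strip()]
```

A call such as `split_sentences(document_text)` returns the sentences in document order, which is also the order the later position numbering relies on.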
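After the sentence set is obtained, the preset official document abstract extraction model is obtained. The sentence set includes a number of sentences determined according to the official document text to be extracted, and the preset official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer.
The first abstract extraction layer includes a title extraction sublayer, which is used to extract title sentences from the sentence set. The second abstract extraction layer includes an importance calculation sublayer and an abstract extraction sublayer: the importance calculation sublayer calculates the importance value of each sentence in the sentence set, and the abstract extraction sublayer determines, based on those importance values, a candidate abstract set containing a preset number of sentences. The abstract fusion extraction layer is used to extract the subsequent summary result set of the official document text.
Step S102: Call a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and use the title sentences and key sentences as the first candidate abstract set.
After obtaining the sentence set and the preset official document abstract extraction model, the server or terminal device calls the preset first thread and, based on the first abstract extraction layer in the model, extracts title sentences and key sentences from the sentence set and uses them as the first candidate abstract set. The preset first thread is one execution flow within the calling process and is configured according to the specific situation. The first abstract extraction layer includes a title extraction sublayer and a key sentence extraction sublayer: the title extraction sublayer extracts title sentences from the sentence set, and the key sentence extraction sublayer extracts key sentences from it.
In one embodiment, the title sentences and key sentences are extracted from the sentence set as follows: call the preset first thread to extract title sentences from the sentence set based on the regular expression in the first abstract extraction layer; and obtain from the first abstract extraction layer the keyword set corresponding to the official document type label of the sentence set, and extract from the sentence set the key sentences that contain keywords in the keyword set. The title sentences include the main title, first-level headings, second-level headings, and so on. The official document type label identifies the category of official document; the categories include "Decision" (决定), "Opinion" (意见), "Notice" (通知), "Circular" (通报), "Report" (报告), "Request for Instructions" (请示), "Reply" (批复), "Letter" (函), and "Meeting Minutes" (会议纪要). It should be noted that the regular expression can be set based on the actual situation and is not specifically limited in this application. By collecting, for each category, the keywords that mark the important content of official documents of that category, forming a keyword set, and storing each category in association with its keyword set, a mapping table between official document categories and keyword sets can be obtained.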
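A minimal sketch of this first layer follows. Everything specific in it is an assumption made for illustration: the heading pattern `HEADING_RE`, the keyword table `KEYWORDS_BY_TYPE`, the function name, and the rule that the first sentence is the main title. The patent leaves the regular expression and the keyword sets open.

```python
import re
from typing import Dict, List, Set

# Hypothetical heading pattern; the patent does not fix the regular expression.
HEADING_RE = re.compile(
    r"^\s*(?:[一二三四五六七八九十]+、"   # e.g. "一、" or "十二、"
    r"|([一二三四五六七八九十]+)"       # e.g. "(二)"
    r"|\d+(?:\.\d+)*[.、\s])"           # e.g. "1.", "2.3 " or "3、"
)

# Hypothetical mapping table from document type label to its keyword set.
KEYWORDS_BY_TYPE: Dict[str, Set[str]] = {
    "通知": {"要求", "截止", "务必"},
    "请示": {"妥否", "请批示"},
}


def extract_first_candidates(sentences: List[str], doc_type: str) -> List[str]:
    """Return title sentences and key sentences, in document order."""
    keywords = KEYWORDS_BY_TYPE.get(doc_type, set())
    picked = []
    for index, sentence in enumerate(sentences):
        # Assumption: the first sentence of the set is the main title.
        is_title = index == 0 or HEADING_RE.match(sentence) is not None
        is_key = any(keyword in sentence for keyword in keywords)
        if is_title or is_key:
            picked.append(sentence)
    return picked
```

Keeping the keyword table as plain data mirrors the mapping table described above: supporting a new document category only requires a new entry.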
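Step S103: Concurrently call a preset second thread to calculate the importance value of each sentence in the sentence set based on the second abstract extraction layer, and determine the second candidate abstract set according to the importance value of each sentence.
While the first thread is running to extract title sentences and key sentences from the sentence set through the first abstract extraction layer, the second thread runs concurrently to calculate the importance value of each sentence in the sentence set through the second abstract extraction layer and to determine the second candidate abstract set according to those values. The second abstract extraction layer includes an importance calculation sublayer and an abstract extraction sublayer: the importance value of each sentence is calculated by the former, and the second candidate abstract set is determined from those values by the latter. In addition, the preset second thread is called concurrently with the first thread; it is another execution flow within the calling process and is configured according to the specific situation.
Further, the second candidate abstract set is determined from the importance values as follows: write the sentence with the highest importance value in the sentence set into an empty second candidate abstract set, and delete that sentence from the sentence set to obtain an updated sentence set; calculate the importance value of each sentence in the updated sentence set and write the sentence with the highest importance value into the second candidate abstract set; repeat this process until the number of sentences in the second candidate abstract set reaches the preset number of sentences. It should be noted that the preset number of sentences can be set based on the actual situation and is not specifically limited here.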
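The two concurrent execution flows and the repeated "pick the best, remove it, rescore" loop can be sketched as below. The names `greedy_select`, `run_layers`, and the `score_fn(sentence, remaining)` signature are hypothetical; a thread pool from Python's standard library stands in for the "preset threads", which the patent does not tie to any particular API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

ScoreFn = Callable[[str, List[str]], float]
Layer = Callable[[List[str]], List[str]]


def greedy_select(sentences: List[str], score_fn: ScoreFn, limit: int) -> List[str]:
    """Repeatedly move the highest-scoring sentence into the candidate set."""
    remaining = list(sentences)
    chosen: List[str] = []
    while remaining and len(chosen) < limit:
        # Scores are recomputed on the updated sentence set in every round.
        scores = [score_fn(sentence, remaining) for sentence in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        chosen.append(remaining.pop(best))
    return chosen


def run_layers(sentences: List[str], first_layer: Layer,
               second_layer: Layer) -> Tuple[List[str], List[str]]:
    """Run both extraction layers concurrently and return both candidate sets."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(first_layer, sentences)
        second = pool.submit(second_layer, sentences)
        return first.result(), second.result()
```

If the scores do not depend on the remaining sentences, the loop reduces to taking the top `limit` sentences by score. In CPython, threads mainly help when the layers wait on I/O such as OCR or database calls, since CPU-bound Python code is serialized by the interpreter lock.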
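Step S104: Based on the abstract fusion extraction layer, determine the summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
After the first and second candidate abstract sets are determined, the summary result set of the sentence set is determined from them based on the abstract fusion extraction layer. That is, the intersection of the first candidate abstract set and the second candidate abstract set is obtained, and the summary result set of the official document text is determined according to that intersection and the second candidate abstract set. The abstract fusion extraction layer is used to extract the summary result set of the official document text.
In one embodiment, the summary result set of the official document text is determined as follows: write the intersection of the first candidate abstract set and the second candidate abstract set into an empty summary result set to update the summary result set; remove the intersection from the second candidate abstract set to update the second candidate abstract set; sort the sentences in the updated second candidate abstract set according to their importance values; and, following that order, write the sentences of the updated second candidate abstract set into the summary result set one by one until the number of sentences in the summary result set reaches the preset number of sentences.
It should be noted that the intersection of the first candidate abstract set and the second candidate abstract set is not an empty set; that is, it contains at least one sentence of the official document text, so the summary result set is no longer empty after the update. At the same time, the number of sentences in the intersection has not reached the preset number of sentences, so the summary result set does not yet satisfy the preset number and sentences from the updated second candidate abstract set need to be written into it. Removing the intersection from the second candidate abstract set, sorting by importance value, and writing the top-ranked sentences into the summary result set can improve the accuracy of official document abstract extraction.
For example, the first candidate abstract set is {A, B, C, D} and the second candidate abstract set is {A, C, E, F, G}, so their intersection is {A, C}. Writing the intersection into the empty summary result set gives the new summary result set {A, C}; removing the intersection from the second candidate abstract set gives the updated second candidate abstract set {E, F, G}. If, in the updated second candidate abstract set, the importance value of sentence F is greater than that of sentence E, and that of sentence E is greater than that of sentence G, the sorted second candidate abstract set is {F, E, G}. Following this order, sentences F, E, and G are written in turn into the summary result set {A, C} until the number of sentences in the summary result set reaches the preset number of sentences.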
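The worked example can be reproduced with a short sketch. The function name `fuse_candidates`, the representation of the second candidate set as a dictionary from sentence to importance value, and the concrete scores are assumptions made for illustration.

```python
from typing import Dict, List


def fuse_candidates(first: List[str], second: Dict[str, float], limit: int) -> List[str]:
    """Start from the intersection, then fill up from the rest of the second set."""
    first_set = set(first)
    # Intersection, kept in the insertion order of the second candidate set.
    result = [sentence for sentence in second if sentence in first_set]
    # Second candidate set with the intersection removed, highest importance first.
    leftovers = sorted(
        (sentence for sentence in second if sentence not in first_set),
        key=lambda sentence: second[sentence],
        reverse=True,
    )
    for sentence in leftovers:
        if len(result) >= limit:
            break
        result.append(sentence)
    return result


# Hypothetical importance values chosen so that F > E > G, as in the example.
second_set = {"A": 0.9, "C": 0.8, "E": 0.5, "F": 0.7, "G": 0.2}
summary = fuse_candidates(["A", "B", "C", "D"], second_set, limit=4)
```

With a preset number of 4 the result is A, C, F, E; with 5 it also includes G. The sketch does not truncate the intersection itself, in line with the note above that the intersection is assumed to be smaller than the preset number.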
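As shown in FIG. 2, which is a schematic diagram of a scenario for implementing the method for extracting an official document abstract provided by this embodiment, when a user needs the abstract of an official document text, the terminal device can read the sentence set of the official document text directly to obtain the summary result set, or it can obtain a text image of the official document text and extract the sentence set from that image to obtain the summary result set. Alternatively, the terminal device can send a directly readable official document text to the server, which produces the summary result set and sends it back to the terminal device; or the terminal device can send the text image of the official document text to the server for recognition, and the server extracts the sentence set, produces the summary result set, and sends it back to the terminal device.
The method provided by the above embodiment obtains the sentence set and the preset official document abstract extraction model and calls the preset first and second threads to extract the title sentences, the key sentences, and the importance value of each sentence from the sentence set, which improves the accuracy and speed of obtaining them. The first candidate abstract set is obtained from the title sentences and key sentences, the second candidate abstract set is determined from the importance values, and the summary result set of the official document text is then determined jointly from the two candidate sets. The abstract extracted in this way is more accurate, so this application can improve the accuracy of official document abstract extraction.
Please refer to FIG. 3, which is a schematic flowchart of another method for extracting an official document abstract according to an embodiment of this application.
As shown in FIG. 3, the method for extracting an official document abstract includes steps S201 to S207.
Step S201: Obtain a sentence set and a preset official document abstract extraction model, where the sentence set includes a number of sentences determined according to the official document text to be extracted, and the official document abstract extraction model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer.
After the sentence set is obtained, the preset official document abstract extraction model is obtained. The sentence set includes a number of sentences determined according to the official document text to be extracted, and the model includes a first abstract extraction layer, a second abstract extraction layer, and an abstract fusion extraction layer. The first abstract extraction layer includes a title extraction sublayer and a key sentence extraction sublayer; the second abstract extraction layer includes an importance calculation sublayer and an abstract extraction sublayer; and the abstract fusion extraction layer is used to extract the subsequent summary result set of the official document text.
Step S202: Call a preset first thread to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer, and use the title sentences and key sentences as the first candidate abstract set.
After the sentence set and the official document abstract extraction model are obtained, the preset first thread is called to extract title sentences and key sentences from the sentence set based on the first abstract extraction layer in the model, and the extracted title sentences and key sentences are used as the first candidate abstract set. The first abstract extraction layer includes a title extraction sublayer and a key sentence extraction sublayer: the title extraction sublayer extracts title sentences from the sentence set, and the key sentence extraction sublayer extracts key sentences from it.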
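Step S203: Concurrently call the preset second thread to calculate the position representation index of each sentence according to the position number of each sentence in the sentence set.
The preset second thread is called concurrently to calculate the position representation index of each sentence according to its position number in the sentence set. The position number indicates the order of each sentence in the sentence set and can be expressed as a number, for example 1, 2, 3; the position representation index indicates the importance of different sentences. It should be noted that the importance calculation sublayer numbers the sentences in the sentence set one by one from beginning to end to obtain the number of each sentence. The numbers are consecutive: sentences earlier in the order have smaller position numbers and later sentences have larger ones. For example, in a sentence set of 100 sentences, the first sentence has position number 1 and the last has position number 100. It can be understood that the importance calculation sublayer may also number the sentences from end to beginning, in which case the numbers are likewise consecutive and assigned in proportion to that ordering; details are not repeated here.
In one embodiment, the position representation index of each sentence is calculated as follows: determine the maximum position number from the position numbers of the sentences in the sentence set, and calculate the absolute value of the difference between the position number of each sentence and the maximum position number; determine the weight coefficient of each sentence according to each absolute difference and the maximum position number; and determine the position representation index of each sentence according to its absolute difference and its weight coefficient.
The weight coefficient of each sentence is the absolute difference between its position number and the maximum position number, divided by the maximum position number. The position representation index of each sentence is that absolute difference multiplied by the weight coefficient. It should be noted that the weight coefficient lies between 0 and 1 and the maximum position number is a fixed value; the smaller the position number of a sentence, the larger its weight coefficient and the larger its position representation index.
Specifically, the weight coefficient of each sentence is calculated as:
$\eta_i = |S_i - N| / N$
and the position representation index of each sentence is calculated as:
$A_i = \eta_i \cdot |S_i - N|$
where $N$ is the maximum position number, $S_i$ is the position number of each sentence, $\eta_i$ is the weight coefficient of each sentence, and $A_i$ is the position representation index of each sentence.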
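As a small worked sketch of the two expressions above (the function name `position_index` is hypothetical):

```python
def position_index(position: int, max_position: int) -> float:
    """A_i = eta_i * |S_i - N|, with eta_i = |S_i - N| / N."""
    diff = abs(position - max_position)   # |S_i - N|
    weight = diff / max_position          # eta_i, always in [0, 1)
    return weight * diff                  # equals (N - S_i)^2 / N
```

For a 100-sentence set, the first sentence gets (99 × 99) / 100 = 98.01 and the last sentence gets 0, so the index falls off quadratically towards the end of the document.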
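Step S204: Obtain the main headline sentence from the sentence set, and calculate the similarity between each sentence in the sentence set and the main headline sentence.
At the same time, the main headline sentence is obtained from the sentence set, and the similarity between each sentence in the sentence set and the main headline sentence is calculated. The main headline sentence is the main title of the official document text. Calculating this similarity makes it possible to analyze the importance of each sentence in the sentence set and improves the accuracy of official document abstract extraction.
In one embodiment, the similarity between each sentence in the sentence set and the main headline sentence is calculated as follows: determine the number of characters in each sentence of the sentence set and the number of characters in the main headline sentence; count the characters that each sentence shares with the main headline sentence to obtain the number of identical characters for each sentence; and calculate the similarity from the title character count, the character count of each sentence, and its number of identical characters.
The similarity between a sentence and the main headline sentence is twice the number of identical characters, divided by the sum of the title character count and the character count of the sentence. The more characters a sentence shares with the main headline sentence, the higher the similarity.
Specifically, the similarity between each sentence in the sentence set and the main headline sentence is calculated as:
$B_i = 2 N_j / (n + N_i)$
where $n$ is the number of characters in the main headline sentence, $N_j$ is the number of identical characters in the sentence and the main headline sentence, $N_i$ is the number of characters in the sentence, and $B_i$ is the similarity between the sentence and the main headline sentence.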
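The similarity expression can be sketched as follows. How "identical characters" are counted is not spelled out in the text; this sketch assumes a multiset intersection, so a character repeated in both strings counts as many times as it appears in both. The function name is hypothetical.

```python
from collections import Counter


def title_similarity(sentence: str, title: str) -> float:
    """B_i = 2 * N_j / (n + N_i), with N_j counted as a multiset overlap."""
    total = len(title) + len(sentence)    # n + N_i
    if total == 0:
        return 0.0
    # Assumption: shared characters are counted with multiplicity.
    shared = sum((Counter(sentence) & Counter(title)).values())   # N_j
    return 2 * shared / total
```

Under this counting the value always lies between 0 (no shared characters) and 1 (same characters in both strings).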
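Step S205: Determine the importance value of each sentence in the sentence set according to the similarity between each sentence and the main headline sentence and the position representation index of each sentence.
In the sentence set, both the similarity between a sentence and the main headline sentence and the position representation index of the sentence can indicate the importance of the sentence. Determining the importance value of each sentence from both quantities makes the value more accurate.
In one embodiment, the importance value of each sentence in the sentence set is determined as follows: obtain the preset first weight coefficient and second weight coefficient; determine the first importance value of each sentence from the first weight coefficient and the position representation index of the sentence; determine the second importance value of each sentence from the second weight coefficient and the similarity between the sentence and the main headline sentence; and determine the importance value of each sentence from its first and second importance values.
The importance value of each sentence is the product of the first weight coefficient and the position representation index plus the product of the second weight coefficient and the similarity between the sentence and the main headline sentence. It should be noted that the preset first and second weight coefficients sum to 1 and can be set based on actual conditions; they are not specifically limited in this application.
Specifically, the importance value of each sentence in the sentence set is the first importance value plus the second importance value, calculated as:
$C_i = \alpha A_i + (1-\alpha) B_i$
where $\alpha$ is the first weight coefficient, $1-\alpha$ is the second weight coefficient, $A_i$ is the position representation index, $B_i$ is the similarity between each sentence and the main headline sentence, and $C_i$ is the importance value of each sentence in the sentence set.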
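A minimal sketch of the weighted combination is shown below; the function name and the default value of `alpha` are assumptions, since the text only requires the two weights to sum to 1.

```python
def importance(position_idx: float, similarity: float, alpha: float = 0.5) -> float:
    """C_i = alpha * A_i + (1 - alpha) * B_i."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] so that the weights sum to 1")
    first = alpha * position_idx            # first importance value
    second = (1.0 - alpha) * similarity     # second importance value
    return first + second
```

One practical point follows from the formulas as written: $A_i$ ranges from 0 up to $(N-1)^2/N$ (about 98 for a 100-sentence document), while $B_i$ cannot exceed 1. Unless $\alpha$ is small or $A_i$ is rescaled, the position term will tend to dominate; the text leaves that choice open.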
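Step S206: Extract the second candidate abstract set according to the importance value of each sentence.
The second candidate abstract set is determined from the importance values as follows: sort the sentences by the magnitude of their importance values to obtain a sentence sequence; then, following that sequence, write the leading sentences into the candidate abstract set one by one until the number of sentences in the candidate abstract set reaches the preset number of sentences.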
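Sorting and truncating can be sketched as below. The function name is hypothetical, and descending order is assumed, since the leading sentences of the sequence are the ones kept.

```python
from typing import List, Sequence


def second_candidate_set(sentences: Sequence[str], scores: Sequence[float],
                         limit: int) -> List[str]:
    """Return up to `limit` sentences, highest importance value first."""
    # sorted() is stable, so sentences with equal scores keep document order.
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:limit]]
```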
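Step S207: Based on the abstract fusion extraction layer, determine the summary result set of the official document text according to the first candidate abstract set and the second candidate abstract set.
After the first and second candidate abstract sets are determined, the summary result set of the sentence set is determined from them based on the abstract fusion extraction layer; that is, the union of the first candidate abstract set and the second candidate abstract set is used as the summary result set of the sentence set. The abstract fusion extraction layer is used to extract the summary result set of the official document text.
In one embodiment, the summary result set of the official document text is determined as follows: obtain the intersection of the first candidate abstract set and the second candidate abstract set, and determine the number of sentences in the intersection; if the number of sentences in the intersection is greater than the preset number of sentences, use the intersection as the summary result set; if the number of sentences in the intersection is zero, that is, the intersection is an empty set, sort the sentences in the second candidate abstract set and, in that order, write them into the intersection one by one until the number of sentences in the intersection reaches the preset number of sentences. The preset number of sentences can be set according to the actual situation and is not specifically limited here; 10 sentences is one option.
The method provided by the above embodiment determines the importance of each sentence by calculating the position representation index of each sentence and the similarity between each sentence and the main headline sentence. From these two quantities the importance value of each sentence in the sentence set can be determined, which quantifies the importance of each sentence, allows the sentences to be compared directly, and effectively improves the accuracy of official document abstract extraction.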
请参照图4,图4为本申请实施例提供的一种公文摘要提取装置的示意性框图。
如图4所示,该公文摘要提取装置300,包括:获取模块301、第一提取模块302、第二提取模块303和摘要确定模块304。
获取模块301,用于获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
第一提取模块302,用于调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
第二提取模块303,用于并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
摘要确定模块304,用于基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
在一个实施例中,第一提取模块302,还用于:
调用预设的第一线程基于所述第一摘要提取层中的正则表达式从所述语句集中提取标题语句;以及
从所述第一摘要提取层中获取所述语句集的公文类型标签对应的关键词集合,并从所 述语句集中提取包含所述关键词集合中的关键词的关键语句。
在一个实施例中,摘要确定模块304,还用于:
将所述第一候选摘要集与所述第二候选摘要集的交集写入空白的摘要结果集,以更新所述摘要结果集;
从所述第二候选摘要集中去除所述交集,以更新所述第二候选摘要集;
根据更新后的第二候选摘要集中每个语句的重要程度值,对更新后的第二候选摘要集中的每个语句进行排序;
根据更新后的第二候选摘要集中每个语句的排序,依次将更新后的第二候选摘要集中的语句写入所述摘要结果集,直至所述摘要结果集中的语句个数达到预设语句个数。
请参照图5,图5为本申请实施例提供的另一种公文摘要提取装置的示意性框图。
如图5所示,该公文摘要提取装置400,包括:获取模块401、第一提取模块402、第一计算模块403、第二计算模块404、第三计算模块405、第二提取模块406和摘要确定模块407。
获取模块401,用于获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
第一提取模块402,用于调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
第一计算模块403,用于并发调用预设的第二线程根据所述语句集中每个语句的位置编号,计算每个语句的位置表征指数;以及
第二计算模块404,用于从所述语句集中获取主标题语句,并计算所述语句集中每个语句与所述主标题语句之间的相似度;
第三计算模块405,用于根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值;
第二提取模块406,用于根据每个语句的重要程度值提取第二候选摘要集;
摘要确定模块407,用于基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
在一实施例中,第一计算模块403还用于:
根据所述语句集中每个语句的位置编号,确定最大位置编号,并计算所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值;
根据每个所述差值绝对值和最大位置编号,确定所述语句集中每个语句的权重系数;
根据所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值以及每个语句的权重系数,确定每个语句的位置表征指数。
在一实施例中,第二计算模块404还用于:
确定所述语句集中每个语句各自对应的文字个数,并确定所述主标题语句的标题字数;
统计每个语句和所述主标题语句中相同文字的个数,得到每个语句各自对应的相同文字个数;
根据所述标题字数以及每个语句各自对应的所述文字个数和所述相同文字个数,计算所述语句集中每个语句与所述主标题语句之间的相似度。
在一实施例中,第三计算模块405还用于:
获取预设的第一权重系数和第二权重系数;
根据所述第一权重系数和每个语句的位置表征指数,确定每个语句的第一重要程度值;
根据所述第二权重系数以及每个语句与所述主标题语句之间的相似度,确定每个语句的第二重要程度值;
根据每个语句的所述第一重要程度值和第二重要程度值,确定所述语句集中每个语句 的重要程度值。
需要说明的是,所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的装置和各模块及单元的具体工作过程,可以参考前述公文摘要提取方法实施例中的对应过程,在此不再赘述。
上述实施例提供的装置可以实现为一种计算机程序的形式,该计算机程序可以在如图6所示的计算机设备上运行。
请参阅图6,图6为本申请实施例提供的一种计算机设备的结构示意性框图。该计算机设备可以为服务器或终端设备。
如图6所示,该计算机设备包括通过系统总线连接的处理器、存储器和网络接口,其中,存储器可以包括非易失性或易失性存储介质和内存储器。
非易失性或易失性存储介质可存储操作系统和计算机程序。该计算机程序包括程序指令,该程序指令被执行时,可使得处理器执行任意一种公文摘要提取方法。
处理器用于提供计算和控制能力,支撑整个计算机设备的运行。
内存储器为非易失性或易失性存储介质中的计算机程序的运行提供环境,该计算机程序被处理器执行时,可使得处理器执行任意一种公文摘要提取方法。
该网络接口用于进行网络通信,如发送分配的任务等。本领域技术人员可以理解,图6中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
应当理解的是,处理器可以是中央处理单元(Central Processing Unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
其中,在一个实施例中,所述处理器用于运行存储在存储器中的计算机程序,以实现如下步骤:
获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
在一个实施例中,所述处理器在实现时所述调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,用于实现:
调用预设的第一线程基于所述第一摘要提取层中的正则表达式从所述语句集中提取标题语句;以及
从所述第一摘要提取层中获取所述语句集的公文类型标签对应的关键词集合,并从所述语句集中提取包含所述关键词集合中的关键词的关键语句。
在一个实施例中,所述处理器在实现所述并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值时,用于实现:
并发调用预设的第二线程根据所述语句集中每个语句的位置编号,计算每个语句的位 置表征指数;以及
从所述语句集中获取主标题语句,并计算所述语句集中每个语句与所述主标题语句之间的相似度;
根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值。
在一个实施例中,所述处理器在实现所述根据所述语句集中每个语句的位置编号,计算每个语句的位置表征指数时,用于实现:
根据所述语句集中每个语句的位置编号,确定最大位置编号,并计算所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值;
根据每个所述差值绝对值和最大位置编号,确定所述语句集中每个语句的权重系数;
根据所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值以及每个语句的权重系数,确定每个语句的位置表征指数。
在一个实施例中,所述处理器在实现所述计算所述语句集中每个语句与所述主标题语句之间的相似度时,用于实现:
确定所述语句集中每个语句各自对应的文字个数,并确定所述主标题语句的标题字数;
统计每个语句和所述主标题语句中相同文字的个数,得到每个语句各自对应的相同文字个数;
根据所述标题字数以及每个语句各自对应的所述文字个数和所述相同文字个数,计算所述语句集中每个语句与所述主标题语句之间的相似度。
在一个实施例中,所述处理器在实现所述根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值时,用于实现:
获取预设的第一权重系数和第二权重系数;
根据所述第一权重系数和每个语句的位置表征指数,确定每个语句的第一重要程度值;
根据所述第二权重系数以及每个语句与所述主标题语句之间的相似度,确定每个语句的第二重要程度值;
根据每个语句的所述第一重要程度值和第二重要程度值,确定所述语句集中每个语句的重要程度值。
在一个实施例中,所述处理器在实现所述基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集时,用于实现:
将所述第一候选摘要集与所述第二候选摘要集的交集写入空白的摘要结果集,以更新所述摘要结果集;
从所述第二候选摘要集中去除所述交集,以更新所述第二候选摘要集;
根据更新后的第二候选摘要集中每个语句的重要程度值,对更新后的第二候选摘要集中的每个语句进行排序;
根据更新后的第二候选摘要集中每个语句的排序,依次将更新后的第二候选摘要集中的语句写入所述摘要结果集,直至所述摘要结果集中的语句个数达到预设语句个数。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性,所述计算机可读存储介质上存储有计算机程序,所述计算机程序中包括程序指令,所述程序指令被执行时所实现的方法可参照本申请公文摘要提取方法的各个实施例。
其中,所述计算机可读存储介质可以是前述实施例所述的计算机设备的内部存储单元,例如所述计算机设备的硬盘或内存。所述计算机可读存储介质也可以是所述计算机设备的外部存储设备,例如所述计算机设备上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。
应当理解,在此本申请说明书中所使用的术语仅仅是出于描述特定实施例的目的而并 不意在限制本申请。如在本申请说明书和所附权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。
还应当理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种公文摘要提取方法,其中,包括:
    获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
    调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
    并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
    基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
  2. 根据权利要求1所述的公文摘要提取方法,其中,所述调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,包括:
    调用预设的第一线程基于所述第一摘要提取层中的正则表达式从所述语句集中提取标题语句;以及
    从所述第一摘要提取层中获取所述语句集的公文类型标签对应的关键词集合,并从所述语句集中提取包含所述关键词集合中的关键词的关键语句。
  3. 根据权利要求1所述的公文摘要提取方法,其中,所述并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,包括:
    并发调用预设的第二线程根据所述语句集中每个语句的位置编号,计算每个语句的位置表征指数;以及
    从所述语句集中获取主标题语句,并计算所述语句集中每个语句与所述主标题语句之间的相似度;
    根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值。
  4. 根据权利要求3所述的公文摘要提取方法,其中,所述根据所述语句集中每个语句的位置编号,计算每个语句的位置表征指数,包括:
    根据所述语句集中每个语句的位置编号,确定最大位置编号,并计算所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值;
    根据每个所述差值绝对值和最大位置编号,确定所述语句集中每个语句的权重系数;
    根据所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值以及每个语句的权重系数,确定每个语句的位置表征指数。
  5. 根据权利要求3所述的公文摘要提取方法,其中,所述计算所述语句集中每个语句与所述主标题语句之间的相似度,包括:
    确定所述语句集中每个语句各自对应的文字个数,并确定所述主标题语句的标题字数;
    统计每个语句和所述主标题语句中相同文字的个数,得到每个语句各自对应的相同文字个数;
    根据所述标题字数以及每个语句各自对应的所述文字个数和所述相同文字个数,计算所述语句集中每个语句与所述主标题语句之间的相似度。
  6. 根据权利要求3所述的公文摘要提取方法,其中,所述根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值,包括:
    获取预设的第一权重系数和第二权重系数;
    根据所述第一权重系数和每个语句的位置表征指数,确定每个语句的第一重要程度值;
    根据所述第二权重系数以及每个语句与所述主标题语句之间的相似度,确定每个语句的第二重要程度值;
    根据每个语句的所述第一重要程度值和第二重要程度值,确定所述语句集中每个语句的重要程度值。
  7. 根据权利要求1至6中任一项所述的公文摘要提取方法,其中,所述基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集,包括:
    将所述第一候选摘要集与所述第二候选摘要集的交集写入空白的摘要结果集,以更新所述摘要结果集;
    从所述第二候选摘要集中去除所述交集,以更新所述第二候选摘要集;
    根据更新后的第二候选摘要集中每个语句的重要程度值,对更新后的第二候选摘要集中的每个语句进行排序;
    根据更新后的第二候选摘要集中每个语句的排序,依次将更新后的第二候选摘要集中的语句写入所述摘要结果集,直至所述摘要结果集中的语句个数达到预设语句个数。
  8. 一种公文摘要提取装置,其中,所述公文摘要提取装置包括:
    获取模块,用于获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
    第一提取模块,用于调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
    第二提取模块,用于并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
    摘要确定模块,用于基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
  9. 一种计算机设备,其中,所述计算机设备包括处理器、存储器、以及存储在所述存储器上并可被所述处理器执行的计算机程序,其中所述计算机程序被所述处理器执行时,实现如下步骤:
    获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
    调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
    并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
    基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
  10. 根据权利要求9所述的计算机设备,其中,所述调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,包括:
    调用预设的第一线程基于所述第一摘要提取层中的正则表达式从所述语句集中提取标题语句;以及
    从所述第一摘要提取层中获取所述语句集的公文类型标签对应的关键词集合,并从所述语句集中提取包含所述关键词集合中的关键词的关键语句。
  11. 根据权利要求9所述的计算机设备,其中,所述并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,包括:
    并发调用预设的第二线程根据所述语句集中每个语句的位置编号,计算每个语句的位 置表征指数;以及
    从所述语句集中获取主标题语句,并计算所述语句集中每个语句与所述主标题语句之间的相似度;
    根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值。
  12. 根据权利要求11所述的计算机设备,其中,所述根据所述语句集中每个语句的位置编号,计算每个语句的位置表征指数,包括:
    根据所述语句集中每个语句的位置编号,确定最大位置编号,并计算所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值;
    根据每个所述差值绝对值和最大位置编号,确定所述语句集中每个语句的权重系数;
    根据所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值以及每个语句的权重系数,确定每个语句的位置表征指数。
  13. 根据权利要求11所述的计算机设备,其中,所述计算所述语句集中每个语句与所述主标题语句之间的相似度,包括:
    确定所述语句集中每个语句各自对应的文字个数,并确定所述主标题语句的标题字数;
    统计每个语句和所述主标题语句中相同文字的个数,得到每个语句各自对应的相同文字个数;
    根据所述标题字数以及每个语句各自对应的所述文字个数和所述相同文字个数,计算所述语句集中每个语句与所述主标题语句之间的相似度。
  14. 根据权利要求11所述的计算机设备,其中,所述根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值,包括:
    获取预设的第一权重系数和第二权重系数;
    根据所述第一权重系数和每个语句的位置表征指数,确定每个语句的第一重要程度值;
    根据所述第二权重系数以及每个语句与所述主标题语句之间的相似度,确定每个语句的第二重要程度值;
    根据每个语句的所述第一重要程度值和第二重要程度值,确定所述语句集中每个语句的重要程度值。
  15. 根据权利要求9至14中任一项所述的计算机设备,其中,所述基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集,包括:
    将所述第一候选摘要集与所述第二候选摘要集的交集写入空白的摘要结果集,以更新所述摘要结果集;
    从所述第二候选摘要集中去除所述交集,以更新所述第二候选摘要集;
    根据更新后的第二候选摘要集中每个语句的重要程度值,对更新后的第二候选摘要集中的每个语句进行排序;
    根据更新后的第二候选摘要集中每个语句的排序,依次将更新后的第二候选摘要集中的语句写入所述摘要结果集,直至所述摘要结果集中的语句个数达到预设语句个数。
  16. 一种计算机可读存储介质,其中,所述计算机可读存储介质上存储有计算机程序,其中所述计算机程序被处理器执行时,实现如下步骤:
    获取语句集和预设的公文摘要抽取模型,其中,所述语句集包括根据待提取的公文文本确定的若干语句,所述公文摘要抽取模型包括第一摘要提取层、第二摘要提取层和摘要融合提取层;
    调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,并将所述标题语句和关键语句作为第一候选摘要集;以及
    并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,并根据每个语句的重要程度值确定第二候选摘要集;
    基于所述摘要融合提取层,根据所述第一候选摘要集和第二候选摘要集,确定所述公文文本的摘要结果集。
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述调用预设的第一线程基于所述第一摘要提取层从所述语句集中提取标题语句和关键语句,包括:
    调用预设的第一线程基于所述第一摘要提取层中的正则表达式从所述语句集中提取标题语句;以及
    从所述第一摘要提取层中获取所述语句集的公文类型标签对应的关键词集合,并从所述语句集中提取包含所述关键词集合中的关键词的关键语句。
  18. 根据权利要求16所述的计算机可读存储介质,其中,所述并发调用预设的第二线程基于所述第二摘要提取层计算所述语句集中每个语句的重要程度值,包括:
    并发调用预设的第二线程根据所述语句集中每个语句的位置编号,计算每个语句的位置表征指数;以及
    从所述语句集中获取主标题语句,并计算所述语句集中每个语句与所述主标题语句之间的相似度;
    根据每个语句与所述主标题语句之间的相似度和每个语句的位置表征指数,确定所述语句集中每个语句的重要程度值。
  19. 根据权利要求18所述的计算机可读存储介质,其中,所述根据所述语句集中每个语句的位置编号,计算每个语句的位置表征指数,包括:
    根据所述语句集中每个语句的位置编号,确定最大位置编号,并计算所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值;
    根据每个所述差值绝对值和最大位置编号,确定所述语句集中每个语句的权重系数;
    根据所述语句集中每个语句的位置编号与所述最大位置编号的差值绝对值以及每个语句的权重系数,确定每个语句的位置表征指数。
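  20. The computer-readable storage medium according to claim 18, wherein the calculating the similarity between each sentence in the sentence set and the main title sentence comprises:
    determining the number of characters in each sentence of the sentence set, and determining the number of characters in the main title sentence (the title character count);
    counting the characters that each sentence shares with the main title sentence to obtain the number of identical characters for each sentence;
    calculating the similarity between each sentence in the sentence set and the main title sentence according to the title character count and the character count and identical-character count of each sentence.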
PCT/CN2020/112348 2020-02-18 2020-08-31 Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract WO2021164231A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010100140.1 2020-02-18
CN202010100140.1A CN111460131A (zh) 2020-02-18 2020-02-18 Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract

Publications (1)

Publication Number Publication Date
WO2021164231A1 (zh) 2021-08-26

Family

ID=71681463

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112348 WO2021164231A1 (zh) 2020-02-18 2020-08-31 Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract

Country Status (2)

Country Link
CN (1) CN111460131A (zh)
WO (1) WO2021164231A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
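CN111460131A (zh) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract
CN112183077A (zh) * 2020-10-13 2021-01-05 京华信息科技股份有限公司 Pattern-recognition-based method and system for extracting official document abstracts
CN112231468A (zh) * 2020-10-15 2021-01-15 平安科技(深圳)有限公司 Information generation method, apparatus, electronic device, and storage medium
CN112507968B (zh) * 2020-12-24 2024-03-05 成都网安科技发展有限公司 Official document text recognition method and apparatus based on feature association
CN114201601B (zh) * 2021-12-10 2023-03-28 北京金堤科技有限公司 Abstract extraction method, apparatus, device, and computer storage medium for public opinion text
CN114201600A (zh) * 2021-12-10 2022-03-18 北京金堤科技有限公司 Abstract extraction method, apparatus, device, and computer storage medium for public opinion text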

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7017114B2 (en) * 2000-09-20 2006-03-21 International Business Machines Corporation Automatic correlation method for generating summaries for text documents
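CN101393545A (zh) * 2008-11-06 2009-03-25 新百丽鞋业(深圳)有限公司 Method for automatic summarization using an association model
CN101398814A (zh) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously extracting a document abstract and keywords
CN104636465A (zh) * 2015-02-10 2015-05-20 百度在线网络技术(北京)有限公司 Web page abstract generation method, display method, and corresponding apparatus
CN106599148A (zh) * 2016-12-02 2017-04-26 东软集团股份有限公司 Abstract generation method and apparatus
CN108509413A (zh) * 2018-03-08 2018-09-07 平安科技(深圳)有限公司 Automatic abstract extraction method, apparatus, computer device, and storage medium
CN110674296A (zh) * 2019-09-17 2020-01-10 上海仪电(集团)有限公司中央研究院 Keyword-based information abstract extraction method and system
CN111460131A (zh) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Method, apparatus, device, and computer-readable storage medium for extracting an official document abstract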

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
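CN113918708A (zh) * 2021-12-15 2022-01-11 深圳市迪博企业风险管理技术有限公司 Abstract extraction method
CN116501862A (zh) * 2023-06-25 2023-07-28 西安杰出科技有限公司 Automatic text excerpting system based on dynamic distributed aggregation
CN116501862B (zh) * 2023-06-25 2023-09-12 桂林电子科技大学 Automatic text excerpting system based on dynamic distributed aggregation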

Also Published As

Publication number Publication date
CN111460131A (zh) 2020-07-28


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20919709; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20919709; Country of ref document: EP; Kind code of ref document: A1)