WO2021169217A1 - Abstract extraction method, apparatus, device, and computer-readable storage medium

Abstract extraction method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2021169217A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
sentences
abstract
candidate
candidate set
Prior art date
Application number
PCT/CN2020/112340
Other languages
English (en)
French (fr)
Inventor
郑立颖
徐亮
阮晓雯
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2021169217A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G06F40/30 - Semantic analysis
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/34 - Browsing; Visualisation therefor
    • G06F16/345 - Summarisation for human users

Definitions

  • This application relates to the technical field of data processing, and in particular to an abstract extraction method, device, equipment, and computer-readable storage medium.
  • At present, summarization techniques fall into two broad categories: extractive and generative. Extractive summarization directly extracts important sentences from the text, then sorts and combines those sentences to output the final summary; generative summarization refines and condenses the original content, allowing new words or sentences to be generated to form the summary.
  • However, the inventors realized that generative summarization requires a large amount of annotated data, and summary annotation has no unified standard and is time-consuming, so it cannot accurately extract a text's summary.
  • The commonly used extractive method is TextRank, but the original TextRank method extracts the summary based only on sentence similarity, the extracted sentences are redundant, and the accuracy of summary extraction is low. How to improve the accuracy of summary extraction is therefore a problem that urgently needs to be solved.
  • The main purpose of this application is to provide an abstract extraction method, apparatus, device, and computer-readable storage medium, aiming to improve the accuracy of abstract extraction.
  • An abstract extraction method includes: obtaining a sentence set of a target text, where the target text is the text whose abstract is to be extracted; calculating the sentence similarity between every two sentences in the sentence set and, based on the TextRank algorithm, filtering a first summary candidate set out of the sentence set according to the sentence similarity; calculating the cosine similarity between every two sentences in the sentence set and, based on the TextRank algorithm, filtering a second summary candidate set out of the sentence set according to the cosine similarity; based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, filtering a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set; selecting a preset number of summary sentences from each of the first, second, third, and fourth summary candidate sets to form a fused summary candidate set; and counting the occurrences of each sentence in the fused summary candidate set and, according to those occurrence counts, filtering the summary result set of the target text out of the fused summary candidate set.
  • An abstract extraction apparatus includes:
  • an obtaining module, used to obtain a sentence set of a target text, where the target text is the text whose abstract is to be extracted;
  • a first summary screening module, used to calculate the sentence similarity between every two sentences in the sentence set and, based on the TextRank algorithm, filter a first summary candidate set out of the sentence set according to the sentence similarity;
  • a second summary screening module, used to calculate the cosine similarity between every two sentences in the sentence set and, based on the TextRank algorithm, filter a second summary candidate set out of the sentence set according to the cosine similarity;
  • a third summary screening module, used to filter, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set;
  • a selection module, used to select a preset number of summary sentences from each of the first, second, third, and fourth summary candidate sets to form a fused summary candidate set; and
  • a summary determination module, used to count the occurrences of each sentence in the fused summary candidate set and, according to those occurrence counts, filter the summary result set of the target text out of the fused summary candidate set.
  • A computer device includes a processor, a memory, and a computer program stored in the memory and executable by the processor; when the computer program is executed by the processor, the steps of the above abstract extraction method are implemented.
  • A computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the steps of the above abstract extraction method are implemented.
  • This application provides an abstract extraction method, device, equipment, and computer-readable storage medium, which can reduce the redundancy between extracted abstract sentences and effectively improve the accuracy of text abstract extraction.
  • FIG. 1 is a schematic flowchart of a method for extracting abstracts according to an embodiment of the application
  • FIG. 2 is a schematic flowchart of sub-steps of the abstract extraction method in FIG. 1;
  • FIG. 3 is a schematic flowchart of another abstract extraction method provided by an embodiment of the application.
  • FIG. 4 is a schematic block diagram of a device for extracting abstracts according to an embodiment of the application.
  • FIG. 5 is a schematic block diagram of another abstract extraction device provided by an embodiment of this application.
  • FIG. 6 is a schematic block diagram of the structure of a computer device related to an embodiment of the application.
  • the embodiments of the present application provide an abstract extraction method, device, equipment, and computer-readable storage medium.
  • the abstract extraction method can be applied to a server or a terminal device.
  • the server can be a single server or a server cluster composed of multiple servers.
  • The terminal device can be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device. The following description takes the server as an example.
  • FIG. 1 is a schematic flowchart of an abstract extraction method provided by an embodiment of the application.
  • the abstract extraction method includes steps S101 to S106.
  • Step S101 Obtain a sentence set of the target text, where the target text is the text of the abstract to be extracted.
  • When a user needs the abstract of a text, the text can be uploaded to the server through the terminal device; the server splits the received text into sentences to obtain an initial sentence set and cleans the initial sentence set, removing characters such as punctuation marks and stop words, to obtain the sentence set of the text whose abstract is to be extracted.
  • The server obtains the sentence set of the text to be summarized periodically or in real time.
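  • As an illustration of step S101, the following is a minimal Python sketch of the splitting and cleaning just described. The splitting regex, stop-word list, and whitespace tokenization are illustrative assumptions rather than the patent's specified implementation; a production system would use a proper word segmenter and a full stop-word list.

```python
import re

# Illustrative stop-word list; a real system would load a complete one.
STOP_WORDS = {"的", "了", "是", "在", "a", "an", "the", "of"}

def build_sentence_set(text: str) -> list[str]:
    """Step S101 (sketch): split the target text into sentences, then strip
    punctuation marks and stop words from each sentence."""
    # Split on common Chinese/English sentence-ending punctuation.
    raw = [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]
    cleaned = []
    for s in raw:
        s = re.sub(r"[^\w\s]", "", s)  # remove remaining punctuation marks
        # Assumes whitespace-tokenized text; Chinese would need a segmenter.
        words = [w for w in s.split() if w not in STOP_WORDS]
        if words:
            cleaned.append(" ".join(words))
    return cleaned
```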
  • Step S102 Calculate the sentence similarity between every two sentences in the sentence set and, based on the TextRank algorithm, filter the first summary candidate set out of the sentence set according to the sentence similarity.
  • Specifically, the number of words shared by every two sentences in the sentence set and the number of words contained in each sentence are counted; the sentence similarity of every two sentences is calculated from these counts; based on the TextRank algorithm, the first importance value of each sentence is determined from the sentence similarities; and the first summary candidate set is filtered out of the sentence set according to the first importance values. The first importance value characterizes how important a sentence is in the target text: the higher the value, the more important the sentence. Based on the TextRank algorithm, the first importance value of a sentence is computed as

    $$W_S(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} W_S(V_j)$$

  • where W_S(V_i) is the importance value of sentence V_i; w_ji is the weight of the edge between sentences V_j and V_i, namely the similarity between sentences S_i and S_j; d is the damping coefficient, representing the probability that a sentence points to any other sentence, optionally 0.85; In(V_i) is the set of sentences pointing to sentence V_i; Out(V_j) is the set of sentences pointed to by edges starting from sentence V_j; and the weight w_jk is the similarity between sentence S_j and any sentence in the set pointed to by edges starting from V_j.
  • the calculation formula for the sentence similarity between every two sentences in the sentence set is as follows:
  • $$\mathrm{Similarity}(S_i, S_j) = \frac{\left|\{t_k \mid t_k \in S_i \wedge t_k \in S_j\}\right|}{\log|S_i| + \log|S_j|}$$
  • where the numerator is the number of words appearing in both sentences S_i and S_j (each sentence comprising multiple words, t_k being the k-th word), |S_i| is the number of words contained in sentence S_i, and |S_j| is the number of words contained in sentence S_j.
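  • This similarity formula transcribes almost directly into Python; the sketch below is illustrative, with the guard against a zero denominator (two one-word sentences) as an added assumption.

```python
import math

def sentence_similarity(si: list[str], sj: list[str]) -> float:
    """Word-overlap similarity: |S_i ∩ S_j| / (log|S_i| + log|S_j|)."""
    common = len(set(si) & set(sj))  # words appearing in both sentences
    denom = math.log(len(si)) + math.log(len(sj))
    return common / denom if denom > 0 else 0.0
```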
  • The sentence similarity of every two sentences in the sentence set can thus be calculated with the similarity formula above, and the first importance value of each sentence with the importance formula above.
  • In one embodiment, the first summary candidate set is filtered out of the sentence set as follows: sort the sentences in the sentence set by their first importance values to obtain the first summary candidate set; or sort the sentences by their first importance values, fetch sentences from the sentence set in that order until the number of fetched sentences reaches a set number, and collect the fetched sentences to obtain the first summary candidate set.
  • the above-mentioned set number can be set based on actual conditions, which is not specifically limited in this application.
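  • Putting the two formulas together, a compact sketch of step S102 could run a fixed number of TextRank iterations over the similarity matrix and keep the top-ranked sentences. The helper names, iteration count, damping factor, and set-number default are illustrative assumptions; `sentence_similarity` is the helper sketched above, and the second return value (all scores) is kept for reuse later.

```python
def textrank_scores(sim: list[list[float]], d: float = 0.85,
                    iterations: int = 50) -> list[float]:
    """Iterate W_S(V_i) = (1-d) + d * sum_j (w_ji / sum_k w_jk) * W_S(V_j)."""
    n = len(sim)
    scores = [1.0] * n
    out_sums = [sum(row) for row in sim]  # sum_k w_jk for each sentence j
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(sim[j][i] / out_sums[j] * scores[j]
                                    for j in range(n)
                                    if j != i and out_sums[j] > 0)
                  for i in range(n)]
    return scores

def first_summary_candidate_set(sentences: list[str], set_number: int = 10):
    """Step S102 (sketch): rank sentences by their first importance value."""
    tokens = [s.split() for s in sentences]
    n = len(tokens)
    sim = [[0.0 if i == j else sentence_similarity(tokens[i], tokens[j])
            for j in range(n)] for i in range(n)]
    scores = textrank_scores(sim)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:set_number]], scores
```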
  • Step S103 Calculate the cosine similarity between every two sentences in the sentence set, and filter the second summary candidate set from the sentence set according to the cosine similarity based on the TextRank algorithm.
  • Specifically, each sentence in the sentence set is encoded to obtain the sentence vector corresponding to each sentence; the cosine similarity between every two sentences is calculated from these sentence vectors; based on the TextRank algorithm, the second importance value of each sentence is determined from the cosine similarities; and the second summary candidate set is filtered out of the sentence set according to the second importance values.
  • The second importance value characterizes how important a sentence is in the target text: the higher the value, the more important the sentence, and the lower the value, the less important. Based on the TextRank algorithm, the formula for the second importance value of a sentence is
  • $$W_S(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} W_S(V_j)$$
  • where W_S(V_i) is the importance value of sentence V_i; w_ji is the weight of the edge between sentences V_j and V_i, here the cosine similarity of the two sentences; d is the damping coefficient, representing the probability that a sentence points to any other sentence, optionally 0.85; In(V_i) and Out(V_j) are, respectively, the set of sentences pointing to sentence V_i and the set of sentences pointed to by edges starting from sentence V_j; and the weight w_jk is the similarity between sentence S_j and any sentence in the set pointed to by edges starting from V_j.
  • The cosine similarity of two sentences S_i and S_j is calculated as
  • $$\cos(S_i, S_j) = \frac{\vec{S_i} \cdot \vec{S_j}}{\lVert\vec{S_i}\rVert\,\lVert\vec{S_j}\rVert}$$
  • where $\vec{S_i}$ and $\vec{S_j}$ are the sentence vectors of sentences S_i and S_j, respectively.
  • the cosine similarity of every two sentences in the sentence set can be calculated by the above-mentioned similarity formula, and the second importance value of each sentence in the sentence set can be calculated by the calculation formula of the second importance value.
  • In one embodiment, the sentence vector of a sentence may be determined by encoding each word in the sentence to obtain each word's word vector, computing the average of those word vectors, and using the average word vector as the sentence vector of the sentence.
  • In one embodiment, the second summary candidate set is filtered out of the sentence set as follows: sort the sentences in the sentence set by their second importance values to obtain the second summary candidate set; or sort the sentences by their second importance values, fetch sentences from the sentence set in that order until the number of fetched sentences reaches a set number, and collect the fetched sentences to obtain the second summary candidate set.
  • the above-mentioned set number can be set based on actual conditions, which is not specifically limited in this application.
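  • The vector side of step S103 might be sketched as follows; the pretrained embedding lookup `word_vecs` and the 100-dimension fallback are assumptions for illustration. The second summary candidate set is then obtained by running the same `textrank_scores` helper over the cosine-similarity matrix instead of the word-overlap one.

```python
import numpy as np

def sentence_vector(words: list[str],
                    word_vecs: dict[str, np.ndarray],
                    dim: int = 100) -> np.ndarray:
    """Encode a sentence as the average of its word vectors."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(S_i, S_j) = (S_i · S_j) / (||S_i|| * ||S_j||)."""
    denom = float(np.linalg.norm(u) * np.linalg.norm(v))
    return float(u @ v) / denom if denom > 0 else 0.0
```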
  • Step S104 Based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences, a third summary candidate set is filtered out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set.
  • After the first and second summary candidate sets are obtained, the server filters, based on the Maximal Marginal Relevance (MMR) algorithm and the preset number of sentences, a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set.
  • the third summary candidate set is a subset of the first summary candidate set
  • the fourth summary candidate set is a subset of the second summary candidate set. It should be noted that the number of the aforementioned preset sentences can be set based on actual conditions, which is not specifically limited in this application. Through the MMR algorithm, the redundancy between sentences can be eliminated and the accuracy of abstract extraction can be improved.
  • step S104 includes sub-steps S1041 to S1047.
  • S1041: According to the first importance value of each sentence in the first summary candidate set, the sentences in the first summary candidate set are sorted, and each sentence's ranking number is obtained. Note that a sentence with a higher first importance value has a smaller ranking number, and a sentence with a lower first importance value has a larger ranking number.
  • S1042: Sentences whose ranking number is less than or equal to a preset ranking number are fetched from the first summary candidate set to form a candidate sentence set. The preset ranking number can be set based on actual conditions and is not specifically limited in this application; for example, if the preset ranking number is 10, sentences with a ranking number of 10 or less are fetched from the first summary candidate set to form the candidate sentence set.
  • S1043: The server fetches the sentence with the highest first importance value from the candidate sentence set and moves it into a preset, initially empty summary candidate set, thereby updating the summary candidate set and the candidate sentence set. For example, if the candidate sentence set contains five sentences A, B, C, D, and E, and sentence C has the highest first importance value, the updated summary candidate set contains sentence C and the updated candidate sentence set contains sentences A, B, D, and E.
  • S1044: Based on a preset MMR calculation formula and the first importance value of each sentence in the candidate sentence set, the MMR value between the summary candidate set and each sentence in the candidate sentence set is calculated.
  • The MMR value characterizes the degree of similarity between a sentence in the candidate sentence set and the summary candidate set; the preset MMR calculation formula is
  • $$\mathrm{MMR}_i = \alpha \cdot W_s(V_i) - (1-\alpha) \cdot \mathrm{sim}(i, set)$$
  • where MMR_i is the MMR value of sentence V_i; α is a weight coefficient, optionally in the range 0 to 1; W_s(V_i) is the first importance value of sentence V_i; set denotes the summary candidate set; and sim(i, set) is the semantic similarity between sentence V_i and the summary candidate set.
  • With the first importance value of each sentence in the candidate sentence set and the formula above, the MMR value between the summary candidate set and each sentence in the candidate sentence set can be calculated.
  • Specifically, the summary candidate set is encoded to obtain its corresponding vector; each sentence in the candidate sentence set is encoded to obtain that sentence's corresponding vector; the semantic similarity between the summary candidate set's vector and each candidate sentence's vector is calculated; and, based on the MMR formula, the MMR value between the summary candidate set and each candidate sentence is calculated from that semantic similarity and the sentence's first importance value. For example, if a sentence in the candidate sentence set has first importance value x and its similarity with the summary candidate set is s, the MMR value between that sentence and the summary candidate set is α·x-(1-α)·s.
  • The vector of the summary candidate set is obtained by encoding each sentence in the summary candidate set to get each sentence's vector, computing the average of those sentence vectors, and using the average vector as the vector of the summary candidate set.
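  • A sketch of this value computation follows, with `cosine_similarity` from the earlier sketch standing in for the semantic similarity and α = 0.7 as an arbitrary choice in the stated 0 to 1 range.

```python
def mmr_value(importance: float,
              sent_vec: np.ndarray,
              summary_vecs: list[np.ndarray],
              alpha: float = 0.7) -> float:
    """MMR_i = alpha * W_s(V_i) - (1 - alpha) * sim(i, set); the summary
    candidate set is encoded as the mean of its sentence vectors."""
    set_vec = np.mean(summary_vecs, axis=0)
    return alpha * importance - (1 - alpha) * cosine_similarity(sent_vec, set_vec)
```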
  • S1045: After the MMR value between the summary candidate set and each candidate sentence is calculated, the server moves the sentence with the highest MMR value into the summary candidate set, thereby updating the summary candidate set and the candidate sentence set. For example, if the summary candidate set contains sentence C, the candidate sentence set contains sentences A, B, D, and E, and the sentence with the highest MMR value is sentence E, then the updated summary candidate set contains sentences C and E and the updated candidate sentence set contains sentences A, B, and D.
  • S1046: The server determines whether the number of sentences in the updated summary candidate set reaches the preset number of sentences. If it does not, sub-step S1044 is executed again: based on the preset MMR calculation formula and the first importance value of each sentence in the candidate sentence set, the MMR value between the summary candidate set and each candidate sentence is recalculated. The preset number of sentences can be set based on actual conditions and is not specifically limited in this application.
  • S1047: If the number of sentences in the updated summary candidate set reaches the preset number of sentences, the updated summary candidate set is taken as the third summary candidate set.
  • For example, if the preset number of sentences is 5 and the updated summary candidate set contains the five sentences A, B, C, D, and E, the number of sentences in the summary candidate set has reached the preset number, so the summary candidate set containing sentences A, B, C, D, and E is taken as the third summary candidate set.
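  • Sub-steps S1041 to S1047 amount to a greedy selection loop. The condensed, index-based sketch below is illustrative (the `preset_rank`, `preset_count`, and α defaults are assumptions; `cosine_similarity` is from the earlier sketch):

```python
def mmr_select(sentences: list[str], scores: list[float],
               vectors: list[np.ndarray], preset_count: int = 5,
               preset_rank: int = 10, alpha: float = 0.7) -> list[str]:
    """Greedy MMR filtering of a summary candidate set (S1041-S1047)."""
    # S1041-S1042: keep the preset_rank best-ranked sentences as the pool.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    pool = ranked[:preset_rank]
    # S1043: seed the summary candidate set with the most important sentence.
    summary = [pool.pop(0)]
    # S1044-S1046: repeatedly move the highest-MMR sentence across.
    while pool and len(summary) < preset_count:
        set_vec = np.mean([vectors[i] for i in summary], axis=0)
        best = max(pool, key=lambda i: alpha * scores[i] -
                   (1 - alpha) * cosine_similarity(vectors[i], set_vec))
        summary.append(best)
        pool.remove(best)
    # S1047: the filled summary candidate set is the third (or fourth) set.
    return [sentences[i] for i in summary]
```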
  • It can be understood that the fourth summary candidate set is extracted in the same way as the third, specifically: according to the second importance value of each sentence in the second summary candidate set, sort the sentences and obtain each sentence's ranking number; fetch sentences whose ranking number is less than or equal to the preset ranking number from the second summary candidate set to form a candidate sentence set; move the sentence with the highest second importance value into a blank summary candidate set, updating the summary candidate set and candidate sentence set; based on the preset MMR calculation formula and the second importance value of each sentence in the candidate sentence set, calculate the MMR value between the summary candidate set and each candidate sentence, where the MMR value characterizes the similarity between a candidate sentence and the summary candidate set; move the sentence with the highest MMR value into the summary candidate set, updating both sets; and determine whether the number of sentences in the updated summary candidate set reaches the preset number of sentences. If it does not, recalculate the MMR values as above; if it does, take the updated summary candidate set as the fourth summary candidate set.
  • Step S105 Select a preset number of summary sentences from each of the first, second, third, and fourth summary candidate sets to form a fused summary candidate set.
  • After the server obtains the four summary candidate sets (the first, second, third, and fourth), it selects a preset number of summary sentences from each of them to form a fused summary candidate set. It should be noted that the preset number of summary sentences is smaller than the preset number of sentences and can be set based on actual conditions; this application does not specifically limit it.
  • In one embodiment, the sentences in the first, second, third, and fourth summary candidate sets are each sorted by importance value, and, following each sentence's rank, the preset number of summary sentences is selected from each of the four sets and written into the fused summary candidate set; a sentence with a larger importance value ranks earlier, and one with a smaller value ranks later.
  • For example, suppose the first summary candidate set is [A, B, C, D, E, F, G, H, I, J], the second is [A, B, C, D, E, G, H, I, J, K], the third is [C, D, E, F, G, H, I], the fourth is [D, E, G, H, I, J, K], and the preset number of summary sentences is 5. Then the sentences selected from the first summary candidate set are [A, B, C, D, E], from the second [A, B, C, D, E], from the third [C, D, E, F, G], and from the fourth [D, E, G, H, I]; the fused summary candidate set is therefore {[A, B, C, D, E], [A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I]}.
  • Step S106 Count the occurrences of each sentence in the fused summary candidate set, and filter the summary result set of the target text out of the fused summary candidate set according to those occurrence counts.
  • After the fused summary candidate set is obtained, the occurrences of each sentence in it are counted, and the summary result set of the target text is filtered out of the fused summary candidate set according to those counts; that is, the sentences whose occurrence count is greater than or equal to a preset count are filtered out as the summary result set of the target text. The occurrence count is the number of times a sentence appears in the fused summary candidate set.
  • For example, if the fused summary candidate set is {[A, B, C, D, E], [A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I]}, then sentence A occurs 2 times, sentence B 2 times, sentence C 3 times, sentence D 4 times, sentence E 4 times, sentence F 1 time, sentence G 2 times, sentence H 1 time, and sentence I 1 time.
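  • Steps S105 and S106 together act as a simple voting scheme over the four candidate sets. A minimal sketch, assuming the "greater than or equal to a preset count" rule stated above and illustrative defaults:

```python
from collections import Counter

def fuse_and_vote(candidate_sets: list[list[str]],
                  n_summary: int = 5, min_count: int = 2) -> list[str]:
    """S105-S106 (sketch): take the top n_summary sentences from each
    candidate set, count occurrences, and keep the frequent sentences."""
    fused = [s for cand in candidate_sets for s in cand[:n_summary]]
    counts = Counter(fused)
    return [s for s, c in counts.most_common() if c >= min_count]
```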
  • The abstract extraction method provided by the above embodiment uses the TextRank algorithm to filter a first summary candidate set out of the sentence set according to the sentence similarity between every two sentences, and uses the TextRank algorithm to filter a second summary candidate set out of the sentence set according to the cosine similarity between every two sentences; it then filters, based on the maximal marginal relevance MMR algorithm and the preset number of sentences, a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set; finally, it fuses the four summary candidate sets to determine the summary result set of the text. This reduces the redundancy between the extracted summary sentences and effectively improves the accuracy of text summary extraction.
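  • Tying the sketches above together, a hypothetical end-to-end run of this embodiment might look as follows; the word-vector lookup and all defaults remain illustrative assumptions, and sentences are assumed to be unique strings.

```python
def extract_summary(text: str, word_vecs: dict[str, np.ndarray]) -> list[str]:
    """End-to-end sketch of steps S101-S106 using the helpers above."""
    sentences = build_sentence_set(text)                          # S101
    first_set, w_scores = first_summary_candidate_set(sentences)  # S102
    # S103: the same TextRank, over cosine similarity of sentence vectors.
    vecs = [sentence_vector(s.split(), word_vecs) for s in sentences]
    n = len(sentences)
    cos_sim = [[0.0 if i == j else cosine_similarity(vecs[i], vecs[j])
                for j in range(n)] for i in range(n)]
    c_scores = textrank_scores(cos_sim)
    ranked = sorted(range(n), key=lambda i: c_scores[i], reverse=True)
    second_set = [sentences[i] for i in ranked[:10]]
    # S104: MMR filtering of both candidate sets (indexing by sentence).
    idx = {s: i for i, s in enumerate(sentences)}
    third_set = mmr_select(first_set, [w_scores[idx[s]] for s in first_set],
                           [vecs[idx[s]] for s in first_set])
    fourth_set = mmr_select(second_set, [c_scores[idx[s]] for s in second_set],
                            [vecs[idx[s]] for s in second_set])
    # S105-S106: fuse the four candidate sets and vote by occurrence count.
    return fuse_and_vote([first_set, second_set, third_set, fourth_set])
```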
  • FIG. 3 is a schematic flowchart of another abstract extraction method provided by an embodiment of the application.
  • the abstract extraction method includes steps S201 to S208.
  • Step S201 Obtain a sentence set of the target text, where the target text is the text of the abstract to be extracted.
  • When a user needs the abstract of a text, the text can be uploaded to the server through the terminal device; the server splits the received text into sentences to obtain an initial sentence set and cleans the initial sentence set, removing characters such as punctuation marks and stop words, to obtain the sentence set of the text whose abstract is to be extracted.
  • The server obtains the sentence set of the text to be summarized periodically or in real time.
  • Step S202 Calculate the sentence similarity between every two sentences in the sentence set and, based on the TextRank algorithm, filter the first summary candidate set out of the sentence set according to the sentence similarity.
  • Step S203 Calculate the cosine similarity between every two sentences in the sentence set, and filter the second summary candidate set from the sentence set according to the cosine similarity based on the TextRank algorithm.
  • Step S204 Based on the maximum marginal correlation MMR algorithm and the number of preset sentences, a third abstract candidate set is screened out from the first abstract candidate set and a fourth abstract candidate set is screened out from the second abstract candidate set.
  • After the first and second summary candidate sets are obtained, the server filters, based on the Maximal Marginal Relevance (MMR) algorithm and the preset number of sentences, a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set.
  • the third summary candidate set is a subset of the first summary candidate set
  • the fourth summary candidate set is a subset of the second summary candidate set.
  • Step S205 Select a preset number of summary sentences from each of the first, second, third, and fourth summary candidate sets to form a fused summary candidate set.
  • After the server obtains the four summary candidate sets (the first, second, third, and fourth), it selects a preset number of summary sentences from each of them to form a fused summary candidate set. It should be noted that the preset number of summary sentences is smaller than the preset number of sentences and can be set based on actual conditions; this application does not specifically limit it.
  • Step S206 Count the occurrences of each sentence in the fused summary candidate set, and determine whether the number of sentences whose occurrence count exceeds the preset count is greater than or equal to the preset number of summary sentences.
  • After the fused summary candidate set is obtained, the occurrences of each sentence in it are counted, and it is determined whether the number of sentences whose occurrence count exceeds the preset count is greater than or equal to the preset number of summary sentences.
  • The occurrence count is the number of times a sentence appears in the fused summary candidate set. It should be noted that the number of summary sentences can be set based on actual conditions and is not specifically limited in this application.
  • Step S207 If the number of sentences whose occurrence count exceeds the preset count is greater than or equal to the preset number of summary sentences, sort the sentences in the fused summary candidate set by occurrence count.
  • If that condition holds, the sentences in the fused summary candidate set are sorted by occurrence count. It should be noted that a sentence with more occurrences ranks earlier and a sentence with fewer occurrences ranks later.
  • In one embodiment, if the number of sentences whose occurrence count exceeds the preset count is smaller than the preset number of summary sentences, the sentences in the fused summary candidate set whose occurrence count exceeds the preset count are moved into the summary result set of the target text, updating the fused summary candidate set; the importance value of each sentence in the updated fused summary candidate set is obtained and the remaining sentences are sorted by importance value; then, following that order, sentences are selected from the updated fused summary candidate set and written into the summary result set until the number of sentences in the summary result set reaches the preset number of summary sentences.
  • Step S208 According to the rank of each sentence in the fused summary candidate set, select sentences from the fused summary candidate set and write them into the summary result set of the target text in turn, until the number of sentences in the summary result set reaches the preset number of summary sentences.
  • After the sentences in the fused summary candidate set are sorted, sentences are selected from the fused summary candidate set in that order and written into the summary result set of the target text until the number of sentences in the summary result set reaches the preset number of summary sentences. For example, if the fused summary candidate set is {[A, B, C, D, E], [A, B, C, D, E], [C, D, E, F, G], [D, E, G, H, I]}, then sentence A occurs 2 times, sentence B 2 times, sentence C 3 times, sentence D 4 times, sentence E 4 times, sentence F 1 time, sentence G 2 times, sentence H 1 time, and sentence I 1 time, so the sentences in the fused summary candidate set are ordered [D, E, C, A, B, G, F, H, I]. If the number of summary sentences is 5 and the preset occurrence count is 2, the summary result set of the target text is [D, E, C, A, B].
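  • A sketch covering steps S206 to S208 together with the fallback branch described above; the `importance` score map and the defaults are illustrative assumptions.

```python
from collections import Counter

def build_result_set(candidate_sets: list[list[str]],
                     importance: dict[str, float],
                     n_summary: int = 5, min_count: int = 2) -> list[str]:
    """S206-S208 (sketch): order fused sentences by occurrence count; if too
    few sentences pass the threshold, top up by importance value."""
    fused = [s for cand in candidate_sets for s in cand[:n_summary]]
    counts = Counter(fused)
    frequent = [s for s, c in counts.most_common() if c > min_count]
    if len(frequent) >= n_summary:  # S207-S208
        return [s for s, _ in counts.most_common()][:n_summary]
    # Fallback branch: move the frequent sentences into the result set,
    # then fill the rest in importance order until n_summary is reached.
    rest = sorted((s for s in counts if s not in frequent),
                  key=lambda s: importance.get(s, 0.0), reverse=True)
    return (frequent + rest)[:n_summary]
```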
  • The abstract extraction method provided by the above embodiment uses the TextRank algorithm to filter out a first summary candidate set according to the sentence similarity between every two sentences in the sentence set, and the TextRank algorithm to filter out a second summary candidate set according to the cosine similarity between every two sentences; it then filters, based on the MMR algorithm, a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set, and selects a preset number of summary sentences from these four candidate sets to form a fused summary candidate set; finally, it counts the occurrences of each sentence in the fused summary candidate set and, when the number of sentences whose occurrence count exceeds the preset count is at least the preset number of summary sentences, selects sentences from the fused summary candidate set in order of occurrence count and writes them into the summary result set of the target text. This reduces the redundancy between the extracted summary sentences and effectively improves the accuracy of text summary extraction.
  • FIG. 4 is a schematic block diagram of an abstract extraction device provided by an embodiment of the application.
  • As shown in FIG. 4, the abstract extraction apparatus 300 includes an acquisition module 301, a first summary screening module 302, a second summary screening module 303, a third summary screening module 304, a selection module 305, and a summary determination module 306.
  • the acquisition module 301 is used to acquire the sentence set of the target text, where the target text is the text of the summary to be extracted;
  • the first summary screening module 302 is configured to calculate the sentence similarity between every two sentences in the sentence set and, based on the TextRank algorithm, filter the first summary candidate set out of the sentence set according to the sentence similarity;
  • the second summary screening module 303 is configured to calculate the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filter the second summary candidate set from the sentence set according to the cosine similarity;
  • the third summary screening module 304 is configured to filter, based on the maximal marginal relevance MMR algorithm and the preset number of sentences, the third summary candidate set out of the first summary candidate set and the fourth summary candidate set out of the second summary candidate set;
  • the selection module 305 is configured to select a preset number of summary sentences from each of the first, second, third, and fourth summary candidate sets to form a fused summary candidate set;
  • the summary determination module 306 is configured to count the occurrences of each sentence in the fused summary candidate set and, according to those occurrence counts, filter the summary result set of the target text out of the fused summary candidate set.
  • the first summary screening module 302 is further configured to:
  • count the number of words shared by every two sentences in the sentence set and the number of words contained in each sentence; calculate the sentence similarity of every two sentences from those counts; and determine, based on the TextRank algorithm, the first importance value of each sentence from the sentence similarities, where the first importance value characterizes the importance of a sentence in the target text;
  • filter the first summary candidate set out of the sentence set according to the first importance value of each sentence.
  • the second summary screening module 303 is further configured to:
  • encode each sentence in the sentence set to obtain each sentence's sentence vector; calculate the cosine similarity between every two sentences from those sentence vectors; and determine, based on the TextRank algorithm, the second importance value of each sentence from the cosine similarities, where the second importance value characterizes the importance of a sentence in the target text;
  • filter the second summary candidate set out of the sentence set according to the second importance value of each sentence.
  • the third summary screening module 304 is further configured to:
  • sort the sentences in the first summary candidate set by their first importance values and obtain each sentence's ranking number; fetch sentences whose ranking number is less than or equal to the preset ranking number to form a candidate sentence set; and move the sentence with the highest first importance value into a blank summary candidate set, updating the summary candidate set and candidate sentence set;
  • calculate, based on the preset MMR calculation formula and the first importance value of each sentence in the candidate sentence set, the MMR value between the summary candidate set and each candidate sentence, where the MMR value characterizes the degree of similarity between a sentence in the candidate sentence set and the summary candidate set; move the sentence with the highest MMR value into the summary candidate set, updating both sets; and determine whether the number of sentences in the updated summary candidate set reaches the preset number of sentences, recalculating the MMR values if it does not;
  • if the number of sentences in the updated summary candidate set reaches the preset number of sentences, take the updated summary candidate set as the third summary candidate set.
  • the third summary screening module 304 is further configured to:
  • encode the summary candidate set to obtain its corresponding vector; encode each sentence in the candidate sentence set to obtain each sentence's corresponding vector; calculate the semantic similarity between the summary candidate set's vector and each candidate sentence's vector; and calculate, from each semantic similarity and the first importance value of each candidate sentence, the MMR value between the summary candidate set and each sentence in the candidate sentence set.
  • FIG. 5 is a schematic block diagram of another abstract extraction device provided by an embodiment of the application.
  • As shown in FIG. 5, the abstract extraction apparatus 400 includes an acquisition module 401, a first summary screening module 402, a second summary screening module 403, a third summary screening module 404, a selection module 405, a determination module 406, a sorting module 407, and a summary determination module 408.
  • the obtaining module 401 is configured to obtain a sentence set of a target text, where the target text is the text of a summary to be extracted;
  • the first summary screening module 402 is configured to calculate the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filter the first summary candidate set from the sentence set according to the sentence similarity;
  • the second summary screening module 403 is configured to calculate the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filter the second summary candidate set from the sentence set according to the cosine similarity;
  • the third summary screening module 404 is configured to filter, based on the maximal marginal relevance MMR algorithm and the preset number of sentences, the third summary candidate set out of the first summary candidate set and the fourth summary candidate set out of the second summary candidate set;
  • the selection module 405 is configured to select a preset number of summary sentences from each of the first, second, third, and fourth summary candidate sets to form a fused summary candidate set;
  • the determination module 406 is configured to determine whether the number of sentences whose occurrence count exceeds the preset count is greater than or equal to the preset number of summary sentences;
  • the sorting module 407 is configured to sort the sentences in the fused summary candidate set by occurrence count if the number of sentences whose occurrence count exceeds the preset count is greater than or equal to the preset number of summary sentences;
  • the summary determination module 408 is configured to select sentences from the fused summary candidate set, according to each sentence's rank in the fused summary candidate set, and write them into the summary result set of the target text in turn, until the number of sentences in the summary result set reaches the preset number of summary sentences.
  • In one embodiment, the summary determination module 408 is further configured to: if the number of sentences whose occurrence count exceeds the preset count is smaller than the preset number of summary sentences, move the sentences in the fused summary candidate set whose occurrence count exceeds the preset count into the summary result set of the target text, updating the fused summary candidate set; obtain the importance value of each sentence in the updated fused summary candidate set and sort its sentences by importance value; and, following that order, select sentences from the updated fused summary candidate set and write them into the summary result set until the number of sentences in the summary result set reaches the preset number of summary sentences.
  • the apparatus provided in the foregoing embodiment may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 6.
  • FIG. 6 is a schematic block diagram of a structure of a computer device provided by an embodiment of the application.
  • the computer device can be a server or a terminal device.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • The computer program includes program instructions which, when executed, can cause the processor to perform the steps of the abstract extraction method described above.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when that computer program is executed by the processor, the processor can perform any of the abstract extraction methods described herein.
  • the network interface is used for network communication, such as sending assigned tasks.
  • Those skilled in the art can understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • The embodiments of the present application also provide a computer-readable storage medium storing a computer program; the computer-readable storage medium may be non-volatile or volatile, and the computer program includes program instructions which, when executed, implement the steps of the abstract extraction method described above.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, for example, the hard disk or memory of the computer device.
  • The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device.
  • In another embodiment, to further ensure the privacy and security of all the data mentioned above, the data may also be stored in a node of a blockchain; for example, the first summary candidate set, the second summary candidate set, the target text, and so on can all be stored in blockchain nodes.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.


Abstract

An abstract extraction method, apparatus, device, and computer-readable storage medium. The method includes: calculating the sentence similarity between every two sentences in a sentence set and, based on the TextRank algorithm and the sentence similarity, filtering a first summary candidate set out of the sentence set (S102); calculating the cosine similarity between every two sentences in the sentence set and, based on the TextRank algorithm and the cosine similarity, filtering a second summary candidate set out of the sentence set (S103); based on the MMR algorithm and a preset number of sentences, filtering a third summary candidate set out of the first summary candidate set and a fourth summary candidate set out of the second summary candidate set (S104); selecting a preset number of summary sentences from each of the four summary candidate sets to form a fused summary candidate set (S105); and counting the occurrences of each sentence in the fused summary candidate set and, according to those occurrence counts, filtering the summary result set of the target text out of the fused summary candidate set (S106). The method relates to data processing and can improve the accuracy of abstract extraction.

Description

Abstract extraction method, apparatus, device, and computer-readable storage medium
This application claims priority to the Chinese patent application with application number CN202010125189.2, filed with the China Patent Office on February 27, 2020 and entitled "Abstract extraction method, apparatus, device, and computer-readable storage medium" (摘要提取方法、装置、设备及计算机可读存储介质), the entire contents of which are incorporated herein by reference.
技术领域
本申请涉及数据处理的技术领域,尤其涉及一种摘要提取方法、装置、设备及计算机可读存储介质。
背景技术
目前,摘要技术主要分为抽取式和生成式两大类,抽取式指直接从文中抽取重要的句子,再将句子进行排序组合后输出作为最终的摘要;生成式是指根据原文内容进行提炼总结,允许有新的词语或者句子生成来形成摘要。然而,发明人意识到生成式摘要需要大量的标注数据,而摘要的标注没有统一的标准且比较耗时,无法准确的提取文本的摘要,而常用的抽取式摘要方法是TextRank,但是原始TextRank方法只是基于句子的相似度抽取摘要,且抽取出的句子存在冗余性,摘要提取的准确性较低。因此,如何提高摘要提取的准确性是目前亟待解决的问题。
发明内容
本申请的主要目的在于提供一种摘要提取方法、装置、设备及计算机可读存储介质,旨在提高摘要提取的准确性。
一种摘要提取方法,该方法包括:
获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本;
计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集;
计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集;
基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集;
分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集;
统计所述融合摘要候选集中各语句的出现次数,并根据各语句的出现次数,从所述融合摘要候选集中筛选出所述目标文本的摘要结果集。
一种摘要提取装置,该装置包括:
获取模块,用于获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本;
第一摘要筛选模块,用于计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集;
第二摘要筛选模块,用于计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集;
第三摘要筛选模块,用于基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集;
选择模块,用于分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集;
摘要确定模块,用于统计所述融合摘要候选集中各语句的出现次数,并根据各语句的 出现次数,从所述融合摘要候选集中筛选出所述目标文本的摘要结果集。
一种计算机设备,所述计算机设备包括处理器、存储器、以及存储在所述存储器上并可被所述处理器执行的计算机程序,其中所述计算机程序被所述处理器执行时,实现如下步骤:
获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本;
计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集;
计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集;
基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集;
分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集;
统计所述融合摘要候选集中各语句的出现次数,并根据各语句的出现次数,从所述融合摘要候选集中筛选出所述目标文本的摘要结果集。
一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,其中所述计算机程序被处理器执行时,实现如下步骤:
获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本;
计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集;
计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集;
基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集;
分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集;
统计所述融合摘要候选集中各语句的出现次数,并根据各语句的出现次数,从所述融合摘要候选集中筛选出所述目标文本的摘要结果集。
本申请提供一种摘要提取方法、装置、设备及计算机可读存储介质,可以降低提取到的摘要语句之间的冗余性,有效的提高文本摘要提取的准确性。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种摘要提取方法的流程示意图;
图2为图1中的摘要提取方法的子步骤流程示意图;
图3为本申请实施例提供的另一种摘要提取方法的流程示意图;
图4为本申请实施例提供的一种摘要提取装置的示意性框图;
图5为本申请实施例提供的另一种摘要提取装置的示意性框图;
图6为本申请一实施例涉及的计算机设备的结构示意框图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地 描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
附图中所示的流程图仅是示例说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解、组合或部分合并,因此实际执行的顺序有可能根据实际情况改变。
本申请实施例提供一种摘要提取方法、装置、设备及计算机可读存储介质。其中,该摘要提取方法可应用于服务器或终端设备中,该服务器可以为单台的服务器,也可以为由多台服务器组成的服务器集群,该终端设备可以手机、平板电脑、笔记本电脑、台式电脑、个人数字助理和穿戴式设备等电子设备。以下以服务器为例进行说明。
下面结合附图,对本申请的一些实施方式作详细说明。在不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
请参照图1,图1为本申请的实施例提供的一种摘要提取方法的流程示意图。
如图1所示,该摘要提取方法包括步骤S101至步骤S106。
步骤S101、获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本。
当用户需要提取文本中的摘要时,可以通过终端设备将待提取摘要的文本上传至服务器,服务器对接收到的待提取摘要的文本进行语句拆分,得到初始语句集,并对初始语句集进行清洗,以去除初始语句集中的标点符号和停用词等字符,得到待提取摘要的文本的语句集。服务器定时或实时获取待提取摘要文本的语句集。
步骤S102、计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集。
在获取到目标文本的语句集之后,计算该语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据该语句集中每两个语句之间的句子相似度,从该语句集中筛选出第一摘要候选集。
具体地,统计该语句集中每两个语句的相同词的数量和语句集中每个语句包含的词的个数;根据语句集中每两个语句的相同词的数量和语句集中每个语句包含的词的个数,计算语句集中每两个语句的句子相似度;基于TextRank算法,根据语句集中每两个语句之间的句子相似度,确定每个语句的第一重要性值;根据语句集中每个语句的第一重要性值,从语句集中筛选出第一摘要候选集。其中,第一重要性值用于表征语句在目标文本中的重要程度,第一重要性值越高的语句在目标文本中的重要程度越高,第一重要性值越低的语句在目标文本中的重要程度越低,基于TextRank算法,计算语句的第一重要性值的公式为:
Figure PCTCN2020112340-appb-000001
其中,等式左侧的W S(V i)表示语句V i的重要性值,w ji为语句V i到语句V j的边的权值,d为阻尼系数,代表某一语句指向其他任意语句的概率,可选为0.85,In(V i)和Out(V j)分别为指向语句V i的语句集合和从语句V j出发的边指向的语句集合,权值w ji是两个语句S i和S j的相似度,权值w jk是从语句V j出发的边指向的语句集合中的任意一个语句与语句S j的相似度。该语句集中每两个语句之间的句子相似度的计算公式如下所示:
Figure PCTCN2020112340-appb-000002
其中,{t k∨t k∈S i∧t k∈S j}为两个语句S i和S j中都出现的词的数量,S i和S j包括多个词,t k是第k个词,|S i|是语句S i中包含的词的个数,|S j|是语句S j中包含的词的个数。通过上述相似度公式即可计算得到语句集中每两个语句的句子相似度,通过第一重要性值的计算公式即可计算得到语句集每个语句的第一重要性值。
在一实施例中,根据语句集中每个语句的第一重要性值,从语句集中筛选出第一摘要候选集的方式具体为:按照该语句集中每个语句的第一重要性值的高低,对该语句集中的每个语句进行排序,得到第一摘要候选集,或者,按照该语句集中每个语句的第一重要性值的高低,对该语句集中的每个语句进行排序,并按照排序的先后顺序,依次从该语句集中获取语句,直到获取到的语句的个数达到设定的个数,汇集获取到的每个语句,从而得到第一摘要候选集。需要说明的是,上述设定的个数可基于实际情况进行设置,本申请对此不作具体限定。
步骤S103、计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集。
在获取到目标文本的语句集之后,计算该语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据该语句集中每两个语句之间的余弦相似度,从该语句集中筛选出第一摘要候选集。
具体地,对该语句集中的每个语句进行编码,得到语句集中的每个语句各自对应的语句向量;根据语句集中的每个语句各自对应的语句向量,计算语句集中每两个语句之间的余弦相似度;基于TextRank算法,根据语句集中每两个语句之间的余弦相似度,确定每个语句的第二重要性值;根据该语句集中每个语句的第二重要性值,从语句集中筛选出第二摘要候选集。其中,第二重要性值用于表征语句在目标文本中的重要程度,第二重要性值越高的语句在目标文本中的重要程度越高,第二重要性值越低的语句在目标文本中的重要程度越低,基于TextRank算法,计算语句的第二重要性值的公式为:
Figure PCTCN2020112340-appb-000003
其中,等式左侧的W S(V i)表示语句V i的重要性值,D ji为语句V i到语句V j的边的权值,d为阻尼系数,代表某一语句指向其他任意语句的概率,可选为0.85,In(V i)和Out(V j)分别为指向语句V i的语句集合和从语句V j出发的边指向的语句集合,权值w ji是两个语句S i和S j的相似度,权值w jk是从语句V j出发的边指向的语句集合中的任意一个语句与语句S j的相似度。
其中,两个语句S i和S j的余弦相似度的计算公式为:
Figure PCTCN2020112340-appb-000004
其中,
Figure PCTCN2020112340-appb-000005
为语句S i的语句向量,
Figure PCTCN2020112340-appb-000006
为语句S j的语句向量。通过上述相似度公式即可计算得到语句集中每两个语句的余弦相似度,通过第二重要性值的计算公式即可计算得到语句集每个语句的第二重要性值。
在一实施例中,语句的语句向量的确定方式可以为:对语句中的每个词进行编码,得到每个词各自对应的词向量,并根据每个词对应的词向量,计算平均词向量,且将该平均词向量作为该语句的语句向量。
在一实施例中,根据语句集中每个语句的第二重要性值,从语句集中筛选出第二摘要候选集的方式具体为:按照该语句集中每个语句的第二重要性值的高低,对该语句集中的每个语句进行排序,得到第二摘要候选集,或者,按照该语句集中每个语句的第二重要性值的高低,对该语句集中的每个语句进行排序,并按照排序的先后顺序,依次从该语句集中获取语句,直到获取到的语句的个数达到设定的个数,汇集获取到的每个语句,从而得到第二摘要候选集。需要说明的是,上述设定的个数可基于实际情况进行设置,本申请对此不作具体限定。
步骤S104、基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集。
在筛选得到第一摘要候选集和第二摘要候选集之后,服务器基于最大边界相关 (Maximal Marginal Relevance,MMR)算法和预设语句个数,从第一摘要候选集中筛选出第三摘要候选集以及从第二摘要候选集中筛选出第四摘要候选集。其中,第三摘要候选集为第一摘要候选集的子集,第四摘要候选集为第二摘要候选集的子集。需要说明的是,上述预设语句个数可基于实际情况进行设置,本申请对此不作具体限定。通过MMR算法,可以消除语句之间的冗余性,提高摘要提取的准确性。
在一实施例中,如图2所示,步骤S104包括子步骤S1041至步骤S1047。
S1041、根据所述第一摘要候选集中每个语句的所述第一重要性值,对所述第一摘要候选集中每个语句进行排序,并获取每个语句的排序编号。
按照第一摘要候选集中每个语句的第一重要性值的高低,对第一摘要候选集中每个语句进行排序,并获取每个语句的排序编号。需要说明的是,第一重要性值越高的语句的排序编号越小,第一重要性值越低的语句的排序编号越大。
S1042、从所述第一摘要候选集中获取所述排序编号小于或等于预设的排序编号的语句,以形成候选语句集。
在对第一摘要候选集中每个语句进行排序之后,从第一摘要候选集中获取该排序编号小于或等于预设的排序编号的语句,以形成候选语句集。需要说明的是,上述预设的排序编号可基于实际情况进行设置,本申请对此不作具体限定。可选地,预设的排序编号为10,则从第一摘要候选集中获取排序编号小于或等于10的语句,以形成候选语句集。
S1043、将所述候选语句集中所述第一重要性值最高的语句移存至空白的摘要候选集,以更新所述摘要候选集和候选语句集。
具体地,服务器从候选语句集中获取第一重要性值最高的语句,并将该语句移存至预设的空白摘要候选集,以更新摘要候选集和候选语句集。例如,候选语句集包括5个语句,分别为语句A、语句B、语句C、语句D和语句E,且语句C的第一重要性值最高,则更新后的摘要候选集包括语句C,更新后的候选语句集包括语句A、语句B、语句D和语句E。
S1044、基于预设的MMR值计算公式,根据所述候选语句集中每个语句的第一重要性值,计算所述摘要候选集分别与所述候选语句集中每个语句各自对应的MMR值。
其中,MMR值用于表征候选语句集中的语句与摘要候选集之间的相似程度,预设的MMR值的计算公式为:
MMR i=α·W s(V i)-(1-α)·sim(i,set)
其中,MMR i为语句V i的MMR值,α为权重系数,取值范围可选为0-1,W s(V i)为语句V i的第一重要性值,set为候选语句集,sim(i,set)为语句V i与候选语句集set之间的语义相似度。根据候选语句集中每个语句的第一重要性值,和该MMR值的计算公式,即可计算得到摘要候选集分别与候选语句集中每个语句各自对应的MMR值。
具体地,对摘要候选集进行编码,得到摘要候选集对应的向量;分别对候选语句集中的每个语句进行编码,得到候选语句集中的每个语句各自对应的向量;计算摘要候选集对应的向量分别与候选语句集中的每个语句各自对应的向量之间的语义相似度;基于MMR值的计算公式,根据每个语义相似度和候选语句集中每个语句的第一重要性值,计算摘要候选集分别与候选语句集中每个语句各自对应的MMR值。例如,候选语句集中的一个语句的第一重要性值为x,且该语句与摘要候选集之间的相似度为s,则该语句与摘要候选集之间的MMR值为α·x-(1-α)·s。
其中,对摘要候选集进行编码,得到摘要候选集对应的向量的方式具体为:对该摘要候选集中的每个语句进行编码,得到该摘要候选集中的每个语句各自对应的语句向量;根据该摘要候选集中的每个语句各自对应的语句向量,计算平均向量,并将该平均向量作为该摘要候选集的向量。
S1045、将所述MMR值最高的语句移存至所述摘要候选集,以更新所述摘要候选集 和候选语句集。
在计算得到摘要候选集分别与候选语句集中每个语句各自对应的MMR值之后,服务器将MMR值最高的语句移存至该摘要候选集,以更新摘要候选集和候选语句集。例如,该摘要候选集包括语句C,候选语句集包括语句A、语句B、语句D和语句E,且MMR值最高的语句为语句E,则更新后的摘要候选集包括语句C和语句E,更新后的候选语句集包括语句A、语句B和语句D。
S1046、确定更新后的所述摘要候选集中的语句的数量是否达到预设语句个数。
服务器确定更新后的摘要候选集中的语句的数量是否达到预设语句个数,若更新后的摘要候选集中的语句数量未达到预设语句个数,则执行子步骤S1044,即基于预设的MMR值计算公式,根据所述候选语句集中每个语句的第一重要性值,计算所述摘要候选集分别与所述候选语句集中每个语句各自对应的MMR值。需要说明的是,上述预设语句个数可基于实际情况进行设置,本申请对此不作具体限定。
S1047、若更新后的所述摘要候选集中的语句的数量达到预设语句个数,则将更新后的所述摘要候选集作为第三摘要候选集。
如果更新后的摘要候选集中的语句数量达到预设语句个数,则将更新后的摘要候选集作为第三摘要候选集。例如,预设语句个数为5个,更新后的摘要候选集包括语句A、语句B、语句C、语句D和语句E,共计5个语句,此时摘要候选集中的语句数量达到预设语句个数,因此将包含语句A、语句B、语句C、语句D和语句E的摘要候选集作为第三摘要候选集。
可以理解的是,第四摘要候选集的提取方式与第三摘要候选集的提取方式类似,具体为:根据所述第二摘要候选集中每个语句的第二重要性值,对第一摘要候选集中每个语句进行排序,并获取每个语句的排序编号;从第二摘要候选集中获取所述排序编号小于或等于预设的排序编号的语句,以形成候选语句集;将候选语句集中第二重要性值最高的语句移存至空白的摘要候选集,以更新所述摘要候选集和候选语句集;基于预设的MMR值计算公式,根据候选语句集中每个语句的第二重要性值,计算摘要候选集分别与候选语句集中每个语句各自对应的MMR值,其中,MMR值用于表征候选语句集中的语句与摘要候选集之间的相似程度;将MMR值最高的语句移存至摘要候选集,以更新摘要候选集和候选语句集;确定更新后的摘要候选集中的语句的数量是否达到预设语句个数;若更新后的摘要候选集中的语句的数量未达到预设语句个数,则执行步骤:基于预设的MMR值计算公式,根据候选语句集中每个语句的第二重要性值,计算摘要候选集分别与候选语句集中每个语句各自对应的MMR值;若更新后的摘要候选集中的语句的数量达到预设语句个数,则将更新后的摘要候选集作为第四摘要候选集。
步骤S105、分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集。
服务器在得到第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集这四个摘要候选集之后,分别从第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集。需要说明的是,上述预设摘要语句数量小于预设语句个数,预设摘要语句数量可基于实际情况进行设置,本申请对此不作具体限定。
在一实施例中,根据重要性值的大小,分别对第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中的语句进行排序,并按照各语句的排序先后,分别从第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量写入融合摘要候选集。其中,重要性值越大,则排序越靠前,重要性值越小,则排序越靠后。
例如,第一摘要候选集为[A,B,C,D,E,F,G,H,I,J],第二摘要候选集为[A, B,C,D,E,G,H,I,J,K],第三摘要候选集为[C,D,E,F,G,H,I],第四摘要候选集为[D,E,G,H,I,J,K],预设摘要语句数量为5,则从第一摘要候选集选择的语句为[A,B,C,D,E],从第二摘要候选集选择的语句为[A,B,C,D,E],从第三摘要候选集选择的语句为[C,D,E,F,G],从第四摘要候选集选择的语句为[D,E,G,H,I],因此,融合摘要候选集为{[A,B,C,D,E],[A,B,C,D,E],[C,D,E,F,G],[D,E,G,H,I]}。
步骤S106、统计所述融合摘要候选集中各语句的出现次数,并根据各语句的出现次数,从所述融合摘要候选集中筛选出所述目标文本的摘要结果集。
在得到融合摘要候选集之后,统计融合摘要候选集中各语句的出现次数,并根据各语句的出现次数,从融合摘要候选集中筛选出目标文本的摘要结果集,即从融合摘要候选集中筛选出该出现次数大于或等于预设出现次数的语句作为目标文本的摘要结果集。其中,该出现次数为语句在融合摘要候选集中出现的次数。
例如,融合摘要候选集为{[A,B,C,D,E],[A,B,C,D,E],[C,D,E,F,G],[D,E,G,H,I]},则语句A的出现次数为2,语句B的出现次数为2,语句C的出现次数为3,语句D的出现次数为4,语句E的出现次数为4,语句F的出现次数为1,语句G的出现次数为2,语句H的出现次数为1,语句I的出现次数为1。
上述实施例提供的摘要提取方法,通过TextRank算法,根据语句集中每两个语句之间的句子相似度,从语句集中筛选出第一摘要候选集,且通过TextRank算法,根据语句集中每两个语句之间的余弦相似度,从语句集中筛选出第二摘要候选集,然后基于最大边缘相关MMR算法和预设语句个数,从第一摘要候选集筛选出第三摘要候选集以及从第二摘要候选集中筛选出第四摘要候选集,最后对这四个摘要候选集进行融合,以确定文本的摘要结果集,可以降低提取到的摘要语句之间的冗余性,有效的提高文本摘要提取的准确性。
请参照图3,图3为本申请实施例提供的另一种摘要提取方法的流程示意图。
如图3所示,该摘要提取方法包括步骤S201至208。
步骤S201、获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本。
当用户需要提取文本中的摘要时,可以通过终端设备将待提取摘要的文本上传至服务器,服务器对接收到的待提取摘要的文本进行语句拆分,得到初始语句集,并对初始语句集进行清洗,以去除初始语句集中的标点符号和停用词等字符,得到待提取摘要的文本的语句集。服务器定时或实时获取待提取摘要文本的语句集。
步骤S202、计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集。
在获取到目标文本的语句集之后,计算该语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据该语句集中每两个语句之间的句子相似度,从该语句集中筛选出第一摘要候选集。
步骤S203、计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集。
在获取到目标文本的语句集之后,计算该语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据该语句集中每两个语句之间的余弦相似度,从该语句集中筛选出第一摘要候选集。
步骤S204、基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集。
在筛选得到第一摘要候选集和第二摘要候选集之后,服务器基于最大边界相关(Maximal Marginal Relevance,MMR)算法和预设语句个数,从第一摘要候选集中筛选出第三摘要候选集以及从第二摘要候选集中筛选出第四摘要候选集。其中,第三摘要候选集为第一摘要候选集的子集,第四摘要候选集为第二摘要候选集的子集。需要说明的是,上 述预设语句个数可基于实际情况进行设置,本申请对此不作具体限定。通过MMR算法,可以消除语句之间的冗余性,提高摘要提取的准确性。
步骤S205、分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集。
服务器在得到第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集这四个摘要候选集之后,分别从第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集。需要说明的是,上述预设摘要语句数量小于预设语句个数,预设摘要语句数量可基于实际情况进行设置,本申请对此不作具体限定。
步骤S206、统计所述融合摘要候选集中各语句的出现次数,并确定所述出现次数大于预设出现次数的语句的个数是否大于或等于预设的摘要语句数量。
在得到融合摘要候选集之后,统计融合摘要候选集中各语句的出现次数,并确定该出现次数大于预设出现次数的语句的个数是否大于或等于预设的摘要语句数量。其中,该出现次数为语句在融合摘要候选集中出现的次数。需要说明的是,上述摘要语句数量可基于实际情况进行设置,本申请对此不作具体限定。
步骤S207、若所述出现次数大于预设出现次数的语句的个数大于或等于预设的摘要语句数量,则根据所述出现次数,对所述融合摘要候选集中的语句进行排序。
如果该出现次数大于预设出现次数的语句的个数大于或等于预设的摘要语句数量,则根据该出现次数,对融合摘要候选集中的语句进行排序。需要说明的是,出现次数越大的语句的排序越靠前,出现次数越小的语句的排序越靠后。
在一实施例中,若出现次数大于预设出现次数的语句的个数小于预设的摘要语句数量,则将融合摘要候选集中出现次数大于预设出现次数的语句移存至目标文本的摘要结果集中,以更新融合摘要候选集;获取更新后的融合摘要候选集中每个语句的重要性值,并根据重要性值,对更新后的融合摘要候选集中的语句进行排序;按照更新后的融合摘要候选集中每个语句的排序,依次从更新后的融合摘要候选集中选择语句写入摘要结果集中,直至摘要结果集中的语句的数量达到预设的摘要语句数量。
步骤S208、按照所述融合摘要候选集中每个语句的排序,依次从所述融合摘要候选集中选择语句写入所述目标文本的摘要结果集中,直至所述摘要结果集中的语句的数量达到预设的摘要语句数量。
在对融合摘要候选集中的语句排序后,按照该融合摘要候选集中每个语句的排序,依次从融合摘要候选集中选择语句写入目标文本的摘要结果集中,直至摘要结果集中的语句的数量达到预设的摘要语句数量。例如,融合摘要候选集为{[A,B,C,D,E],[A,B,C,D,E],[C,D,E,F,G],[D,E,G,H,I]},则语句A的出现次数为2,语句B的出现次数为2,语句C的出现次数为3,语句D的出现次数为4,语句E的出现次数为4,语句F的出现次数为1,语句G的出现次数为2,语句H的出现次数为1,语句I的出现次数为1,因此融合摘要候选集中各语句的排序为[D、E、C、A、B、G、F、H、I],摘要语句数量为5,且预设出现次数为2,则目标文本的摘要结果集为[D、E、C、A、B]。
上述实施例提供的摘要提取方法,通过TextRank算法,根据语句集中每两个语句之间的句子相似度,筛选出第一摘要候选集,且通过TextRank算法,根据每两个语句之间的余弦相似度,筛选出第二摘要候选集,然后基于MMR算法,从第一摘要候选集筛选出第三摘要候选集以及从第二摘要候选集中筛选出第四摘要候选集,并从这四个摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集;最后统计融合摘要候选集中各语句的出现次数,并在该出现次数大于预设出现次数的语句的个数大于或等于预设的摘要语句数量时,按照出现次数的大小顺序,从融合摘要候选集中选择语句写入目标文本的摘要结果集中,可以降低提取到的摘要语句之间的冗余性,有效的提高文本摘要提取的准确性。
请参照图4,图4为本申请实施例提供的一种摘要提取装置的示意性框图。
如图4所示,该摘要提取装置300,包括:获取模块301、第一摘要筛选模块302、第二摘要筛选模块303、第三摘要筛选模块304、选择模块305和摘要确定模块306。
获取模301,用于获取目标文本的语句集,其中,所述目标文本为待提取摘要的文本;
第一摘要筛选模块302,用于计算所述语句集中每两个语句之间的句子相似度,并基于TextRank算法,根据所述句子相似度,从所述语句集中筛选出第一摘要候选集;
第二摘要筛选模块303,用于计算所述语句集中每两个语句之间的余弦相似度,并基于TextRank算法,根据所述余弦相似度,从所述语句集中筛选出第二摘要候选集;
第三摘要筛选模块304,用于基于最大边缘相关MMR算法和预设语句个数,从所述第一摘要候选集中筛选出第三摘要候选集以及从所述第二摘要候选集中筛选出第四摘要候选集;
选择模块305,用于分别从所述第一摘要候选集、第二摘要候选集、第三摘要候选集和第四摘要候选集中选择预设摘要语句数量的语句,以形成融合摘要候选集;
摘要确定模块306,用于统计所述融合摘要候选集中各语句的出现次数,并根据各语句的出现次数,从所述融合摘要候选集中筛选出所述目标文本的摘要结果集。
在一个实施例中,所述第一摘要筛选模块302还用于:
统计所述语句集中每两个语句的相同词的数量和所述语句集中每个语句包含的词的个数;
根据所述语句集中每两个语句的相同词的数量和所述语句集中每个语句包含的词的个数,计算所述语句集中每两个语句的句子相似度;
基于TextRank算法,根据所述语句集中每两个语句之间的句子相似度,确定每个语句的第一重要性值,其中,所述第一重要性值用于表征语句在所述目标文本中的重要程度;
根据所述语句集中每个语句的第一重要性值,从所述语句集中筛选出第一摘要候选集。
在一个实施例中,所述第二摘要筛选模块303还用于:
对所述语句集中的每个语句进行编码,得到所述语句集中的每个语句各自对应的语句向量;
根据所述语句集中的每个语句各自对应的语句向量,计算所述语句集中每两个语句之间的余弦相似度;
基于TextRank算法,根据所述语句集中每两个语句之间的余弦相似度,确定每个语句的第二重要性值,其中,所述第二重要性值用于表征语句在所述目标文本中的重要程度;
根据所述语句集中每个语句的第二重要性值,从所述语句集中筛选出第二摘要候选集。
在一个实施例中,所述第三摘要筛选模块304还用于:
根据所述第一摘要候选集中每个语句的所述第一重要性值,对所述第一摘要候选集中每个语句进行排序,并获取每个语句的排序编号;
从所述第一摘要候选集中获取所述排序编号小于或等于预设的排序编号的语句,以形成候选语句集;
将所述候选语句集中所述第一重要性值最高的语句移存至空白的摘要候选集,以更新所述摘要候选集和候选语句集;
基于预设的MMR值计算公式,根据所述候选语句集中每个语句的第一重要性值,计算所述摘要候选集分别与所述候选语句集中每个语句各自对应的MMR值,其中,所述MMR值用于表征所述候选语句集中的语句与所述摘要候选集之间的相似程度;
将所述MMR值最高的语句移存至所述摘要候选集,以更新所述摘要候选集和候选语句集;
确定更新后的所述摘要候选集中的语句的数量是否达到预设语句个数;
若更新后的所述摘要候选集中的语句的数量未达到预设语句个数,则执行步骤:基于 所述MMR算法,基于预设的MMR值计算公式,根据所述候选语句集中每个语句的第一重要性值,计算所述摘要候选集分别与所述候选语句集中每个语句各自对应的MMR值;
若更新后的所述摘要候选集中的语句的数量达到预设语句个数,则将更新后的所述摘要候选集作为第三摘要候选集。
In one embodiment, the third abstract screening module 304 is further used to:
encode the abstract candidate set to obtain a vector corresponding to the abstract candidate set;
encode each sentence in the candidate sentence set to obtain a vector corresponding to each sentence in the candidate sentence set;
calculate the semantic similarity between the vector corresponding to the abstract candidate set and the vector corresponding to each sentence in the candidate sentence set;
calculate, according to each semantic similarity and the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set.
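A sketch of computing per-sentence MMR values from an encoded abstract candidate set follows; the encode callable stands in for whatever sentence/set encoder is used and is a hypothetical placeholder, as is the weight lam:

    import numpy as np

    def mmr_values(summary_text, candidate_sentences, importance, encode, lam=0.7):
        """Combine each candidate's first importance value with its semantic
        similarity to the current abstract candidate set (both encoded by
        `encode`) into an MMR value per candidate sentence."""
        summary_vec = encode(summary_text)  # one vector for the whole set
        values = {}
        for sent, imp in zip(candidate_sentences, importance):
            vec = encode(sent)
            sim = float(vec @ summary_vec /
                        (np.linalg.norm(vec) * np.linalg.norm(summary_vec) + 1e-12))
            # Reward importance, penalize similarity to the selected abstract.
            values[sent] = lam * imp - (1 - lam) * sim
        return values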
Please refer to FIG. 5, which is a schematic block diagram of another abstract extraction apparatus provided by an embodiment of this application.
As shown in FIG. 5, the abstract extraction apparatus 400 includes: an obtaining module 401, a first abstract screening module 402, a second abstract screening module 403, a third abstract screening module 404, a selection module 405, a determination module 406, a sorting module 407, and an abstract determination module 408.
The obtaining module 401 is used to obtain a sentence set of a target text, wherein the target text is the text of the abstract to be extracted;
The first abstract screening module 402 is used to calculate the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filter the first abstract candidate set from the sentence set according to the sentence similarity;
The second abstract screening module 403 is used to calculate the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filter the second abstract candidate set from the sentence set according to the cosine similarity;
The third abstract screening module 404 is used to filter, based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences, the third abstract candidate set from the first abstract candidate set and the fourth abstract candidate set from the second abstract candidate set;
The selection module 405 is used to select the preset number of abstract sentences from each of the first, second, third, and fourth abstract candidate sets, to form a fused abstract candidate set;
The determination module 406 is used to determine whether the number of sentences in the fused abstract candidate set whose occurrence count exceeds the preset occurrence count is greater than or equal to the preset number of abstract sentences;
The sorting module 407 is used to sort the sentences in the fused abstract candidate set by occurrence count if the number of sentences whose occurrence count exceeds the preset occurrence count is greater than or equal to the preset number of abstract sentences;
The abstract determination module 408 is used to select sentences from the fused abstract candidate set one by one, following the order of the sentences in the fused abstract candidate set, and write them into the abstract result set of the target text until the number of sentences in the abstract result set reaches the preset number of abstract sentences.
In one embodiment, the abstract determination module 408 is further used to:
if the number of sentences whose occurrence count exceeds the preset occurrence count is smaller than the preset number of abstract sentences, move the sentences whose occurrence count exceeds the preset occurrence count from the fused abstract candidate set into the abstract result set of the target text, thereby updating the fused abstract candidate set;
obtain the importance value of each sentence in the updated fused abstract candidate set, and sort the sentences in the updated fused abstract candidate set according to the importance values;
following the order of the sentences in the updated fused abstract candidate set, select sentences from the updated fused abstract candidate set one by one and write them into the abstract result set until the number of sentences in the abstract result set reaches the preset number of abstract sentences.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus described above and of its modules and units may refer to the corresponding processes in the foregoing embodiments of the abstract extraction method, and will not be repeated here.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, and the computer program may run on the computer device shown in FIG. 6.
Please refer to FIG. 6, which is a schematic structural block diagram of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal device.
As shown in FIG. 6, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions which, when executed, may cause the processor to perform the following steps:
obtaining a sentence set of a target text, wherein the target text is the text of the abstract to be extracted;
calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering the first abstract candidate set from the sentence set according to the sentence similarity;
calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering the second abstract candidate set from the sentence set according to the cosine similarity;
filtering, based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences, the third abstract candidate set from the first abstract candidate set and the fourth abstract candidate set from the second abstract candidate set;
selecting the preset number of abstract sentences from each of the first, second, third, and fourth abstract candidate sets, to form a fused abstract candidate set;
counting the occurrences of each sentence in the fused abstract candidate set, and filtering the abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence.
The processor is used to provide computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, the processor can be caused to perform any one of the abstract extraction methods.
For the specific implementation of the steps performed when the computer program is executed, reference may be made to the embodiments of the abstract extraction method of this application.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 6 is merely a block diagram of the partial structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution of this application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
An embodiment of this application further provides a computer-readable storage medium storing a computer program. The computer-readable storage medium may be non-volatile or volatile. The computer program includes program instructions which, when executed, implement the following steps:
obtaining a sentence set of a target text, wherein the target text is the text of the abstract to be extracted;
calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering the first abstract candidate set from the sentence set according to the sentence similarity;
calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering the second abstract candidate set from the sentence set according to the cosine similarity;
filtering, based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences, the third abstract candidate set from the first abstract candidate set and the fourth abstract candidate set from the second abstract candidate set;
selecting the preset number of abstract sentences from each of the first, second, third, and fourth abstract candidate sets, to form a fused abstract candidate set;
counting the occurrences of each sentence in the fused abstract candidate set, and filtering the abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence.
For the specific implementation of the steps performed when the program instructions are executed, reference may be made to the embodiments of the abstract extraction method of this application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
In another embodiment, for the abstract extraction method provided by this application, in order to further ensure the privacy and security of all the data mentioned above, all of the above data may also be stored in a node of a blockchain. For example, the first abstract candidate set, the second abstract candidate set, the target text, and so on may all be stored in blockchain nodes.
It should be noted that the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
It should be understood that the terms used in the specification of this application are for the purpose of describing particular embodiments only and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" used in the specification and the appended claims of this application refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, herein, the terms "comprise", "include", or any of their variants are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments. The above are only specific implementations of this application, but the protection scope of this application is not limited thereto; any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall all fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. An abstract extraction method, comprising:
    obtaining a sentence set of a target text, wherein the target text is the text of the abstract to be extracted;
    calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a first abstract candidate set from the sentence set according to the sentence similarity;
    calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a second abstract candidate set from the sentence set according to the cosine similarity;
    filtering, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set;
    selecting a preset number of abstract sentences from each of the first, second, third, and fourth abstract candidate sets, to form a fused abstract candidate set;
    counting the occurrences of each sentence in the fused abstract candidate set, and filtering an abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence.
  2. The abstract extraction method according to claim 1, wherein the calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a first abstract candidate set from the sentence set according to the sentence similarity comprises:
    counting the number of identical words shared by every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;
    calculating the sentence similarity between every two sentences in the sentence set according to the number of identical words shared by every two sentences and the number of words contained in each sentence;
    determining, based on the TextRank algorithm and according to the sentence similarity between every two sentences in the sentence set, a first importance value of each sentence, wherein the first importance value is used to characterize the importance of a sentence in the target text;
    filtering the first abstract candidate set from the sentence set according to the first importance value of each sentence in the sentence set.
  3. The abstract extraction method according to claim 1, wherein the calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a second abstract candidate set from the sentence set according to the cosine similarity comprises:
    encoding each sentence in the sentence set to obtain a sentence vector corresponding to each sentence in the sentence set;
    calculating the cosine similarity between every two sentences in the sentence set according to the sentence vector corresponding to each sentence;
    determining, based on the TextRank algorithm and according to the cosine similarity between every two sentences in the sentence set, a second importance value of each sentence, wherein the second importance value is used to characterize the importance of a sentence in the target text;
    filtering the second abstract candidate set from the sentence set according to the second importance value of each sentence in the sentence set.
  4. The abstract extraction method according to claim 2, wherein the filtering, based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences, a third abstract candidate set from the first abstract candidate set comprises:
    sorting the sentences in the first abstract candidate set according to the first importance value of each sentence, and obtaining a ranking number for each sentence;
    obtaining, from the first abstract candidate set, the sentences whose ranking numbers are less than or equal to a preset ranking number, to form a candidate sentence set;
    moving the sentence with the highest first importance value in the candidate sentence set into an empty abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
    calculating, based on a preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set, wherein the MMR value is used to characterize the degree of similarity between a sentence in the candidate sentence set and the abstract candidate set;
    moving the sentence with the highest MMR value into the abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
    determining whether the number of sentences in the updated abstract candidate set has reached the preset number of sentences;
    if the number of sentences in the updated abstract candidate set has not reached the preset number of sentences, returning to the step of calculating, based on the preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set;
    if the number of sentences in the updated abstract candidate set has reached the preset number of sentences, taking the updated abstract candidate set as the third abstract candidate set.
  5. The abstract extraction method according to claim 4, wherein the calculating, based on the preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set comprises:
    encoding the abstract candidate set to obtain a vector corresponding to the abstract candidate set;
    encoding each sentence in the candidate sentence set to obtain a vector corresponding to each sentence in the candidate sentence set;
    calculating the semantic similarity between the vector corresponding to the abstract candidate set and the vector corresponding to each sentence in the candidate sentence set;
    calculating, according to each semantic similarity and the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set.
  6. The abstract extraction method according to any one of claims 1 to 5, wherein the filtering an abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence comprises:
    determining whether the number of sentences whose occurrence count exceeds a preset occurrence count is greater than or equal to the preset number of abstract sentences;
    if the number of sentences whose occurrence count exceeds the preset occurrence count is greater than or equal to the preset number of abstract sentences, sorting the sentences in the fused abstract candidate set by occurrence count;
    following the order of the sentences in the fused abstract candidate set, selecting sentences from the fused abstract candidate set one by one and writing them into the abstract result set of the target text until the number of sentences in the abstract result set reaches the preset number of abstract sentences.
  7. The abstract extraction method according to claim 6, wherein after the determining whether the number of sentences whose occurrence count exceeds the preset occurrence count is greater than or equal to the preset number of abstract sentences, the method further comprises:
    if the number of sentences whose occurrence count exceeds the preset occurrence count is smaller than the preset number of abstract sentences, moving the sentences whose occurrence count exceeds the preset occurrence count from the fused abstract candidate set into the abstract result set of the target text, thereby updating the fused abstract candidate set;
    obtaining the importance value of each sentence in the updated fused abstract candidate set, and sorting the sentences in the updated fused abstract candidate set according to the importance values;
    following the order of the sentences in the updated fused abstract candidate set, selecting sentences from the updated fused abstract candidate set one by one and writing them into the abstract result set until the number of sentences in the abstract result set reaches the preset number of abstract sentences.
  8. An abstract extraction apparatus, comprising:
    an obtaining module, used to obtain a sentence set of a target text, wherein the target text is the text of the abstract to be extracted;
    a first abstract screening module, used to calculate the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filter a first abstract candidate set from the sentence set according to the sentence similarity;
    a second abstract screening module, used to calculate the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filter a second abstract candidate set from the sentence set according to the cosine similarity;
    a third abstract screening module, used to filter, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set;
    a selection module, used to select a preset number of abstract sentences from each of the first, second, third, and fourth abstract candidate sets, to form a fused abstract candidate set;
    an abstract determination module, used to count the occurrences of each sentence in the fused abstract candidate set, and filter an abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence.
  9. A computer device, comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the following steps:
    obtaining a sentence set of a target text, wherein the target text is the text of the abstract to be extracted;
    calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a first abstract candidate set from the sentence set according to the sentence similarity;
    calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a second abstract candidate set from the sentence set according to the cosine similarity;
    filtering, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set;
    selecting a preset number of abstract sentences from each of the first, second, third, and fourth abstract candidate sets, to form a fused abstract candidate set;
    counting the occurrences of each sentence in the fused abstract candidate set, and filtering an abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence.
  10. The computer device according to claim 9, wherein the calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a first abstract candidate set from the sentence set according to the sentence similarity comprises:
    counting the number of identical words shared by every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;
    calculating the sentence similarity between every two sentences in the sentence set according to the number of identical words shared by every two sentences and the number of words contained in each sentence;
    determining, based on the TextRank algorithm and according to the sentence similarity between every two sentences in the sentence set, a first importance value of each sentence, wherein the first importance value is used to characterize the importance of a sentence in the target text;
    filtering the first abstract candidate set from the sentence set according to the first importance value of each sentence in the sentence set.
  11. The computer device according to claim 9, wherein the calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a second abstract candidate set from the sentence set according to the cosine similarity comprises:
    encoding each sentence in the sentence set to obtain a sentence vector corresponding to each sentence in the sentence set;
    calculating the cosine similarity between every two sentences in the sentence set according to the sentence vector corresponding to each sentence;
    determining, based on the TextRank algorithm and according to the cosine similarity between every two sentences in the sentence set, a second importance value of each sentence, wherein the second importance value is used to characterize the importance of a sentence in the target text;
    filtering the second abstract candidate set from the sentence set according to the second importance value of each sentence in the sentence set.
  12. The computer device according to claim 10, wherein the filtering, based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences, a third abstract candidate set from the first abstract candidate set comprises:
    sorting the sentences in the first abstract candidate set according to the first importance value of each sentence, and obtaining a ranking number for each sentence;
    obtaining, from the first abstract candidate set, the sentences whose ranking numbers are less than or equal to a preset ranking number, to form a candidate sentence set;
    moving the sentence with the highest first importance value in the candidate sentence set into an empty abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
    calculating, based on a preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set, wherein the MMR value is used to characterize the degree of similarity between a sentence in the candidate sentence set and the abstract candidate set;
    moving the sentence with the highest MMR value into the abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
    determining whether the number of sentences in the updated abstract candidate set has reached the preset number of sentences;
    if the number of sentences in the updated abstract candidate set has not reached the preset number of sentences, returning to the step of calculating, based on the preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set;
    if the number of sentences in the updated abstract candidate set has reached the preset number of sentences, taking the updated abstract candidate set as the third abstract candidate set.
  13. The computer device according to claim 12, wherein the calculating, based on the preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set comprises:
    encoding the abstract candidate set to obtain a vector corresponding to the abstract candidate set;
    encoding each sentence in the candidate sentence set to obtain a vector corresponding to each sentence in the candidate sentence set;
    calculating the semantic similarity between the vector corresponding to the abstract candidate set and the vector corresponding to each sentence in the candidate sentence set;
    calculating, according to each semantic similarity and the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set.
  14. The computer device according to any one of claims 9 to 13, wherein the filtering an abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence comprises:
    determining whether the number of sentences whose occurrence count exceeds a preset occurrence count is greater than or equal to the preset number of abstract sentences;
    if the number of sentences whose occurrence count exceeds the preset occurrence count is greater than or equal to the preset number of abstract sentences, sorting the sentences in the fused abstract candidate set by occurrence count;
    following the order of the sentences in the fused abstract candidate set, selecting sentences from the fused abstract candidate set one by one and writing them into the abstract result set of the target text until the number of sentences in the abstract result set reaches the preset number of abstract sentences.
  15. The computer device according to claim 14, wherein after the determining whether the number of sentences whose occurrence count exceeds the preset occurrence count is greater than or equal to the preset number of abstract sentences, the steps further comprise:
    if the number of sentences whose occurrence count exceeds the preset occurrence count is smaller than the preset number of abstract sentences, moving the sentences whose occurrence count exceeds the preset occurrence count from the fused abstract candidate set into the abstract result set of the target text, thereby updating the fused abstract candidate set;
    obtaining the importance value of each sentence in the updated fused abstract candidate set, and sorting the sentences in the updated fused abstract candidate set according to the importance values;
    following the order of the sentences in the updated fused abstract candidate set, selecting sentences from the updated fused abstract candidate set one by one and writing them into the abstract result set until the number of sentences in the abstract result set reaches the preset number of abstract sentences.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following steps:
    obtaining a sentence set of a target text, wherein the target text is the text of the abstract to be extracted;
    calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a first abstract candidate set from the sentence set according to the sentence similarity;
    calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a second abstract candidate set from the sentence set according to the cosine similarity;
    filtering, based on the maximal marginal relevance (MMR) algorithm and a preset number of sentences, a third abstract candidate set from the first abstract candidate set and a fourth abstract candidate set from the second abstract candidate set;
    selecting a preset number of abstract sentences from each of the first, second, third, and fourth abstract candidate sets, to form a fused abstract candidate set;
    counting the occurrences of each sentence in the fused abstract candidate set, and filtering an abstract result set of the target text from the fused abstract candidate set according to the occurrence count of each sentence.
  17. The computer-readable storage medium according to claim 16, wherein the calculating the sentence similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a first abstract candidate set from the sentence set according to the sentence similarity comprises:
    counting the number of identical words shared by every two sentences in the sentence set and the number of words contained in each sentence in the sentence set;
    calculating the sentence similarity between every two sentences in the sentence set according to the number of identical words shared by every two sentences and the number of words contained in each sentence;
    determining, based on the TextRank algorithm and according to the sentence similarity between every two sentences in the sentence set, a first importance value of each sentence, wherein the first importance value is used to characterize the importance of a sentence in the target text;
    filtering the first abstract candidate set from the sentence set according to the first importance value of each sentence in the sentence set.
  18. The computer-readable storage medium according to claim 16, wherein the calculating the cosine similarity between every two sentences in the sentence set, and based on the TextRank algorithm, filtering a second abstract candidate set from the sentence set according to the cosine similarity comprises:
    encoding each sentence in the sentence set to obtain a sentence vector corresponding to each sentence in the sentence set;
    calculating the cosine similarity between every two sentences in the sentence set according to the sentence vector corresponding to each sentence;
    determining, based on the TextRank algorithm and according to the cosine similarity between every two sentences in the sentence set, a second importance value of each sentence, wherein the second importance value is used to characterize the importance of a sentence in the target text;
    filtering the second abstract candidate set from the sentence set according to the second importance value of each sentence in the sentence set.
  19. The computer-readable storage medium according to claim 17, wherein the filtering, based on the maximal marginal relevance (MMR) algorithm and the preset number of sentences, a third abstract candidate set from the first abstract candidate set comprises:
    sorting the sentences in the first abstract candidate set according to the first importance value of each sentence, and obtaining a ranking number for each sentence;
    obtaining, from the first abstract candidate set, the sentences whose ranking numbers are less than or equal to a preset ranking number, to form a candidate sentence set;
    moving the sentence with the highest first importance value in the candidate sentence set into an empty abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
    calculating, based on a preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set, wherein the MMR value is used to characterize the degree of similarity between a sentence in the candidate sentence set and the abstract candidate set;
    moving the sentence with the highest MMR value into the abstract candidate set, thereby updating the abstract candidate set and the candidate sentence set;
    determining whether the number of sentences in the updated abstract candidate set has reached the preset number of sentences;
    if the number of sentences in the updated abstract candidate set has not reached the preset number of sentences, returning to the step of calculating, based on the preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set;
    if the number of sentences in the updated abstract candidate set has reached the preset number of sentences, taking the updated abstract candidate set as the third abstract candidate set.
  20. The computer-readable storage medium according to claim 19, wherein the calculating, based on the preset MMR value calculation formula and according to the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set comprises:
    encoding the abstract candidate set to obtain a vector corresponding to the abstract candidate set;
    encoding each sentence in the candidate sentence set to obtain a vector corresponding to each sentence in the candidate sentence set;
    calculating the semantic similarity between the vector corresponding to the abstract candidate set and the vector corresponding to each sentence in the candidate sentence set;
    calculating, according to each semantic similarity and the first importance value of each sentence in the candidate sentence set, the MMR value of the abstract candidate set with respect to each sentence in the candidate sentence set.
PCT/CN2020/112340 2020-02-27 2020-08-30 Abstract extraction method, apparatus, device, and computer-readable storage medium WO2021169217A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010125189.2A CN111507090A (zh) 2020-02-27 2020-02-27 Abstract extraction method, apparatus, device, and computer-readable storage medium
CN202010125189.2 2020-02-27

Publications (1)

Publication Number Publication Date
WO2021169217A1 true WO2021169217A1 (zh) 2021-09-02

Family

ID=71868960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112340 WO2021169217A1 (zh) 2020-02-27 2020-08-30 Abstract extraction method, apparatus, device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111507090A (zh)
WO (1) WO2021169217A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595684A (zh) * 2022-02-11 2022-06-07 北京三快在线科技有限公司 Summary generation method and apparatus, electronic device, and storage medium
CN115438654A (zh) * 2022-11-07 2022-12-06 华东交通大学 Article title generation method and apparatus, storage medium, and electronic device

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507090A (zh) 2020-02-27 2020-08-07 平安科技(深圳)有限公司 Abstract extraction method, apparatus, device, and computer-readable storage medium
CN112307738B (zh) 2020-11-11 2024-06-14 北京沃东天骏信息技术有限公司 Method and apparatus for processing text
CN114203169A (zh) 2022-01-26 2022-03-18 合肥讯飞数码科技有限公司 Speech recognition result determination method, apparatus, device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011150515A (ja) * 2010-01-21 2011-08-04 Nippon Telegr & Teleph Corp <Ntt> Text summarization apparatus, text summarization method, and text summarization program
CN109977219A (zh) * 2019-03-19 2019-07-05 国家计算机网络与信息安全管理中心 Method and apparatus for automatic text summary generation based on heuristic rules
CN110362674A (zh) * 2019-07-18 2019-10-22 中国搜索信息科技股份有限公司 Extractive summary generation method for microblog news based on a convolutional neural network
CN110837556A (zh) * 2019-10-30 2020-02-25 深圳价值在线信息科技股份有限公司 Summary generation method and apparatus, terminal device, and storage medium
CN111507090A (zh) * 2020-02-27 2020-08-07 平安科技(深圳)有限公司 Abstract extraction method, apparatus, device, and computer-readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7607083B2 (en) * 2000-12-12 2009-10-20 Nec Corporation Test summarization using relevance measures and latent semantic analysis
CN105868175A (zh) * 2015-12-03 2016-08-17 乐视网信息技术(北京)股份有限公司 Summary generation method and apparatus
CN107766419B (zh) * 2017-09-08 2021-08-31 广州汪汪信息技术有限公司 TextRank document summarization method and apparatus based on threshold denoising


Also Published As

Publication number Publication date
CN111507090A (zh) 2020-08-07

Similar Documents

Publication Publication Date Title
WO2021169217A1 (zh) Abstract extraction method, apparatus, device, and computer-readable storage medium
WO2019136993A1 (zh) Text similarity calculation method and apparatus, computer device, and storage medium
WO2021114810A1 (zh) Graph-structure-based official document recommendation method and apparatus, computer device, and medium
WO2021164231A1 (zh) Official document abstract extraction method, apparatus, and device, and computer-readable storage medium
US11669795B2 (en) Compliance management for emerging risks
CN109376273B (zh) Enterprise information graph construction method and apparatus, computer device, and storage medium
TW202029079A (zh) Abnormal group identification method and apparatus
CN110263311B (zh) Web page generation method and device
CN111814770A (zh) Content keyword extraction method for news videos, terminal device, and medium
CN111460153A (zh) Hot topic extraction method and apparatus, terminal device, and storage medium
WO2017206376A1 (zh) Search method and apparatus, and non-volatile computer storage medium
US10311093B2 (en) Entity resolution from documents
CN108959259B (zh) New word discovery method and system
WO2021189845A1 (zh) Time series anomaly detection method, apparatus, and device, and readable storage medium
CN111767713A (zh) Keyword extraction method and apparatus, electronic device, and storage medium
CN114547257B (zh) Similar case matching method and apparatus, computer device, and storage medium
KR101973949B1 (ko) Method and apparatus for optimizing de-identified data according to purpose
JP2019204246A (ja) Learning data creation method and learning data creation apparatus
US20210042363A1 (en) Search pattern suggestions for large datasets
CN108875050B (zh) Text-oriented digital forensic analysis method and apparatus, and computer-readable medium
WO2022116444A1 (zh) Text classification method and apparatus, computer device, and medium
CN104199924B (zh) Method and apparatus for selecting web tables having a snapshot relationship
WO2019136799A1 (zh) Data discretization method and apparatus, computer device, and storage medium
CN115544214B (zh) Event processing method and device, and computer-readable storage medium
CN110738048A (zh) Keyword extraction method and apparatus, and terminal device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921888

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921888

Country of ref document: EP

Kind code of ref document: A1