CN110674283A - Intelligent extraction method and device of text abstract, computer equipment and storage medium - Google Patents

Intelligent extraction method and device of text abstract, computer equipment and storage medium

Info

Publication number
CN110674283A
CN110674283A CN201910752285.7A
Authority
CN
China
Prior art keywords
feature
characteristic
sentences
words
weighting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910752285.7A
Other languages
Chinese (zh)
Inventor
杨春春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd filed Critical Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN201910752285.7A priority Critical patent/CN110674283A/en
Publication of CN110674283A publication Critical patent/CN110674283A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for intelligently extracting a text abstract, a computer device and a storage medium, wherein the method comprises the following steps: acquiring a plurality of feature sentences from a plurality of texts, and segmenting each feature sentence into feature words to obtain a plurality of feature words; classifying the plurality of feature words into different clusters through cluster analysis; assigning the feature sentences to which each feature word belongs to the corresponding clusters; and extracting a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts. The clustering process comprises: representing each of the plurality of feature words as a word vector to obtain a plurality of feature vectors; weighting each feature vector according to its importance to obtain a plurality of weighted vectors; calculating the similarity between every two weighted vectors; and performing a clustering operation according to the similarities to obtain the number of cluster centers, and dividing the plurality of feature words into clusters according to that number.

Description

Intelligent extraction method and device of text abstract, computer equipment and storage medium
Technical Field
The invention relates to the technical field of data mining, and in particular to an intelligent extraction method and device for text abstracts, a computer device and a storage medium.
Background
Automatic text summarization is a difficult task in natural language processing. In essence, a text abstract is an information filter: the output text is far shorter than the input text but carries its main information. By the number of input texts, summarization can be divided into single-text summarization and multi-text summarization; the former is the basis of the latter, but multi-text summarization is not a simple superposition of single-text results. Single-text summarization is often applied to filtering news information, while multi-text summarization has great potential in search engines, at the cost of greater difficulty.
The abstracts extracted by traditional multi-text summarization algorithms are highly redundant and fail to reflect the overall content of all the texts, easily leading to defects such as loss of the topic center, low topic coverage, poor coherence, and long processing time.
Disclosure of Invention
The invention aims to provide a method and a device for intelligently extracting a text abstract, a computer device and a storage medium, which solve the problems in the prior art.
In order to achieve the above object, the present invention provides an intelligent extraction method for text abstracts, comprising:
acquiring a plurality of feature sentences from a plurality of texts, and segmenting each feature sentence into feature words to obtain a plurality of feature words;
classifying the plurality of feature words into different clusters through cluster analysis;
assigning the feature sentences to which each feature word belongs to the corresponding clusters;
and extracting a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts.
According to the intelligent extraction method provided by the invention, the step of classifying the plurality of feature words into different clusters through cluster analysis comprises:
representing each of the plurality of feature words as a word vector to obtain a plurality of feature vectors;
weighting each feature vector according to its importance to obtain a plurality of weighted vectors;
calculating the similarity between every two weighted vectors;
and performing a clustering operation according to the similarities to obtain the number of cluster centers, and dividing the plurality of feature words into clusters according to that number.
According to the intelligent extraction method provided by the invention, the step of weighting each feature vector according to its importance to obtain a plurality of weighted vectors comprises:
calculating a first weight of the feature vector based on the TF-IDF algorithm;
calculating a second weight of the feature vector based on the position at which the feature word appears in its feature sentence;
and multiplying the feature vector by the first weight and the second weight in sequence to obtain the weighted vector.
According to the intelligent extraction method provided by the invention, the step of assigning the feature sentences to which each feature word belongs to the corresponding clusters comprises:
marking the target feature sentences to which a target feature word belongs;
and dividing the target feature sentences into the cluster corresponding to the target feature word.
According to the intelligent extraction method provided by the invention, the step of extracting a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts comprises:
ranking all the feature sentences in each cluster in descending order of importance;
and extracting a fixed number of top-ranked feature sentences from each cluster and assembling them into the text abstract.
In order to achieve the above object, the present invention further provides an intelligent extraction device for text abstracts, comprising:
a feature word acquisition module, adapted to acquire a plurality of feature sentences from a plurality of texts and segment each feature sentence into a plurality of feature words;
a cluster analysis module, adapted to classify the feature words into different clusters through cluster analysis;
a cluster division module, adapted to assign the feature sentences to which each feature word belongs to the corresponding clusters;
and an assembly module, adapted to extract a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts.
According to the intelligent extraction device provided by the invention, the cluster analysis module comprises:
a vector characterization submodule, adapted to represent each of the plurality of feature words as a word vector to obtain a plurality of feature vectors;
a weighting submodule, adapted to weight each feature vector according to its importance to obtain a plurality of weighted vectors;
a similarity submodule, adapted to calculate the similarity between every two weighted vectors;
and a cluster division submodule, adapted to perform a clustering operation according to the similarities to obtain the number of cluster centers and divide the plurality of feature words into clusters according to that number.
According to the intelligent extraction device provided by the invention, the weighting submodule comprises:
a first weighting unit, adapted to calculate a first weight of the feature vector based on the TF-IDF algorithm;
a second weighting unit, adapted to calculate a second weight of the feature vector based on the position at which the feature word appears in the feature sentence;
and a weighted vector generating unit, adapted to multiply the feature vector by the first weight and the second weight in sequence to obtain the weighted vector.
To achieve the above object, the present invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
To achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
The intelligent extraction method and device for text abstracts, the computer device and the storage medium provided by the invention are suitable for extracting a comprehensive abstract from a plurality of texts. The method uniformly performs sentence segmentation and word segmentation on the texts, represents the obtained feature words as word vectors, and sets a weight for each feature vector according to the importance of the corresponding feature word, thereby generating a weighted vector for each feature word. It then calculates the similarities among all the weighted vectors and performs a clustering operation to obtain the number of cluster centers. The feature sentences are divided into different clusters according to the number of cluster centers; a corresponding number of feature sentences is then extracted from each cluster according to the preset number of abstract sentences and assembled into the abstract of the plurality of texts. The invention can effectively improve the quality of abstracts extracted from multiple texts, ensure more comprehensive abstract content, and avoid extracting content irrelevant to the central sentences.
Drawings
FIG. 1 is a flowchart of a first embodiment of an intelligent extraction method of the present invention;
FIG. 2 is a schematic diagram of program modules of a first embodiment of the intelligent extraction device according to the present invention;
FIG. 3 is a schematic diagram of the hardware structure of the first embodiment of the intelligent extraction device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The intelligent extraction method and device for text abstracts, the computer device and the storage medium provided by the invention are suitable for extracting an overall abstract from a plurality of texts. The method uniformly performs sentence segmentation and word segmentation on the texts, represents the obtained feature words as word vectors, and sets a weight for each feature vector according to the importance of the corresponding feature word, thereby generating a weighted vector for each feature word. It then calculates the similarities among all the weighted vectors and performs a clustering operation to obtain the number of cluster centers. The feature sentences are divided into different clusters according to the number of cluster centers; a corresponding number of feature sentences is then extracted from each cluster according to the preset number of abstract sentences and assembled into the overall abstract of the plurality of texts. The invention can effectively improve the quality of abstracts extracted from multiple texts, ensure more comprehensive abstract content, and avoid extracting content irrelevant to the central sentences.
Example one
Referring to FIG. 1, this embodiment provides an intelligent text abstract extraction method, which specifically includes the following steps:
s1, obtaining a plurality of characteristic sentences from the plurality of texts, and dividing each characteristic sentence into a plurality of characteristic words.
The method is particularly suitable for intelligent abstract extraction of multiple texts. For example, there are three texts, which are respectively called a first text, a second text and a third text, and the method firstly divides the three texts into feature sentences respectively and further divides feature words on the basis of the feature sentences. For example, the first text, the second text and the third text respectively contain a number a, b and c of feature sentences, which are labeled, for example, as feature sentences 11,12, … 1a, 21,22, …, 2b, 31,32, … 3c, respectively. Each characteristic sentence comprises a plurality of characteristic words, for example, for the sentence "i love my home", the sentence can be divided into several characteristic words of "i", "love", "my", "home", respectively. The standard for dividing feature words in the Chinese language is mainly based on the grammatical function of words, such as nouns, quantifiers, predicates, verbs, conjunctions, prepositions and the like.
The method carries out pretreatment such as stop word removal, synonym merging and the like on the characteristic sentences and the characteristic words, removes sentences or phrases which have no practical significance, such as simple exclamation sentences, sound-making words, conjunctions, turning words, adverbs and the like which have no practical significance, and ensures the simplicity of the characteristic sentences and the characteristic words to the maximum extent.
The invention marks the positions of the characteristic words to clarify the specific positions of each characteristic word and mark the sentences in which text the characteristic words specifically exist. The specific form of the mark is not particularly limited, and those skilled in the art can use any recognizable symbol such as letters, numbers, english words, combinations of letters and numbers, and the like. For example, if the feature word "home" occurs in the first sentence of the first text, the second sentence of the second text and the third sentence of the third text, respectively, a marking (11,22,33) can be added to the feature word "home".
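The following Python sketch illustrates step S1 under stated assumptions: sentences are split on common punctuation, words are segmented with the jieba library, and each feature word is marked with its (text, sentence) positions. The helper names split_sentences and build_word_marks are illustrative, not part of the patent disclosure:

```python
import re
from collections import defaultdict

import jieba  # widely used Chinese word-segmentation library


def split_sentences(text):
    """Split a text on common sentence terminators (assumed heuristic)."""
    return [s.strip() for s in re.split(r"[。！？!?]+", text) if s.strip()]


def build_word_marks(texts):
    """Return word marks and a sentence lookup for several input texts."""
    marks = defaultdict(list)   # {feature_word: [(text_id, sentence_id), ...]}
    sentences = {}              # {(text_id, sentence_id): sentence text}
    for t_id, text in enumerate(texts, start=1):
        for s_id, sent in enumerate(split_sentences(text), start=1):
            sentences[(t_id, s_id)] = sent
            for word in jieba.lcut(sent):
                marks[word].append((t_id, s_id))  # e.g. "家" -> [(1,1), (2,2), (3,3)]
    return marks, sentences
```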
S2: Classify the plurality of feature words into different clusters through cluster analysis.
Cluster analysis is the process of dividing data into different clusters such that objects in the same cluster are highly similar while objects in different clusters differ greatly. The purpose of this step is to classify the feature words into different clusters, where the data objects in each cluster share similar characteristics and differ from those in other clusters. After the feature words are divided into clusters, data objects are extracted from the different aspects they represent and combined into a new abstract, so that the topics of multiple documents are reflected concisely and comprehensively. The cluster analysis process is described in detail below.
S21: and respectively performing word vector representation on the plurality of feature words to obtain a plurality of feature vectors.
Each feature word can be converted into vector form using a space vector model; for example, each word in a sentence can be represented with the word2vec word-vector model. Its advantages are that it reduces the input dimensionality and, compared with traditional one-hot encoding or topic models, word vectors trained with word2vec make full use of the context of words and provide richer semantic information. After word2vec training, each basic word is characterized in a vector form that a computer can understand, such as [0.792, -0.177, -0.107, …]; the invention refers to feature words characterized in this vector form as feature vectors. Each feature word corresponds to a unique feature vector. A minimal sketch follows.
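A minimal sketch of step S21 using the gensim implementation of word2vec; the toy corpus and hyperparameter values (vector_size, window, sg) are illustrative assumptions, not values specified by the patent:

```python
# Requires: pip install gensim
from gensim.models import Word2Vec

# Toy corpus: each feature sentence as a list of segmented feature words.
tokenized = [["我", "爱", "我", "家"], ["文本", "摘要", "智能", "抽取"]]

# Train the word2vec model (skip-gram); hyperparameters are assumptions.
model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, sg=1)

feature_vector = model.wv["摘要"]  # a 100-dimensional numpy array
```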
S22: and weighting each feature vector according to the importance degree to obtain a plurality of weighted vectors.
Weighting the feature vectors is an important technical means adopted by the invention to improve the abstract extraction effect. It comprises two aspects: a first weight assigned according to the occurrence frequency of the feature word, and a second weight assigned according to the position at which the feature word occurs.
The first weight can be computed with the existing TF-IDF algorithm. The main idea of TF-IDF is: if a word or phrase appears frequently in one document (high TF) but rarely in other documents (high IDF), it is considered to have good discriminating power and is suitable for classification. TF-IDF is in fact the product TF × IDF, where TF is the term frequency and IDF is the inverse document frequency; TF represents the frequency with which a term appears in document d.
For example, if a document contains 100 words in total and the word "cow" appears 3 times, the term frequency of "cow" in that document is 3/100 = 0.03. The inverse document frequency is obtained by dividing the total number of documents in the collection by the number of documents containing the word "cow" and taking the logarithm. So if "cow" appears in 1,000 documents and the collection contains 10,000,000 documents, the inverse document frequency is lg(10,000,000/1,000) = 4, and the final TF-IDF score is 0.03 × 4 = 0.12.
The second weight is adjusted according to the position at which the feature word appears in its feature sentence. In general, the second weight is highest when the feature word occurs at the beginning of the sentence, intermediate when it occurs in the middle, and lowest when it occurs at the end. For example, the second weight may be 1 when a feature word appears at the beginning of a sentence, 0.6 in the middle, and 0.4 at the end.
Finally, the feature vector is multiplied by the first weight and the second weight in sequence to obtain the weighted vector, as in the sketch below.
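A minimal sketch of step S22 under the examples above: the first weight is TF-IDF with a base-10 logarithm (matching the lg example), the second weight uses the position values 1.0/0.6/0.4, and the two are multiplied onto the word vector in sequence. The function names are illustrative:

```python
import math


def tf_idf(word, doc_words, all_docs):
    """First weight: term frequency times base-10 inverse document frequency."""
    tf = doc_words.count(word) / len(doc_words)
    n_containing = sum(1 for doc in all_docs if word in doc)
    idf = math.log10(len(all_docs) / max(n_containing, 1))
    return tf * idf


def position_weight(word, sentence_words):
    """Second weight from the word's position, using the example values above."""
    i = sentence_words.index(word)
    if i == 0:
        return 1.0                       # beginning of sentence: highest
    if i == len(sentence_words) - 1:
        return 0.4                       # end of sentence: lowest
    return 0.6                           # middle of sentence


def weighted_vector(feature_vector, first_weight, second_weight):
    """Multiply the word2vec vector by both weights in sequence."""
    return feature_vector * first_weight * second_weight
```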
S23: Calculate the similarity between every two weighted vectors.
After the feature words have been represented as word vectors and weighted, the similarity of every pair of weighted vectors is calculated to measure how close they are. Many similarity measures exist in the prior art, including Euclidean distance, cosine similarity, and hash-based similarity, which are not enumerated here; the sketch below uses cosine similarity.
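A minimal sketch of step S23 using cosine similarity, one of the measures mentioned above; Euclidean or hash-based measures could be substituted:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two weighted vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_matrix(weighted_vectors):
    """Pairwise similarity between every two weighted vectors."""
    n = len(weighted_vectors)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = cosine_similarity(weighted_vectors[i], weighted_vectors[j])
    return sim
```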
S24: Perform a clustering operation according to the similarities to obtain the number of cluster centers, and divide the plurality of feature words into clusters according to that number.
In this step, a similarity matrix is constructed from the similarities obtained in step S23, where each column of the matrix represents a feature vector and each row represents a feature sentence. The similarity matrix is clustered by iterating continuously over the distance values between the weighted vectors, finally forming p cluster centers and thereby dividing the feature words into p clusters. One plausible implementation is sketched below.
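The patent does not name a specific clustering algorithm, only that iterating over the distances between weighted vectors yields p cluster centers. Affinity propagation matches that description, since it consumes a precomputed similarity matrix and derives the number of centers itself, so this sketch uses it as one plausible choice, not as the patent's prescribed method:

```python
from sklearn.cluster import AffinityPropagation

# sim: pairwise similarity matrix from step S23;
# feature_words: the list of feature words, in the same order as sim's rows.
clustering = AffinityPropagation(affinity="precomputed", random_state=0)
labels = clustering.fit_predict(sim)               # cluster label per feature word
p = len(clustering.cluster_centers_indices_)       # number of cluster centers
word_labels = dict(zip(feature_words, labels))     # {feature_word: cluster_id}
```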
S3: Assign the feature sentences to which each feature word belongs to the corresponding clusters.
Once the feature words have been divided into p clusters, the feature sentences in which they occur are correspondingly divided into the p clusters via the marks attached to the feature words.
For example, if the feature word "home" occurs in the first sentence of the first text, the second sentence of the second text and the third sentence of the third text, the mark (11, 22, 33) is attached to "home". If "home" is then assigned to the first cluster, the feature sentences (11, 22, 33) are correspondingly assigned to the first cluster.
Since each feature sentence contains several feature words, which may belong to different clusters, the same feature sentence may well be assigned to different clusters; that is, feature sentences may overlap across clusters. For ease of bookkeeping, the invention first allows such overlap when dividing the clusters; how the overlap is eliminated is described in the subsequent steps. A sketch of this assignment follows.
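A minimal sketch of step S3, reusing the marks built in step S1 and the word_labels mapping from step S24; the overlap noted above is deliberately preserved at this stage:

```python
from collections import defaultdict


def sentences_per_cluster(word_labels, marks):
    """word_labels: {feature_word: cluster_id}; marks: from step S1."""
    clusters = defaultdict(set)
    for word, cluster_id in word_labels.items():
        for text_id, sent_id in marks[word]:
            clusters[cluster_id].add((text_id, sent_id))
    return clusters  # a sentence may appear in several clusters here
```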
S4: Extract a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts.
This step extracts a corresponding number of feature sentences from each cluster to assemble the document abstract. Specifically, a preset abstract size is obtained first, namely the number of feature sentences the abstract should contain. This number is then divided evenly among the clusters to obtain the number of feature sentences to extract from each cluster. For example, if the abstract size is 10% of the total sentences, a 100-sentence text needs 10 abstract sentences; if the preceding steps produced 5 clusters, 2 sentences can be selected from each cluster on average.
As mentioned for step S3, duplicate feature sentences are allowed across clusters for ease of bookkeeping; the possible defects caused by this duplication are eliminated in the subsequent steps. One important elimination method is to rank the feature sentences in each cluster, from high to low, by the first weight, the second weight, or the product of the two. As discussed above, the same feature sentence may be assigned to different clusters because of the different feature words it contains, and since the corresponding feature words differ from cluster to cluster, the ranking of the same feature sentence usually differs across clusters. For example, the feature sentence "I love my home" is assigned to a first cluster because of the feature word "home" and, at the same time, to a second cluster because of the feature word "love". Assuming the first weight of "home" is 0.6 and that of "love" is 0.3, then when ranking by the first weight, the sentence "I love my home" is necessarily ranked higher in the first cluster than in the second. Such differentiated rankings largely avoid extracting the same feature sentence repeatedly.
Further, the invention de-duplicates the feature sentences extracted from the clusters: when a feature sentence appears two or more times, only one copy is kept and the others are deleted. This step eliminates the overlap of feature sentences across clusters that was allowed in step S3. Finally, all the feature sentences obtained through steps S1-S4 are assembled into the overall abstract of the multiple documents.
It should be noted that the foregoing defines how feature sentences are ranked within the same cluster; the invention does not specifically restrict the ordering of feature sentences extracted from different clusters. For example, suppose sentences 1 and 2 belong to a first cluster, sentences 3 and 4 to a second, and sentences 5 and 6 to a third. Because the three clusters differ greatly from one another, whichever cluster's sentences are placed first has little influence on the overall semantics: the order may be 1, 2, 3, 4, 5, 6, or 3, 4, 1, 2, 5, 6, or 5, 6, 3, 4, 1, 2, and so on. Of course, if necessary, the clusters can also be ordered by the weights of the feature words they contain, the weight being the product of the first and second weights defined above: the cluster containing the feature word with the larger weight is placed first, and the cluster whose feature word has the smaller weight is placed after it. Correspondingly, the feature sentences extracted from the different clusters are ordered according to this cluster order. A sketch of the extraction and de-duplication follows.
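A minimal sketch of the extraction and de-duplication in step S4; sentence_score stands for the chosen importance measure (e.g. the product of the first and second weights) and is an assumption, as is the per-cluster quota k:

```python
def extract_summary(clusters, sentence_score, k):
    """Take the k highest-scoring sentences per cluster, dropping duplicates."""
    summary, seen = [], set()
    for cluster_id in sorted(clusters):           # cluster order is flexible
        ranked = sorted(clusters[cluster_id], key=sentence_score, reverse=True)
        for sent_key in ranked[:k]:
            if sent_key not in seen:              # keep only one copy overall
                seen.add(sent_key)
                summary.append(sent_key)
    return summary
```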
Through the above steps, the resulting document abstract is clearly expressive, accurate and comprehensive; the whole extraction process is fast and efficient, and the quality of abstract extraction is effectively improved.
Referring still to FIG. 2, an intelligent text abstract extraction device is shown. In this embodiment, the intelligent extraction device 10 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the invention and the intelligent extraction method described above. The program modules referred to here are a series of computer program instruction segments that perform specific functions and that describe the execution of the intelligent extraction device 10 on the storage medium better than the program itself can. The following description specifically introduces the functions of the program modules of this embodiment:
the feature word obtaining module 11 is adapted to obtain a plurality of feature sentences from a plurality of texts, and divide feature words for each feature sentence to obtain a plurality of feature words. The invention marks the positions of the characteristic words to clarify the specific positions of each characteristic word and marks the sentence in which the characteristic word exists specifically.
The cluster analysis module 12 is adapted to classify the plurality of feature words into different clusters through cluster analysis. Wherein the data objects contained in each class cluster have similar characteristics while being distinct from the data objects in the other class clusters.
And the class cluster dividing module 13 is adapted to classify the feature sentences to which each feature word belongs into corresponding class clusters. On the basis that the feature words are divided into p class clusters, the feature sentences in which the feature words are located are correspondingly divided into p class clusters through marks contained in the feature words.
And the collection module 14 is suitable for extracting a fixed number of characteristic sentences from each class cluster to form an overall abstract of the plurality of texts. Specifically, a preset abstract number is obtained first, where the number refers to the number of all feature sentences included in the abstract. And then, the number of all the characteristic sentences is averagely distributed to each class cluster to obtain the number of the characteristic sentences needing to be extracted in each class cluster.
Further, the cluster analysis module 12 includes:
the vector characterization submodule 121 is adapted to perform word vector characterization on the plurality of feature words respectively to obtain a plurality of feature vectors; each feature word has a unique feature vector corresponding to it.
A weighting submodule 122, adapted to weight each of the feature vectors according to the degree of importance to obtain a plurality of weighted vectors; the weighting of the feature vector mainly comprises two aspects, wherein the first aspect is that the weighting of the first weight is carried out according to the occurrence frequency of the feature words, and the second aspect is that the weighting of the second weight is carried out according to the occurrence positions of the feature words.
A similarity submodule 123 adapted to calculate a similarity between each two weighted vectors;
and the cluster dividing submodule 124 is suitable for performing clustering operation according to the similarity to obtain the number of clustering centers, and dividing the plurality of feature words into a plurality of clusters according to the number of clustering centers. And constructing a feature similarity matrix according to the similarity obtained by the similarity submodule 123, wherein each column in the similarity matrix represents a feature vector, and each row represents a feature statement. And clustering the similarity matrix, and performing continuous loop iteration through the distance values among the weighted vectors to finally form p clustering centers so as to divide the feature into p clusters.
Further, the weighting sub-module 122 includes:
a first weighting unit 1221, adapted to calculate a first weight of the feature vector based on the TF-IDF algorithm;
a second weighting unit 1222, adapted to calculate a second weight of the feature vector based on the position at which the feature word appears in the feature sentence;
a weighted vector generating unit 1223, adapted to multiply the feature vector by the first weight and the second weight in sequence to obtain the weighted vector.
This embodiment also provides a computer device, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of multiple servers) capable of executing programs. As shown in FIG. 3, the computer device 20 of this embodiment at least includes, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus. It is noted that FIG. 3 only shows the computer device 20 with components 21-22, but it should be understood that not all of the shown components must be implemented; more or fewer components may be implemented instead.
In this embodiment, the memory 21 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 20. Of course, the memory 21 may also include both the internal storage unit and the external storage device of the computer device 20. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the computer device 20, such as the program code of the intelligent extraction device 10 of the first embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may in some embodiments be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is typically used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or to process data, for example to run the intelligent extraction device 10, so as to implement the intelligent extraction method of the first embodiment.
This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an app store, and the like, on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used to store the intelligent extraction device 10, which, when executed by a processor, implements the intelligent extraction method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example" or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An intelligent extraction method for text abstracts, characterized by comprising the following steps:
acquiring a plurality of feature sentences from a plurality of texts, and segmenting each feature sentence into feature words to obtain a plurality of feature words;
classifying the plurality of feature words into different clusters through cluster analysis;
assigning the feature sentences to which each feature word belongs to the corresponding clusters;
and extracting a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts.
2. The intelligent extraction method according to claim 1, wherein the step of classifying the plurality of feature words into different clusters through cluster analysis comprises:
representing each of the plurality of feature words as a word vector to obtain a plurality of feature vectors;
weighting each feature vector according to its importance to obtain a plurality of weighted vectors;
calculating the similarity between every two weighted vectors;
and performing a clustering operation according to the similarities to obtain the number of cluster centers, and dividing the plurality of feature words into clusters according to that number.
3. The intelligent extraction method according to claim 2, wherein the step of weighting each feature vector according to its importance to obtain a plurality of weighted vectors comprises:
calculating a first weight of the feature vector based on the TF-IDF algorithm;
calculating a second weight of the feature vector based on the position at which the feature word appears in a feature sentence;
and multiplying the feature vector by the first weight and the second weight in sequence to obtain the weighted vector.
4. The intelligent extraction method according to claim 2 or 3, wherein the step of assigning the feature sentences to which each feature word belongs to the corresponding clusters comprises:
marking the target feature sentences to which a target feature word belongs;
and dividing the target feature sentences into the cluster corresponding to the target feature word.
5. The intelligent extraction method according to claim 4, wherein the step of extracting a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts comprises:
ranking all the feature sentences in each cluster in descending order of importance;
and extracting a fixed number of top-ranked feature sentences from each cluster and assembling them into the text abstract.
6. An intelligent extraction device for text abstracts, characterized by comprising:
a feature word acquisition module, adapted to acquire a plurality of feature sentences from a plurality of texts and segment each feature sentence into a plurality of feature words;
a cluster analysis module, adapted to classify the feature words into different clusters through cluster analysis;
a cluster division module, adapted to assign the feature sentences to which each feature word belongs to the corresponding clusters;
and an assembly module, adapted to extract a fixed number of feature sentences from each cluster to form an overall abstract of the plurality of texts.
7. The intelligent extraction device according to claim 6, wherein the cluster analysis module comprises:
a vector characterization submodule, adapted to represent each of the plurality of feature words as a word vector to obtain a plurality of feature vectors;
a weighting submodule, adapted to weight each feature vector according to its importance to obtain a plurality of weighted vectors;
a similarity submodule, adapted to calculate the similarity between every two weighted vectors;
and a cluster division submodule, adapted to perform a clustering operation according to the similarities to obtain the number of cluster centers and divide the plurality of feature words into clusters according to that number.
8. The intelligent extraction device according to claim 7, wherein the weighting submodule comprises:
a first weighting unit, adapted to calculate a first weight of the feature vector based on the TF-IDF algorithm;
a second weighting unit, adapted to calculate a second weight of the feature vector based on the position at which the feature word appears in the feature sentence;
and a weighted vector generating unit, adapted to multiply the feature vector by the first weight and the second weight in sequence to obtain the weighted vector.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
CN201910752285.7A 2019-08-15 2019-08-15 Intelligent extraction method and device of text abstract, computer equipment and storage medium Pending CN110674283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910752285.7A CN110674283A (en) 2019-08-15 2019-08-15 Intelligent extraction method and device of text abstract, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910752285.7A CN110674283A (en) 2019-08-15 2019-08-15 Intelligent extraction method and device of text abstract, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110674283A true CN110674283A (en) 2020-01-10

Family

ID=69075339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910752285.7A Pending CN110674283A (en) 2019-08-15 2019-08-15 Intelligent extraction method and device of text abstract, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110674283A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347758A (en) * 2020-11-06 2021-02-09 中国平安人寿保险股份有限公司 Text abstract generation method and device, terminal equipment and storage medium
CN114386390A (en) * 2021-11-25 2022-04-22 马上消费金融股份有限公司 Data processing method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN101739426A (en) * 2008-11-13 2010-06-16 北京大学 Method and device for generating multi-document summary
CN108875049A (en) * 2018-06-27 2018-11-23 中国建设银行股份有限公司 text clustering method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN101739426A (en) * 2008-11-13 2010-06-16 北京大学 Method and device for generating multi-document summary
CN108875049A (en) * 2018-06-27 2018-11-23 中国建设银行股份有限公司 text clustering method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓箴 et al., "基于词汇链的多文档自动文摘研究" (Research on Multi-document Automatic Summarization Based on Lexical Chains), 《计算机与应用化学》 (Computers and Applied Chemistry), pages 1-3 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347758A (en) * 2020-11-06 2021-02-09 中国平安人寿保险股份有限公司 Text abstract generation method and device, terminal equipment and storage medium
CN112347758B (en) * 2020-11-06 2024-05-17 中国平安人寿保险股份有限公司 Text abstract generation method and device, terminal equipment and storage medium
CN114386390A (en) * 2021-11-25 2022-04-22 马上消费金融股份有限公司 Data processing method and device, computer equipment and storage medium
CN114386390B (en) * 2021-11-25 2022-12-06 马上消费金融股份有限公司 Data processing method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106649818B (en) Application search intention identification method and device, application search method and server
CN106649768B (en) Question-answer clarification method and device based on deep question-answer
US9323794B2 (en) Method and system for high performance pattern indexing
Wan et al. Exploiting neighborhood knowledge for single document summarization and keyphrase extraction
EP2092419B1 (en) Method and system for high performance data metatagging and data indexing using coprocessors
CN109471933B (en) Text abstract generation method, storage medium and server
EP3016002A1 (en) Non-factoid question-and-answer system and method
JP6231668B2 (en) Keyword expansion method and system and classification corpus annotation method and system
CN106776574B (en) User comment text mining method and device
CN109948121A (en) Article similarity method for digging, system, equipment and storage medium
US8825620B1 (en) Behavioral word segmentation for use in processing search queries
US10810245B2 (en) Hybrid method of building topic ontologies for publisher and marketer content and ad recommendations
US7555428B1 (en) System and method for identifying compounds through iterative analysis
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN105320646A (en) Incremental clustering based news topic mining method and apparatus thereof
CN106777236B (en) Method and device for displaying query result based on deep question answering
Wang et al. How preprocessing affects unsupervised keyphrase extraction
CN111291177A (en) Information processing method and device and computer storage medium
Weerasinghe et al. Feature Vector Difference based Authorship Verification for Open-World Settings.
CN110674283A (en) Intelligent extraction method and device of text abstract, computer equipment and storage medium
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN111639250B (en) Enterprise description information acquisition method and device, electronic equipment and storage medium
CN112417101A (en) Keyword extraction method and related device
CN109344397B (en) Text feature word extraction method and device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination