CN113377949A - Method and device for generating abstract of target object - Google Patents


Info

Publication number
CN113377949A
CN113377949A
Authority
CN
China
Prior art keywords
classification
similarity
sentences
key
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010161869.XA
Other languages
Chinese (zh)
Inventor
薛悦 (Xue Yue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010161869.XA
Publication of CN113377949A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/34: Browsing; Visualisation therefor
    • G06F 16/345: Summarisation for human users
    • G06F 16/35: Clustering; Classification
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for generating an abstract of a target object, relating to the field of computer technology. One embodiment of the method comprises: extracting key sentences from comment data of a plurality of target objects in a target area, calculating the similarity between key sentences belonging to the same target object, and clustering the key sentences to obtain a basic classification of each target object in the target-object dimension; calculating the similarity between first classification groups included in the basic classifications of different target objects, and clustering the first classification groups to obtain a final classification in the target-area dimension; and calculating the inverse document frequency of each second classification group contained in the final classification relative to the overall classification, calculating the score of each second classification group according to the inverse document frequency and a set weight, and selecting key sentences from the final classification as the abstract of the target object according to the scores. This embodiment reduces the occurrence of repeated sentences, generates high-quality and differentiated abstracts, and better highlights the characteristics of the target object.

Description

Method and device for generating abstract of target object
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for generating an abstract of a target object.
Background
To let a user quickly and comprehensively grasp the characteristics and related information of a target object, internet websites usually attach textual description information to the target object when presenting it to the user. This textual description is referred to herein as an abstract. In the prior art, an abstract is generally produced either by extracting it with the TextRank algorithm, or by segmenting users' comment data on the target object, counting the sentences that occur most frequently, and assembling those high-frequency sentences.
In the process of implementing the invention, the inventors found that the prior art has at least the following problems:
abstracts generated in these ways are too rigid and suffer from serious sentence homogenization; the sentence length cannot be controlled; and highly subjective sentences, or sentences containing specific words, may be extracted.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for generating an abstract of a target object, in which key sentences are first clustered in the target-object dimension, the primary clustering result is then clustered a second time in the target-area dimension, and suitable key sentences are selected as the abstract in combination with the inverse document frequency, so that repeated sentences are reduced, a high-quality and differentiated abstract is generated, and the characteristics of the target object are better highlighted.
To achieve the above object, according to an aspect of the embodiments of the present invention, a method for generating a summary of a target object is provided.
The method for generating an abstract of a target object according to an embodiment of the invention comprises: extracting key sentences from comment data of a plurality of target objects in a target area, calculating the similarity between key sentences belonging to the same target object to obtain a first similarity, and clustering the key sentences according to the first similarity to obtain a basic classification of each target object in the target-object dimension; calculating the similarity between first classification groups included in the basic classifications of different target objects to obtain a second similarity, and clustering the first classification groups according to the second similarity to obtain a final classification in the target-area dimension; and calculating the inverse document frequency of each second classification group contained in the final classification relative to the overall classification, calculating the score of each second classification group according to the inverse document frequency and a set weight, and selecting key sentences from the final classification as the abstract of the target object according to the scores.
Optionally, calculating the inverse document frequency of a second classification group contained in the final classification relative to the overall classification includes: counting the total number of first classification groups contained in the basic classifications to obtain the number of all classification groups; summing the occurrence counts of all key sentences in the second classification group to obtain the occurrence count of the second classification group; and taking the number of all classification groups as the numerator and the occurrence count of the second classification group plus 1 as the denominator, then taking the logarithm of the quotient to obtain the inverse document frequency.
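The inverse-document-frequency formula stated above can be sketched in Python as follows; the function and argument names are illustrative, and the +1 in the denominator is the smoothing term from this paragraph:

```python
import math

def inverse_document_frequency(total_group_count: int, group_occurrences: int) -> float:
    """IDF of a second classification group relative to the overall classification:
    the total number of first classification groups is the numerator and the
    occurrence count of the group plus 1 is the denominator; the +1 avoids
    division by zero for groups whose sentences never occur."""
    return math.log(total_group_count / (group_occurrences + 1))
```

With 100 classification groups in total, a group whose sentences occur 9 times gets log(100 / 10), so rarer groups score higher.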
Optionally, calculating the score of the second classification group according to the inverse document frequency and the set weight includes: computing, according to the set weight, a weighted sum of the inverse document frequency and the occurrence count of the second classification group to obtain the score of the second classification group.
Optionally, selecting key sentences from the final classification according to the scores includes: sorting the second classification groups contained in the final classification in descending order of score and selecting the first K second classification groups, where K is a positive integer; and, within each of the first K second classification groups, sorting the key sentences by their occurrence counts and selecting the key sentence with the highest occurrence count from each group.
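The scoring and top-K selection of the last two paragraphs can be sketched together as below; the weight values and the data layout are illustrative assumptions, not fixed by the patent:

```python
def select_summary(groups, idf_weight=0.7, count_weight=0.3, k=3):
    """groups: list of (key_sentence_counts, idf) pairs, one per second
    classification group, where key_sentence_counts maps each key sentence to
    its occurrence count. Scores each group as a weighted sum of its IDF and
    its total occurrence count, keeps the K highest-scoring groups, and returns
    the most frequent key sentence of each kept group."""
    scored = []
    for counts, idf in groups:
        occurrences = sum(counts.values())
        score = idf_weight * idf + count_weight * occurrences
        scored.append((score, counts))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # descending by score
    return [max(counts, key=counts.get) for _, counts in scored[:k]]
```

Because the IDF term penalizes very frequent groups while the count term penalizes very rare ones, the selected sentences tend to have intermediate occurrence counts, matching the stated aim of a differentiated abstract.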
Optionally, extracting key sentences from the comment data of the plurality of target objects in the target area includes: treating each sentence contained in a piece of comment data as a node, calculating the similarity between nodes, and constructing a node connection graph according to the similarities; and iteratively calculating node weights from the node connection graph and the similarities until the weights converge, then selecting the sentence corresponding to the highest-weighted node at convergence as a key sentence.
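A compact sketch of this TextRank-style extraction, assuming an arbitrary pairwise similarity function (the patent does not fix one) and standard PageRank-style damping:

```python
def textrank_key_sentence(sentences, similarity, damping=0.85, tol=1e-6, max_iter=100):
    """Treat each sentence as a node of a similarity-weighted graph and run
    PageRank-style weight updates until convergence; the sentence whose node
    has the highest converged weight is returned as the key sentence."""
    n = len(sentences)
    sim = [[0.0 if i == j else similarity(a, b)
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    out_sum = [sum(row) or 1.0 for row in sim]  # guard against isolated nodes
    weights = [1.0] * n
    for _ in range(max_iter):
        new = [(1 - damping) + damping * sum(sim[j][i] / out_sum[j] * weights[j]
                                             for j in range(n))
               for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, weights)) < tol
        weights = new
        if converged:
            break
    return sentences[max(range(n), key=weights.__getitem__)]
```

Sentences similar to many other sentences accumulate weight, so the extracted key sentence reflects the core content of the review.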
Optionally, before the step of extracting key sentences from the comment data of the plurality of target objects in the target area, the method further includes: preprocessing each piece of comment data, where the preprocessing comprises filtering according to a set first filtering rule and merging, and the merging comprises: performing syntactic analysis on the sentences contained in the comment data to obtain their syntactic components; and, using a set symbol as a segmentation mark, judging the syntactic structure of the first clause after the symbol and merging clauses with non-subject-predicate and non-head structures upward into the preceding clause. Accordingly, extracting key sentences from the comment data of the plurality of target objects in the target area comprises: extracting key sentences from the preprocessed comment data.
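The merging step can be illustrated with a simplified sketch; a real implementation would run a syntactic parser on Chinese text, so the `has_subject_predicate` predicate below is a stand-in assumption:

```python
import re

def merge_clauses(comment, has_subject_predicate):
    """Split a review on punctuation delimiters, then merge every clause that
    lacks a subject-predicate structure upward into the preceding clause."""
    clauses = [c.strip() for c in re.split(r"[,.!?;]", comment) if c.strip()]
    merged = []
    for clause in clauses:
        if merged and not has_subject_predicate(clause):
            merged[-1] = merged[-1] + " " + clause  # merge upward
        else:
            merged.append(clause)
    return merged
```

A fragment like "very bright" has no subject-predicate structure, so it is folded into the clause before it rather than kept as a standalone sentence.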
Optionally, before the step of calculating the similarity between the first classification groups included in the basic classifications of different target objects, the method further comprises: performing sentiment analysis on the key sentences contained in the basic classification of each target object; and filtering the key sentences according to the sentiment analysis results and a set second filtering rule to obtain an optimized classification of each target object in the target-object dimension, where the second filtering rule retains only first classification groups in which the number of key sentences with positive sentiment exceeds the number with negative sentiment, and within a retained group keeps only the key sentences with positive sentiment. Accordingly, calculating the similarity between the first classification groups to obtain the second similarity and clustering the first classification groups according to the second similarity comprises: calculating the similarity between the classification groups contained in the optimized classifications of different target objects, and clustering those classification groups according to the obtained similarity.
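A sketch of this second filtering rule, with `sentiment` standing in for a real sentiment analyzer (an assumption for illustration):

```python
def filter_by_sentiment(classification, sentiment):
    """Keep only first classification groups in which positive key sentences
    outnumber negative ones, and within a kept group keep only the positive
    sentences. `sentiment` maps a sentence to 'positive' or 'negative'."""
    optimized = []
    for group in classification:
        positive = [s for s in group if sentiment(s) == "positive"]
        negative = [s for s in group if sentiment(s) == "negative"]
        if len(positive) > len(negative):
            optimized.append(positive)
    return optimized
```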
Optionally, after the step of selecting key sentences from the final classification as the abstract of the target object according to the scores, the method further includes: judging whether the abstract contains a set keyword, or whether the abstract matches a regular expression of a set format; and if so, correcting the abstract with set replacement information.
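The correction step can be sketched as follows; the keyword table and regular-expression patterns are illustrative assumptions, since the patent leaves their concrete contents to configuration:

```python
import re

def correct_summary(summary, keyword_repl, pattern_repl):
    """If the summary contains a configured keyword, or matches a configured
    regular-expression format, substitute the configured replacement text."""
    for keyword, repl in keyword_repl.items():
        if keyword in summary:
            summary = summary.replace(keyword, repl)
    for pattern, repl in pattern_repl:
        summary = re.sub(pattern, repl, summary)
    return summary
```

This turns colloquial or subjective phrasing into the more formal wording the replacement table prescribes.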
Optionally, after the step of extracting key sentences from the comment data of the plurality of target objects in the target area, the method further includes: filtering the key sentences according to a set third filtering rule, where the third filtering rule limits the length of the key sentences.
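A minimal sketch of the third filtering rule, with illustrative length bounds (the patent only says the rule limits sentence length):

```python
def filter_by_length(key_sentences, min_len=5, max_len=25):
    """Discard key sentences whose character length falls outside a configured
    range, so the generated abstract is neither too long nor too short."""
    return [s for s in key_sentences if min_len <= len(s) <= max_len]
```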
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a digest generation apparatus for a target object.
An apparatus for generating an abstract of a target object according to an embodiment of the present invention includes: an extraction and clustering module for extracting key sentences from comment data of a plurality of target objects in a target area, calculating the similarity between key sentences belonging to the same target object to obtain a first similarity, and clustering the key sentences according to the first similarity to obtain a basic classification of each target object in the target-object dimension; a secondary clustering module for calculating the similarity between first classification groups included in the basic classifications of different target objects to obtain a second similarity, and clustering the first classification groups according to the second similarity to obtain a final classification in the target-area dimension; and an abstract generation module for calculating the inverse document frequency of each second classification group contained in the final classification relative to the overall classification, calculating the score of each second classification group according to the inverse document frequency and a set weight, and selecting key sentences from the final classification as the abstract of the target object according to the scores.
Optionally, the abstract generation module is further configured to: count the total number of first classification groups contained in the basic classifications to obtain the number of all classification groups; sum the occurrence counts of all key sentences in a second classification group to obtain the occurrence count of that group; and take the number of all classification groups as the numerator and the occurrence count of the second classification group plus 1 as the denominator, then take the logarithm of the quotient to obtain the inverse document frequency.
Optionally, the abstract generation module is further configured to compute, according to the set weight, a weighted sum of the inverse document frequency and the occurrence count of the second classification group to obtain the score of the second classification group.
Optionally, the abstract generation module is further configured to: sort the second classification groups contained in the final classification in descending order of score and select the first K second classification groups, where K is a positive integer; and, within each of the first K second classification groups, sort the key sentences by their occurrence counts and select the key sentence with the highest occurrence count from each group.
Optionally, the extraction and clustering module is further configured to: treat each sentence contained in a piece of comment data as a node, calculate the similarity between nodes, and construct a node connection graph according to the similarities; and iteratively calculate node weights from the node connection graph and the similarities until the weights converge, then select the sentence corresponding to the highest-weighted node at convergence as a key sentence.
Optionally, the apparatus further comprises a preprocessing module for preprocessing each piece of comment data, where the preprocessing comprises filtering according to a set first filtering rule and merging, and the merging comprises: performing syntactic analysis on the sentences contained in the comment data to obtain their syntactic components; and, using a set symbol as a segmentation mark, judging the syntactic structure of the first clause after the symbol and merging clauses with non-subject-predicate and non-head structures upward. The extraction and clustering module is then further configured to extract key sentences from the preprocessed comment data.
Optionally, the apparatus further comprises an optimization module for performing sentiment analysis on the key sentences contained in the basic classification of each target object, and filtering the key sentences according to the sentiment analysis results and a set second filtering rule to obtain an optimized classification of each target object in the target-object dimension, where the second filtering rule retains only first classification groups in which positive key sentences outnumber negative ones, keeping only the positive key sentences. The secondary clustering module is then further configured to calculate the similarity between the classification groups contained in the optimized classifications of different target objects and cluster those groups according to the obtained similarity.
Optionally, the apparatus further comprises a correction module for judging whether the abstract contains a set keyword or matches a regular expression of a set format, and, if so, correcting the abstract with set replacement information.
Optionally, the apparatus further comprises a filtering module for filtering the key sentences, after they have been extracted from the comment data of the plurality of target objects in the target area, according to a set third filtering rule, where the third filtering rule limits the length of the key sentences.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided an electronic apparatus.
An electronic device according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for generating an abstract of a target object according to an embodiment of the invention.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable medium.
A computer-readable medium according to an embodiment of the present invention stores a computer program which, when executed by a processor, implements the method for generating an abstract of a target object according to an embodiment of the invention.
One embodiment of the above invention has the following advantages or benefits:
1. Key sentences are first clustered in the target-object dimension, the primary clustering result is then clustered a second time in the target-area dimension, and suitable key sentences are selected as the abstract in combination with the inverse document frequency, so that repeated sentences are reduced, a high-quality and differentiated abstract is generated, and the characteristics of the target object are better highlighted.
2. The inverse document frequency is calculated from the occurrence count of the second classification group, and the group's score is then computed with the set weights, which lowers the scores of key sentences with very high occurrence counts. Because the score is a weighted sum of the inverse document frequency and the occurrence count of the second classification group, the scores of very frequent key sentences are reduced while those of infrequent ones are raised, so the finally selected key sentences have intermediate occurrence counts, further ensuring the quality and distinctiveness of the generated abstract.
3. Key sentences are selected as the abstract based on both the group scores and the occurrence counts of the key sentences within the second classification groups, further highlighting the characteristics of the target object. A node connection graph is built from inter-node similarities, node weights are computed from it, and key sentences are selected from the comment data based on those weights, so key-sentence extraction is automatic and the extracted sentences reflect the core meaning of the comment data. Filtering and merging the comment data during preprocessing ensures that the subsequently processed data meets the data-format requirements of abstract generation.
4. Sentiment analysis is performed on the key sentences contained in the basic classifications, and the key sentences are filtered according to the results, so the generated abstract expresses positive sentiment and is more attractive to users. Set replacement information is used to correct insufficiently formal expressions in the abstract, making it more literary and standardized. The third filtering rule limits the length of the key sentences, preventing the generated abstract from being too long or too short, with good flexibility.
Further effects of the above optional implementations will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
fig. 1 is a schematic diagram of the main steps of a method for generating an abstract of a target object according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of the main flow of a method for generating an abstract of a target object according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of the head, middle and tail classifications of comment data in the target-object dimension according to the second embodiment of the invention;
fig. 4 is a schematic diagram of the main flow of key sentence extraction according to the second embodiment of the present invention;
fig. 5 is a schematic diagram of the main flow of a method for generating an abstract of a target object according to a third embodiment of the present invention;
FIG. 6 is a diagram illustrating a syntactic analysis result according to the third embodiment of the present invention;
FIG. 7 is a schematic diagram of the main modules of an apparatus for generating an abstract of a target object according to an embodiment of the present invention;
FIG. 8 is an exemplary system architecture diagram to which embodiments of the present invention may be applied;
FIG. 9 is a schematic diagram of a computer system suitable for implementing the electronic device of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, some enterprises operate hundreds of thousands of hotels, and the hotel descriptions displayed in the client's hotel list are insufficient. To enrich the list content, highlight each hotel's characteristics, and let users grasp a hotel's features more quickly and comprehensively, descriptive information about each hotel needs to be extracted and displayed to users. Because the number of hotels is large and writing descriptions by hand would be very laborious, a way of generating hotel abstracts in batches that highlights each hotel's selling points is needed.
With the existing approaches mentioned in the background, the extracted abstract is too rigid and the sentence homogenization problem is serious. Homogenization refers to a high degree of repetition, as with frequently occurring similar sentences such as "the hotel is clean and sanitary" or "transportation is convenient". These approaches also extract subjective sentences or sentences containing specific words, such as "the palace is close to me" or "the bed is comfortable and I like it". In addition, the length of the sentences in a prior-art abstract cannot be controlled; sentences may be too long or too short, making the abstract unattractive to users.
To solve the above problems, the present invention provides a method for generating an abstract of a target object. While extracting the hotel's characteristic information, the method greatly reduces high-frequency sentences and mitigates the homogenization problem; at the same time, the number of words in each sentence can be controlled, preventing sentences from being too long or too short.
In addition, key sentences extracted from users' comment data may contain wrongly written characters. Sentences containing such errors occur far less often across a target area (such as a province or city), so by setting a threshold on the occurrence count within the target area when generating the abstract, wrongly written characters are kept out of the abstract. The following embodiments illustrate this.
Fig. 1 is a schematic diagram illustrating main steps of a method for generating a summary of a target object according to a first embodiment of the present invention. As shown in fig. 1, a method for generating an abstract of a target object according to a first embodiment of the present invention mainly includes the following steps:
step S101: extracting key sentences of comment data of a plurality of target objects in a target area, calculating the similarity between the key sentences belonging to the same target object to obtain a first similarity, clustering the key sentences according to the first similarity to obtain a basic classification of the target objects under dimensionality.
Multiple pieces of comment data are obtained for each of the target objects in the target area, and the key sentence of each piece is extracted in the same way. The similarity between any two key sentences belonging to the same target object is then calculated with a text-similarity method, and key sentences whose similarity exceeds a first threshold are gathered into one class, yielding the basic classification of each target object in the target-object dimension. The basic classification of each target object comprises at least one first classification group.
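The threshold-based clustering described here can be sketched as a greedy single pass; the patent does not mandate a specific clustering algorithm, so this is only one plausible realization:

```python
def cluster_by_threshold(sentences, similarity, threshold=0.5):
    """Greedy single-pass clustering: a sentence joins the first existing
    group whose representative it resembles above `threshold`; otherwise it
    starts a new group."""
    groups = []
    for sentence in sentences:
        for group in groups:
            if similarity(sentence, group[0]) > threshold:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups
```

Each resulting group corresponds to a first classification group; the same routine applied to group representatives with a second threshold would realize the second-level clustering of step S102.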
The following explains how the key sentence of one piece of comment data is extracted: each sentence contained in the comment data is treated as a node, the similarities between nodes are calculated to construct a node connection graph, the node weights are then iteratively calculated from the node connection graph and the similarities until they converge, and the sentence corresponding to the highest-weighted node at convergence is selected as the key sentence of that piece of comment data.
Step S102: calculating the similarity between first classification groups included in the basic classifications of different target objects to obtain a second similarity, and clustering the first classification groups according to the second similarity to obtain the final classification in the target-area dimension.
The similarity between the first classification groups contained in the basic classifications of any two target objects in the target area is calculated with a text-similarity algorithm, and first classification groups whose similarity exceeds a second threshold are grouped into one class, yielding the final classification in the target-area dimension. Note that each first classification group of each target object in the target area is compared against each first classification group of every other target object.
Step S103: calculating the inverse document frequency of each second classification group contained in the final classification relative to the overall classification, calculating the score of each second classification group according to the inverse document frequency and a set weight, and selecting key sentences from the final classification as the abstract of the target object according to the scores.
The final classification comprises at least one second classification group. The inverse document frequency of each second classification group relative to the overall classification is calculated, and the inverse document frequency and the occurrence count of each second classification group are then combined in a weighted sum according to the set weight to obtain the group's score. The second classification groups contained in the final classification are sorted in descending order of score, the first K groups are selected, and the key sentence with the highest occurrence count is selected from each of the first K groups as part of the abstract of the target object, which better highlights the characteristics of the target object.
Because the abstract of the target object is derived from users' comment data, the first embodiment processes the comment data directly, and the generated abstract can suffer quality problems when the data does not meet the format requirements of abstract generation. To solve this, the comment data needs to be preprocessed, which is described in detail in the second embodiment below.
Fig. 2 is a schematic diagram of the main flow of a method for generating an abstract of a target object according to a second embodiment of the present invention. As shown in fig. 2, the method mainly includes the following steps:
Step S201: obtaining comment data of a plurality of target objects in the target area, and preprocessing the comment data. The comment data of all target objects in the target area is preprocessed so as to meet the data format requirements set for abstract generation. In this embodiment, the target object may be any object for which comment data exists, such as a hotel, an article, or a service.
In an embodiment, the preprocessing includes filtering and merging of the comment data. The filtering is implemented based on a set first filtering rule and comprises pre-filtering before the merging and secondary filtering after the merging. Pre-filtering filters out comment data that does not meet the requirements; secondary filtering filters the merged result of each piece of comment data again.
After the comment data is pre-filtered, each piece of the obtained comment data needs to be merged. During merging, each sentence in the comment data may be analyzed with a syntactic parsing tool to obtain its syntactic components (i.e., subject, predicate, object, complement, etc.); then, a set symbol (such as a punctuation mark) is used as a segmentation identifier, the syntactic structure of the first sentence after the set symbol is judged, and sentences whose structure is not a subject-predicate or modifier-head structure are merged upward into the preceding sentence.
Step S202: extracting the key sentence of each piece of preprocessed comment data. The principle of key sentence extraction is as follows: define a weighted scoring standard, score each sentence, and take the sentence with the highest score as the key sentence. This embodiment can be realized based on the TextRank algorithm. For a specific implementation of this step, refer to the description of fig. 4.
Step S203: calculating the similarity between key sentences belonging to the same target object to obtain the first similarity between the key sentences contained in each target object. Natural language processing often involves measuring the similarity between two texts, for example to deduplicate a large-scale corpus as preprocessing, or to find aliases of an entity name.
The similarity between two key sentences is calculated as follows: segment the two key sentences into words, calculate the term frequency-inverse document frequency of each word, and finally calculate the angle between the two key sentences according to the term frequency-inverse document frequency and the cosine similarity algorithm. The smaller the angle, the higher the similarity of the two key sentences. An example follows.
Assume the key sentences under a certain target object are as follows. Sentence A: "the service is very enthusiastic"; sentence B: "the service is particularly enthusiastic". First, sentence A and sentence B are segmented into words. The segmentation result of sentence A is: { service, very, enthusiasm }, and the segmentation result of sentence B is: { service, special, enthusiasm }. The word set obtained by segmentation is therefore: { service, very, enthusiasm, special }.
Then the Term Frequency-Inverse Document Frequency (TF-IDF) of each word is calculated. Specifically, the term frequency of each word of the word set is calculated in sentence A and in sentence B. The term frequencies for sentence A are: service 1, very 1, enthusiasm 1, special 0; the term frequencies for sentence B are: service 1, very 0, enthusiasm 1, special 1. The term frequencies are then vectorized: the term frequency vector of sentence A is (1, 1, 1, 0) and the term frequency vector of sentence B is (1, 0, 1, 1).
Finally, the cosine of the angle between the two term frequency vectors is calculated from the vectorization result. The larger the cosine value, the higher the similarity. The cosine of the angle between two vectors is calculated according to the following formula:
cos θ = (a · b) / (|a| × |b|) = (Σ ai × bi) / (√(Σ ai²) × √(Σ bi²))  (formula 1)
In this formula, cos θ is the cosine of the angle between vector a and vector b.
By this formula, the cosine of the angle between the two term frequency vectors of the example is:
cos θ = (1×1 + 1×0 + 1×1 + 0×1) / (√(1²+1²+1²+0²) × √(1²+0²+1²+1²)) = 2/3 ≈ 0.667
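The cosine calculation above can be sketched in a few lines of Python; the vectors and the expected value 2/3 come from the worked example, while the function name is ours:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Term frequency vectors for the two example sentences over the
# word set { service, very, enthusiasm, special }:
sentence_a = [1, 1, 1, 0]  # "the service is very enthusiastic"
sentence_b = [1, 0, 1, 1]  # "the service is particularly enthusiastic"
print(cosine_similarity(sentence_a, sentence_b))  # 2/3
```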
Step S204: clustering the key sentences belonging to the same target object according to the first similarity to obtain the basic classification in the target object dimension. For clustering, a similarity threshold, referred to herein as the first threshold, may be preset; in an embodiment, the first threshold may be set to 0.6, 0.7, 0.8, etc. Taking a first threshold of 0.6 as an example, key sentences whose first similarity is greater than 0.6 are clustered into one class, and the key sentence with the most occurrences in each class is taken as the group name of that classification group, thereby obtaining the basic classification of the target object in the target object dimension.
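The clustering in step S204 can be sketched as follows; the greedy seed-based strategy and the token-overlap similarity used below are our assumptions, since the text does not fix a specific clustering algorithm:

```python
from collections import Counter

def greedy_cluster(sentences, similarity, threshold=0.6):
    """Cluster key sentences: a sentence joins the first cluster whose seed
    (first member) it resembles above `threshold`, otherwise it starts a new
    cluster. Repeated sentences raise their cluster's occurrence counts."""
    clusters = []
    for s in sentences:
        for cluster in clusters:
            if similarity(s, cluster[0]) > threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    # the most frequent key sentence in each cluster becomes the group name
    return [(Counter(c).most_common(1)[0][0], c) for c in clusters]
```

With the first threshold set to 0.6 as in the text, only sentence pairs whose first similarity exceeds 0.6 end up in the same first classification group.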
Step S205: calculating the similarity between the first classification groups in the basic classifications belonging to different target objects to obtain the second similarity. The second similarity is the similarity between each first classification group of each target object in the target area and each first classification group of the other target objects. The similarity calculation in this step is the same as in step S203 and is not repeated here.
Step S206: clustering the first classification groups belonging to different target objects according to the second similarity to obtain the final classification in the target area dimension. After the second similarity is calculated, a similarity threshold (referred to herein as the second threshold, for example 0.5) is preset, and classification groups whose second similarity is greater than the second threshold are aggregated into one class, thereby obtaining the final classification in the target area dimension. The final classification includes at least one second classification group.
Step S207: calculating the inverse document frequency of each second classification group of the final classification relative to the overall classification, so as to calculate the score of each second classification group according to the inverse document frequency and set weights. To calculate the inverse document frequency, the total number of classification groups and the occurrence count of each second classification group need to be counted; the inverse document frequency of each second classification group relative to the overall classification can then be calculated according to formula 2.
The inverse document frequency of a second classification group relative to the overall classification is calculated as follows:
IDF of the second classification group = log(total number of classification groups / occurrence count of the second classification group)  (formula 2)
In this formula, the total number of classification groups is the sum of the numbers of first classification groups included in the basic classifications, and the occurrence count of the second classification group is the sum of the occurrence counts of all key sentences in that second classification group of the final classification.
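Formula 2 can be sketched directly; whether any smoothing is applied to the logarithm is not stated in the text, so none is added here:

```python
import math

def group_idf(total_group_count, group_occurrences):
    """Inverse document frequency of a second classification group relative
    to the overall classification: log of (total number of classification
    groups / occurrence count of this second classification group)."""
    return math.log(total_group_count / group_occurrences)
```

A head group that occurs very often gets a low IDF, while a rare group gets a high one, which is exactly the balancing behavior the text relies on.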
The score of a second classification group is calculated as follows:
score of the second classification group = occurrence count of the second classification group × weight 1 + IDF of the second classification group × weight 2  (formula 3)
In this formula, weight 1 + weight 2 = 1 and weight 1 > weight 2; specific values may be set as weight 1 = 0.7 and weight 2 = 0.3 (values obtained by experiment).
Step S208: determining the abstract of the target object according to the score of each second classification group of the final classification and the occurrence count of each key sentence in the second classification group. The second classification groups of the final classification are sorted in descending order of score, and the first K groups are selected. The key sentences in each of the K second classification groups are then sorted by occurrence count, and the key sentence with the most occurrences in each group is selected as the abstract of the target object. The value of K is user-defined.
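Steps S207-S208 combine into a short selection routine. The 0.7/0.3 weights and the top-K rule come from the text; the data layout (a list of dicts) is our assumption:

```python
from collections import Counter

def select_summary(groups, k=2, w1=0.7, w2=0.3):
    """Score each second classification group with formula 3
    (occurrences * w1 + IDF * w2), keep the top-k groups by score, and take
    the most frequent key sentence of each as the abstract."""
    scored = sorted(groups,
                    key=lambda g: g["occurrences"] * w1 + g["idf"] * w2,
                    reverse=True)
    return [Counter(g["sentences"]).most_common(1)[0][0] for g in scored[:k]]
```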
Referring to fig. 3, the classification groups of the comment data in the target object dimension can be divided by occurrence frequency into three parts: the head, the middle, and the bottom. As shown in fig. 3, "sanitary clean" has the highest proportion and belongs to the head, while the bottom contains the distinctive features of the target object.
A "problem comment" of a target object would theoretically appear in the bottom or the middle; if enlarged to the target area dimension, its frequency becomes very low, so it can only appear in the bottom classification. The head classification is handled by the inverse document frequency: groups at the head have a high occurrence count but a low inverse document frequency, while groups at the tail have a high inverse document frequency, so according to formula 3 the selected data comes from the middle. The finally obtained summary information is thus taken from the middle, which ensures the high quality and differentiation of the summary information.
A "problem comment" here refers to a sentence with wrongly written characters, or a sentence containing a word specific to the target object. For example, "the hotel is a safe door next to the hotel", or "the fate is very close" in fig. 3, may occur very frequently in the comment data of one hotel, but the occurrence count is diluted once the comment is placed in the provincial dimension.
Fig. 4 is a schematic main flow diagram of key sentence extraction according to the second embodiment of the present invention. As shown in fig. 4, the implementation method of extracting a key sentence (i.e., step S202) in the second embodiment of the present invention, taking extracting a key sentence of a piece of preprocessed comment data (hereinafter referred to as a given text) as an example, mainly includes the following steps:
step S401: each sentence contained in the given text is regarded as a node, and the similarity between the nodes is calculated. The formula for measuring the similarity between nodes is as follows:
Similarity(Si, Sj) = |{ ωk : ωk ∈ Si and ωk ∈ Sj }| / (log(|Si|) + log(|Sj|))  (formula 4)
In this formula, Si and Sj represent the ith and jth sentences of the given text, ωk represents a word in a sentence, log(|Si|) and log(|Sj|) are the logarithms of the word counts of the ith and jth sentences, and the numerator is the number of words that appear in both the ith and jth sentences.
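The sentence-similarity formula above translates directly into code; the guard for one-word sentences (whose logarithm is zero) is our addition:

```python
import math

def sentence_similarity(words_i, words_j):
    """TextRank sentence similarity: the number of words shared by the two
    sentences, divided by the sum of the logs of their word counts."""
    common = len(set(words_i) & set(words_j))
    denominator = math.log(len(words_i)) + math.log(len(words_j))
    return common / denominator if denominator > 0 else 0.0
```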
Step S402: constructing a node connection graph according to the similarity between nodes. A connection threshold is preset; if the similarity between two nodes is greater than or equal to the connection threshold, the two corresponding sentences are considered similar and an undirected weighted edge exists between the nodes; if the similarity is less than the connection threshold, there is no edge between the nodes. The node connection graph may be represented as G = (V, E, W), where V is the set of nodes, E is the set of edges between nodes, and W is the set of weights on the edges.
Step S403: iteratively calculating the weight of each node until convergence. The node weight is calculated with the following formula:
WS(Vi) = (1 − d) + d × Σ_{Vj ∈ In(Vi)} [ ωji / Σ_{Vk ∈ Out(Vj)} ωjk ] × WS(Vj)  (formula 5)
In this formula, Vi and Vj represent the ith and jth nodes of the node set V; WS(Vi) represents the weight of node Vi after the last iteration; d is the damping coefficient, typically set to 0.85; In(Vi) is the set of nodes pointing to Vi; Out(Vj) is the set of nodes that Vj points to; ωji represents the similarity between node Vi and node Vj.
When the node weights are calculated in the first iteration, user-defined initial weights need to be set for the nodes. After multiple iterations, the weight of each node tends to be stable, at which point the computation is considered to have converged.
Step S404: sorting the node weights, and taking the sentence corresponding to the node with the highest weight as the key sentence. Suppose a given text includes the two sentences "the hotel is well served" and "this is rarely found". "The hotel is well served" is representative and important, so it receives a high weight; "this is rarely found" carries little meaning and is unimportant, so it receives a low weight. The key sentence finally extracted is therefore "the hotel is well served".
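Steps S401-S404 can be sketched end to end: build a similarity matrix, iterate the node-weight formula until the weights stabilize, and pick the top node. The initial weight of 1.0, the iteration cap, and the tolerance are our choices:

```python
def textrank_weights(sim, d=0.85, max_iter=200, tol=1e-6):
    """Iterate the TextRank node-weight formula over an undirected weighted
    graph given as a similarity matrix (sim[i][j] = edge weight between
    nodes i and j, zero when there is no edge) until convergence."""
    n = len(sim)
    ws = [1.0] * n                       # user-defined initial weights
    out_sum = [sum(row) for row in sim]  # total outgoing weight per node
    for _ in range(max_iter):
        new_ws = [
            (1 - d) + d * sum(
                sim[j][i] / out_sum[j] * ws[j]
                for j in range(n)
                if sim[j][i] > 0 and out_sum[j] > 0
            )
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_ws, ws)) < tol:
            return new_ws  # converged
        ws = new_ws
    return ws
```

The sentence whose node carries the highest converged weight is then taken as the key sentence of the comment.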
The above embodiment restricts the format of the comment data, but some comment data are negative evaluations; if the summary is generated based on negative evaluations, it cannot attract users. To solve this problem, in the third embodiment, sentiment analysis is performed on the extracted key sentences to filter out sentences with negative sentiment. The details are described below.
Fig. 5 is a schematic main flow diagram of a digest generation method for a target object according to a third embodiment of the present invention. As shown in fig. 5, the method for generating an abstract of a target object according to the third embodiment of the present invention mainly includes the following steps:
Step S501: obtaining comment data of a plurality of target objects in the target area, and preprocessing the comment data. In this embodiment, the target object is a hotel. The preprocessing of the hotel comment data comprises pre-filtering, merging, and secondary filtering.
Pre-filtering is used to filter out comment data that does not meet the requirements, such as short sentences, repeated sentences, and sentences containing designated characters. A short sentence may be a sentence whose total word count is smaller than a predetermined number, or a sentence whose word count after removing designated characters is smaller than a predetermined number. The designated characters can be user-defined, such as words belonging to the filter library, punctuation marks, mathematical symbols, and special symbols. The words to be filtered are stored in the filter library.
A repeated sentence may be a sentence containing repeated words, or a sentence whose number of repeated words exceeds a predetermined value, for example "good good good this hotel is very good" or "this hotel this hotel is average". A sentence containing designated characters may be a sentence containing a filter word, or a sentence in which punctuation marks, mathematical symbols, or special symbols appear consecutively more than a designated number of times, for example "the environment is good~~~~" or "can also be----".
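A minimal sketch of the pre-filtering rules, assuming illustrative thresholds and an English stand-in for the filter library (the real library and thresholds are maintained per scenario):

```python
import re

FILTER_WORDS = {"spamword"}  # stand-in for the filter library of the text

def prefilter(sentence, min_len=4):
    """Return True if a comment sentence survives pre-filtering: not too
    short, no long run of one character, no filter-library word, and no run
    of punctuation/special symbols. Thresholds here are illustrative."""
    if len(sentence) < min_len:
        return False                        # short sentence
    if re.search(r"(.)\1{3,}", sentence):
        return False                        # e.g. "gooooood"
    if any(word in sentence for word in FILTER_WORDS):
        return False                        # contains a designated filter word
    if re.search(r"[!?.,~\-]{3,}", sentence):
        return False                        # e.g. "----" or "~~~~"
    return True
```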
Each piece of comment data obtained after pre-filtering is then merged. In practical application, syntactic analysis tools make errors when identifying punctuation marks, so the embodiment of the present invention provides a preferred merging method. Before syntactic analysis, the punctuation marks of each sentence in the comment data are replaced with spaces, and the replaced sentences are parsed to obtain the syntactic components of each sentence (i.e., subject, predicate, object, attributive, adverbial, complement, etc.); then, with the space as the segmentation identifier, the syntactic structure of the first sentence after each space is judged, and sentences whose structure is not a subject-predicate or modifier-head structure are merged upward. An example follows.
Suppose a piece of comment data obtained after pre-filtering is: "The staff's service is good; even non-guests borrowing the lobby restroom are attentively guided. It would be nice if the windows had screens; this is rarely achieved." After all punctuation marks of this text are replaced with spaces, syntactic analysis is performed, with the result shown in fig. 6.
Then, with the space as the segmentation mark, the sentences obtained in fig. 6 are merged upward. Taking the first 6 lines of fig. 6 as an example: line 5 is a space, but line 6 is labeled as an adverbial-head structure rather than a subject-predicate or modifier-head structure, so it needs to be merged upward. The sentence obtained after merging is therefore: "The staff's service is good; even non-guests borrowing the lobby restroom are attentively guided".
The secondary filtering filters the merged result of each piece of comment data again. For example, repeated words are filtered from the merged sentences, or sentences with a word count of 2 or less are filtered out. Continuing the above example, the results after secondary filtering are as follows:
The staff's service is good; even non-guests borrowing the lobby restroom are attentively guided
It would be nice if the windows had screens
This is rarely achieved
Step S502: extracting the key sentence of each piece of preprocessed comment data. For a specific implementation of this step, refer to the description of fig. 4.
Step S503: calculating the similarity between the key sentences belonging to the same target object to obtain the first similarity between the key sentences contained in each target object. The implementation of this step is the same as step S203 and is not repeated here.
Step S504: clustering the key sentences belonging to the same target object according to the first similarity to obtain the basic classification in the target object dimension. Key sentences whose first similarity is greater than the first threshold are clustered into one class, and the key sentence with the most occurrences in each class is taken as the group name of the classification group, thereby obtaining the basic classification in the target object dimension. An example follows.
Suppose the key sentences under a certain hotel include: "few taxis", "high cost performance", "good cost performance", "service in place", and "convenience stores nearby". After clustering, the basic classification corresponding to the hotel can be obtained; the specific result is shown in table 1. As can be seen from table 1, the basic classification includes 4 first classification groups.
Table 1 shows the basic classification results corresponding to the hotel
[Table 1 is presented as an image in the original publication; it lists the 4 first classification groups of the hotel.]
Step S505: performing sentiment analysis on the key sentences in the basic classification corresponding to each target object, and filtering according to a set second filtering rule to obtain the optimized classification in the target object dimension. Sentiment analysis is performed on each key sentence in the basic classification of each target object to judge its sentiment tendency; the key sentences are then filtered according to the set second filtering rule to obtain the optimized classifications corresponding to the plurality of target objects, and the occurrence count of each first classification group is counted.
The second filtering rule is: if the number of key sentences with negative sentiment in a first classification group is greater than or equal to the number of key sentences with positive sentiment, the first classification group is discarded; if the number of key sentences with negative sentiment in a first classification group is less than the number with positive sentiment, only the key sentences with positive sentiment in that group are kept.
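The second filtering rule reads directly as code; the (sentence, polarity) pairing is our assumed input format, with polarity +1 for positive and -1 for negative sentiment:

```python
def apply_second_filter(group):
    """Apply the second filtering rule to one first classification group,
    given as a list of (key_sentence, polarity) pairs: discard the group when
    negative key sentences are at least as numerous as positive ones,
    otherwise keep only the positive key sentences."""
    positives = [s for s, polarity in group if polarity > 0]
    negatives = [s for s, polarity in group if polarity < 0]
    if len(negatives) >= len(positives):
        return None  # the whole first classification group is discarded
    return positives
```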
There are two ways to perform sentiment analysis on the key sentences: one is lexicon-based, and the other is based on a machine learning algorithm. The machine learning approach requires preparing training and test sets and training a classification model with a classification algorithm; sentiment analysis can be performed once the model is trained. The lexicon-based implementation is described below as an example.
A lexicon is constructed in advance for the actual scenario of hotel comment data and includes the following dictionaries: a stop-word dictionary, a positive evaluation word dictionary, a negative evaluation word dictionary, a degree word dictionary, and a negation word dictionary. Stop words are words such as: and, get, between, etc.; positive evaluation words are words such as: cheap, clean, beautiful, good value for money, etc.; negative evaluation words are words such as: dirty, bad, etc.; degree words are words such as: very, special, fair, general, etc.; negation words are words such as: hardly, not, etc.
Each positive evaluation word, negative evaluation word, and degree word has a score: the score of a positive evaluation word is 1, the score of a negative evaluation word is -1, and the scores of the degree words are, for example, 3.0, 2.0, 0.8, 0.5, and 0.5 (corresponding, in order, to the degree words listed above).
Based on the lexicon, the sentiment tendency of a key sentence is calculated as follows: segment the key sentence into words, calculate the sentiment score of the key sentence from the segmented words, and obtain the sentiment tendency from the score. The sentiment score can be calculated as:
sentiment score = (-1)^(number of negation words) × degree word score × evaluation word score  (formula 6)
Taking "cost performance is high" as an example: the number of negation words is 0, there is no degree word so the degree score defaults to 1, and the score of the evaluation word "high" is 1. The sentiment score is therefore 1 × 1 = 1, and the key sentence has positive sentiment.
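Formula 6 sketched with a toy English lexicon standing in for the dictionaries above (the word lists and scores here are illustrative, not the patent's):

```python
DEGREE_WORDS = {"very": 3.0, "special": 2.0, "fair": 0.8, "general": 0.5}
POSITIVE_WORDS = {"cheap", "clean", "beautiful", "high"}
NEGATIVE_WORDS = {"dirty", "bad"}
NEGATION_WORDS = {"not", "hardly"}

def sentiment_score(tokens):
    """Formula 6: (-1)^(number of negation words) * degree word score *
    evaluation word score. A missing degree word contributes a factor of 1."""
    negations = sum(1 for t in tokens if t in NEGATION_WORDS)
    degree = 1.0
    for t in tokens:
        degree *= DEGREE_WORDS.get(t, 1.0)
    evaluation = 0
    for t in tokens:
        if t in POSITIVE_WORDS:
            evaluation = 1
        elif t in NEGATIVE_WORDS:
            evaluation = -1
    return (-1) ** negations * degree * evaluation
```

A positive score marks a key sentence with positive sentiment; zero means no evaluation word was found.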
Step S506: calculating the similarity between the classification groups in the optimized classifications belonging to different target objects, and clustering the classification groups of the optimized classifications according to the similarity to obtain the final classification in the target area dimension. The similarity calculation in this step is the same as in step S203 and is not repeated here. An example follows.
Assume the target area is a province and the target object is a hotel. The province has N hotels, each with M pieces of comment data, giving N × M pieces of comment data in total. The N × M pieces of comment data are preprocessed, key-sentence-extracted, clustered, sentiment-analyzed, and filtered according to steps S501 to S505 to obtain X classification groups. Similarity calculation is then performed on the group names of the X classification groups, and if the similarity of two group names is greater than 0.5, they are clustered into one class.
For convenience of explanation, assume the province has 2 hotels. After the comment data of hotel 1 and hotel 2 are processed according to steps S501 to S505, the obtained optimized classification results are shown in table 2 and table 3.
Table 2 shows the optimized classification results of the review data of Hotel 1
[Table 2 is presented as an image in the original publication.]
Table 3 shows the optimized classification results of the review data of Hotel 2
[Table 3 is presented as an image in the original publication.]
The similarity between "very high cost performance" and "very good cost performance" is calculated to be greater than 0.5, so these two classification groups can be grouped into one class; likewise, the similarity between "service is also in place" and "service is in place" is greater than 0.5, so these two classification groups can be grouped into one class. The final classification results in the provincial dimension are therefore shown in table 4.
Table 4 shows the final classification results of the review data in provincial dimensions
[Table 4 is presented as an image in the original publication.]
Step S507: calculating the inverse document frequency of each second classification group of the final classification relative to the overall classification, so as to calculate the score of each second classification group according to the inverse document frequency and set weights. The implementation of this step is the same as step S207 and is not repeated here. However, in this embodiment, because the optimized classification is obtained through the sentiment analysis of step S505, the total number of classification groups in formula 2 is changed to the sum of the numbers of classification groups included in the optimized classifications.
Step S508: determining the abstract of the target object according to the score of each second classification group of the final classification and the occurrence count of each key sentence in the second classification group. The implementation of this step is the same as step S208 and is not repeated here. With this embodiment, abstracts of hotels can be generated in batches, which greatly reduces the workload of operators and improves abstract generation efficiency.
In a preferred embodiment, if the key sentence with the most occurrences does not satisfy a set condition (the condition is generally a limit on word count), the key sentence with the next-most occurrences is selected, and so on, until the summary information of the target object is obtained.
In another preferred embodiment, the target objects are hotels of the same grade, such as XX chain hotels or XX express hotels of the same star rating under a certain company. When users review hotels at different locations in the same province or city, differences in location, service, price, and the like may lead to different experiences and ultimately to different comment data. Through the processing of steps S501 to S508, high-quality, differentiated abstracts can be generated for these hotels based on the comment data.
In an actual business scenario, when similarity calculation is performed on the key sentences in steps S203 and S503, it is preferable to use sentences that satisfy the business rules and from which auxiliary words have been deleted. Therefore, in a preferred embodiment, the key sentences extracted in steps S202 and S502 are first filtered with a preset third filtering rule, and similarity calculation is then performed on the filtered key sentences.
The third filtering rule is user-defined and may be: filtering out sentences that do not contain, or do not begin with, a subject-predicate or modifier-head relationship; filtering out trailing auxiliary words in sentences; filtering out sentences with length greater than 8 or less than 4; filtering out sentences containing sensitive words; etc. It should be noted that limiting the sentence length and filtering sensitive words belong to the business rules, and the retained sentence length is user-defined.
The sensitive words of this embodiment are manually maintained in a database by the user. The preprocessing of steps S201 and S501 performs only coarse-grained filtering on the comment data; the sensitive words here are used for finer-grained filtering. For the hotel application scenario, the sensitive words include bad, noisy, messy, mosquito, decoration, construction, coincidence, and the like.
For example, the key sentences extracted in step S202 are: "few taxis here", "high cost performance", "good cost performance", "good overall", "the service is good indeed", and "there are convenience stores around". After filtering according to the above rules, the key sentences obtained are: "few taxis" (the trailing auxiliary word deleted), "high cost performance", "good cost performance", "the service is good" (the trailing auxiliary word deleted), and "there are convenience stores around". The key sentence "good overall" is discarded because "good overall" is a sensitive word.
Because the abstract is extracted from users' comment data, some expressions may not be sufficiently formal. Therefore, in another preferred embodiment, after the abstract is obtained in step S208 or step S508, the abstract may be polished. Polishing mainly replaces keywords contained in the abstract with designated words so that the obtained summary information reads more like written language. The keywords are maintained internally by the system.
Specifically, it is judged whether the abstract contains a set keyword, or whether the abstract matches a regular expression in a set format; if the abstract contains a keyword or matches such a regular expression, the abstract is revised with the set replacement information.
For example, if the summary information contains keywords such as "quite" or "really", they may be replaced with "very"; if it contains keywords such as "right next to" or "right outside the door", they may be replaced with "nearby"; if it contains keywords such as "child", "son", "daughter", or "baby", the summary may be replaced directly with "parent-child services"; if the summary matches a regular expression such as ".*(train|motor car).*(around|beside).*", it may be replaced directly with "adjacent to the train station".
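The polishing step can be sketched with a small substitution table; the keyword pairs and the regular expression below are modeled on the examples in the text, while the real tables are maintained inside the system:

```python
import re

KEYWORD_REPLACEMENTS = {
    "quite": "very",
    "right next to": "nearby",
}
PATTERN_REPLACEMENTS = [
    # a sentence mentioning a train/motor car plus a proximity word
    # collapses to one fixed phrase
    (re.compile(r".*(train|motor car).*(around|beside).*"),
     "adjacent to the train station"),
]

def polish(summary):
    """Polish a generated abstract: replace a whole sentence that matches a
    set regular expression, otherwise substitute designated keywords."""
    for pattern, replacement in PATTERN_REPLACEMENTS:
        if pattern.fullmatch(summary):
            return replacement
    for keyword, replacement in KEYWORD_REPLACEMENTS.items():
        summary = summary.replace(keyword, replacement)
    return summary
```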
According to the abstract generation method for a target object of the embodiments of the present invention, the key sentences are first clustered in the target object dimension, the primary clustering result is then clustered a second time in the target area dimension, and appropriate key sentences are selected as the abstract in combination with the inverse document frequency. This reduces repeated sentences, generates high-quality and differentiated abstracts, and better highlights the characteristics of the target object.
Fig. 7 is a schematic diagram of main modules of a digest generation apparatus of a target object according to an embodiment of the present invention. As shown in fig. 7, the apparatus 700 for generating a summary of a target object according to an embodiment of the present invention mainly includes:
the extraction clustering module 701 is configured to extract key sentences of comment data of multiple target objects in a target region, calculate similarity between key sentences belonging to the same target object, obtain a first similarity, cluster the key sentences according to the first similarity, and obtain a basic classification of the target objects under dimensionality.
Multiple pieces of comment data of multiple target objects in the target area are obtained, the key sentences of each piece of comment data are extracted in the same manner, the similarity between any two key sentences belonging to the same target object is calculated using a text similarity algorithm, and key sentences whose similarity is greater than a first threshold are grouped into one class, thereby obtaining the basic classification under the target object dimension. The basic classification of each target object comprises at least one first classification group.
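The threshold grouping described above can be sketched as a greedy single-pass clustering. The `text_similarity` stand-in uses `difflib.SequenceMatcher`; the embodiment leaves the concrete text-similarity algorithm open, so this choice is only an assumption.

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    # Stand-in for the unspecified text-similarity algorithm (e.g. cosine
    # similarity over sentence vectors); SequenceMatcher keeps the sketch
    # dependency-free.
    return SequenceMatcher(None, a, b).ratio()

def cluster_sentences(sentences, threshold=0.6):
    """Greedy single-pass clustering: a sentence joins the first existing
    group whose representative sentence it resembles above the threshold;
    otherwise it starts a new group (a "first classification group")."""
    groups = []
    for s in sentences:
        for g in groups:
            if text_similarity(s, g[0]) > threshold:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups
```

The first-threshold value 0.6 is an illustrative placeholder; the patent only requires similarity "greater than a first threshold".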
The following explains how a key sentence is extracted from one piece of comment data: each sentence contained in the comment data is regarded as a node, the similarities between the nodes are calculated to construct a node connection graph, the weights of the nodes are then iteratively calculated from the node connection graph and the similarities until the weights converge, and the sentence corresponding to the node with the highest weight at convergence is selected as the key sentence of the comment data.
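This procedure corresponds to a TextRank-style ranking over the sentence graph. A compact sketch, again substituting `SequenceMatcher` for the unspecified similarity measure and using the conventional damping factor 0.85 (an assumption; the patent does not give iteration details):

```python
import itertools
from difflib import SequenceMatcher

def extract_key_sentence(sentences, d=0.85, iters=50, tol=1e-6):
    """TextRank-style selection: build a similarity graph over the sentences
    of one review, iterate node weights until convergence, and return the
    sentence whose node has the highest weight."""
    n = len(sentences)
    # Pairwise similarity matrix = edge weights of the node connection graph.
    sim = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        s = SequenceMatcher(None, sentences[i], sentences[j]).ratio()
        sim[i][j] = sim[j][i] = s
    weights = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j == i:
                    continue
                out = sum(sim[j])  # total edge weight leaving node j
                if out > 0:
                    rank += sim[j][i] / out * weights[j]
            new.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(new, weights)) < tol
        weights = new
        if converged:
            break
    return sentences[max(range(n), key=lambda i: weights[i])]
```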
The secondary clustering module 702 is configured to calculate the similarity between first classification groups contained in the basic classifications of different target objects to obtain a second similarity, and to cluster the first classification groups according to the second similarity to obtain a final classification under the target area dimension.
The similarity between the first classification groups contained in the basic classifications of any two target objects in the target area is calculated using a text similarity algorithm, and first classification groups whose similarity is greater than a second threshold are grouped into one class, yielding the final classification under the target area dimension. It should be noted that, when calculating the similarity, each first classification group of a target object is compared with each first classification group of every other target object in the target area.
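A sketch of this secondary clustering, treating the first sentence of each first classification group as its representative (a simplification of the group-to-group comparison; the similarity function is passed in because the embodiment does not fix one):

```python
def secondary_cluster(object_groups, similarity, threshold=0.5):
    """object_groups: {object_id: [group, ...]} where each group is a list
    of key sentences (a first classification group). Groups whose
    representatives are similar above the threshold are merged into one
    final (second) classification group for the whole target area."""
    final = []  # each entry: list of (object_id, sentence) pairs
    for obj, groups in object_groups.items():
        for g in groups:
            for f in final:
                # Compare this group's representative with the final
                # group's representative sentence.
                if similarity(g[0], f[0][1]) > threshold:
                    f.extend((obj, s) for s in g)
                    break
            else:
                final.append([(obj, s) for s in g])
    return final
```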
The abstract generating module 703 is configured to calculate the inverse document frequency of each second classification group contained in the final classification relative to the overall classification, calculate a score for the second classification group from the inverse document frequency and a set weight, and select key sentences from the final classification as the abstract of the target object according to the scores.
The final classification comprises at least one second classification group. The inverse document frequency of each second classification group relative to the overall classification is calculated, and the inverse document frequency and the occurrence count of the second classification group are then weighted and summed according to set weights to obtain the score of each second classification group. The second classification groups contained in the final classification are sorted in descending order of score, the first K second classification groups are selected, and the most frequent key sentence in each of these K groups is selected as part of the abstract of the target object, so that the characteristics of the target object are better highlighted.
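The scoring and selection can be sketched as follows. The weights `w_idf`/`w_count` and `k` are illustrative placeholders, and the IDF formula follows claim 2: the logarithm of the total number of classification groups divided by the group's occurrence count plus 1.

```python
import math
from collections import Counter

def score_and_select(final_groups, total_group_count, k=3,
                     w_idf=0.5, w_count=0.5):
    """Score each second classification group by a weighted sum of its
    inverse document frequency and its occurrence count, then take the
    most frequent key sentence from each of the top-K groups.
    Each group is a list of key sentences; its occurrence count is the
    total number of sentences it contains."""
    scored = []
    for group in final_groups:
        occurrences = len(group)
        # IDF per claim 2: log(total groups / (occurrences + 1)).
        idf = math.log(total_group_count / (occurrences + 1))
        scored.append((w_idf * idf + w_count * occurrences, group))
    scored.sort(key=lambda t: t[0], reverse=True)
    summary = []
    for _, group in scored[:k]:
        sentence, _ = Counter(group).most_common(1)[0]
        summary.append(sentence)
    return summary
```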
In addition, the apparatus 700 for generating a summary of a target object according to an embodiment of the present invention may further include: a pre-processing module, an optimization module, a modification module, and a filtering module (not shown in fig. 7), the functions of each of which are as described above.
From the above description, it can be seen that the key sentences are first clustered under the target object dimension, the primary clustering result is then clustered a second time under the target area dimension, and appropriate key sentences are selected as the abstract in combination with the inverse document frequency, thereby reducing repeated sentences, generating high-quality and differentiated abstracts, and better highlighting the characteristics of the target object.
Fig. 8 illustrates an exemplary system architecture 800 to which the abstract generation method or the abstract generation apparatus for a target object according to an embodiment of the present invention may be applied.
As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves to provide a medium for communication links between the terminal devices 801, 802, 803 and the server 805. Network 804 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server that provides various services, such as a background management server that processes comment data for an administrator. The background management server can extract key sentences from the comment data, perform clustering, inverse document frequency calculation, scoring, and other processing, and feed the processing results (such as the generated abstract) back to the terminal device.
It should be noted that the digest generation method for the target object provided in the embodiment of the present application is generally executed by the server 805, and accordingly, the digest generation apparatus for the target object is generally disposed in the server 805.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.
The electronic device of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the abstract generation method of a target object according to the embodiments of the present invention.
The computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements a digest generation method of a target object of an embodiment of the present invention.
Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the computer system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, the processes described above with respect to the main step diagrams may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method illustrated in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising an extraction clustering module, a secondary clustering module, and a summary generation module. For example, the extraction clustering module may be further described as a module for extracting key sentences from comment data of a plurality of target objects in a target area, calculating the similarity between key sentences belonging to the same target object to obtain a first similarity, and clustering the key sentences according to the first similarity to obtain a basic classification under the target object dimension.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: extract key sentences from comment data of a plurality of target objects in a target area, calculate the similarity between key sentences belonging to the same target object to obtain a first similarity, and cluster the key sentences according to the first similarity to obtain a basic classification under the target object dimension; calculate the similarity between first classification groups contained in the basic classifications of different target objects to obtain a second similarity, and cluster the first classification groups according to the second similarity to obtain a final classification under the target area dimension; and calculate the inverse document frequency of each second classification group contained in the final classification relative to the overall classification, calculate a score for the second classification group from the inverse document frequency and a set weight, and select key sentences from the final classification as the abstract of the target object according to the scores.
According to the technical scheme of the embodiments of the present invention, the key sentences are first clustered under the target object dimension, the primary clustering result is then clustered a second time under the target area dimension, and appropriate key sentences are selected as the abstract in combination with the inverse document frequency, thereby reducing repeated sentences, generating high-quality and differentiated abstracts, and better highlighting the characteristics of the target object.
This product can execute the method provided by the embodiments of the present invention, and has the functional modules for executing the method and the corresponding beneficial effects. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present invention.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for generating a summary of a target object, comprising:
extracting key sentences from comment data of a plurality of target objects in a target area, calculating the similarity between key sentences belonging to the same target object to obtain a first similarity, and clustering the key sentences according to the first similarity to obtain a basic classification under the target object dimension;
calculating the similarity between first classification groups included in basic classifications belonging to different target objects to obtain a second similarity, and clustering the first classification groups according to the second similarity to obtain a final classification under a target region dimension;
and calculating the inverse document frequency of a second classification group contained in the final classification relative to the overall classification, calculating the score of the second classification group according to the inverse document frequency and the set weight, and selecting a key sentence from the final classification as the abstract of the target object according to the score.
2. The method of claim 1, wherein calculating the inverse document frequency of the second classification group contained in the final classification relative to the overall classification comprises:
counting the sum of the numbers of first classification groups contained in the basic classifications to obtain a total number of classification groups;
counting the sum of the occurrence counts of all key sentences in the second classification group to obtain the occurrence count of the second classification group;
and performing a logarithmic operation with the total number of classification groups as the numerator and the occurrence count of the second classification group plus 1 as the denominator to obtain the inverse document frequency.
3. The method of claim 2, wherein calculating the score of the second classification group according to the inverse document frequency and the set weight comprises:
weighting and summing the inverse document frequency and the occurrence count of the second classification group according to the set weight to obtain the score of the second classification group.
4. The method of claim 1, wherein selecting key sentences from the final classification based on the score comprises:
sorting the second classification groups contained in the final classification in descending order of score and selecting the first K second classification groups, wherein K is a positive integer;
and sorting the key sentences in each of the first K second classification groups according to their occurrence counts in the group, and selecting the key sentence with the highest occurrence count from each of the first K second classification groups.
5. The method of claim 1, wherein extracting key sentences of comment data of a plurality of target objects within a target area comprises:
each sentence contained in the comment data is regarded as a node, the similarity among the nodes is calculated, and a node connection graph is constructed according to the similarity;
and iteratively calculating the weights of the nodes according to the node connection graph and the similarities until the weights converge, and selecting the sentence corresponding to the node with the highest weight at convergence as the key sentence.
6. The method of claim 1, wherein the step of extracting key sentences of the comment data of the plurality of target objects within the target area is preceded by the method further comprising:
preprocessing the plurality of pieces of comment data respectively; the preprocessing comprises filtering processing and merging processing which are carried out according to a set first filtering rule, and the merging processing comprises the following steps:
respectively carrying out syntactic analysis on sentences contained in the comment data to obtain syntactic components of the sentences;
taking a set symbol as a segmentation identifier, judging the syntactic structure of the first sentence following the symbol, and merging sentences that have neither a subject-predicate structure nor a head-word structure upward into the preceding sentence;
the method for extracting the key sentences of the comment data of the plurality of target objects in the target area comprises the following steps: and extracting key sentences of the preprocessed comment data.
7. The method according to any of claims 1 to 6, characterized in that the step of calculating the similarity between the first classification groups comprised by the underlying classifications assigned to the different target objects is preceded by the method further comprising:
performing sentiment analysis on key sentences contained in the basic classifications corresponding to the target objects respectively;
filtering the key sentences according to sentiment analysis results and a set second filtering rule to obtain an optimized classification under the target object dimension;
the second filtering rule is used for reserving a first classification group with the quantity of the key sentences with positive emotions larger than that of the key sentences with negative emotions, and the reserved key sentences are the key sentences with positive emotions;
the calculating the similarity between the first classification groups included in the basic classifications belonging to different target objects to obtain a second similarity, and clustering the first classification groups according to the second similarity, comprising:
and calculating the similarity among the classification groups contained in the optimized classifications belonging to different target objects, and clustering the classification groups contained in the optimized classifications according to the obtained similarity.
8. The method according to any one of claims 1 to 6, characterized in that after the step of selecting a key sentence from the final classification as the summary of the target object according to the score, the method further comprises:
judging whether the abstract contains set keywords or whether a regular expression corresponding to the abstract is in a set format;
and if the key words are contained in the abstract or the regular expression corresponding to the abstract is in the set format, correcting the abstract by using set replacement information.
9. The method according to any one of claims 1 to 6, characterized in that after the step of extracting key sentences of comment data of a plurality of target objects within a target area, the method further comprises:
filtering the key sentences according to a set third filtering rule; wherein the third filtering rule is used for limiting the length of the key sentence.
10. An apparatus for generating a summary of a target object, comprising:
the extraction clustering module is used for extracting key sentences from comment data of a plurality of target objects in a target area, calculating the similarity between key sentences belonging to the same target object to obtain a first similarity, and clustering the key sentences according to the first similarity to obtain a basic classification under the target object dimension;
the secondary clustering module is used for calculating the similarity between first classification groups included in basic classifications belonging to different target objects to obtain a second similarity, and clustering the first classification groups according to the second similarity to obtain a final classification under a target region dimension;
and the abstract generating module is used for calculating the inverse document frequency of a second classification group contained in the final classification relative to the overall classification, calculating the score of the second classification group according to the inverse document frequency and the set weight, and selecting a key sentence from the final classification as the abstract of the target object according to the score.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202010161869.XA 2020-03-10 2020-03-10 Method and device for generating abstract of target object Pending CN113377949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010161869.XA CN113377949A (en) 2020-03-10 2020-03-10 Method and device for generating abstract of target object

Publications (1)

Publication Number Publication Date
CN113377949A true CN113377949A (en) 2021-09-10

Family

ID=77568683



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination