CN112434515A - Statement compression method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN112434515A
CN112434515A (application CN202011386421.4A)
Authority
CN
China
Prior art keywords
sentence, sentences, target, key, statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011386421.4A
Other languages
Chinese (zh)
Inventor
刘臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianmian Information Technology Shenzhen Co ltd
Original Assignee
Tianmian Information Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianmian Information Technology Shenzhen Co ltd filed Critical Tianmian Information Technology Shenzhen Co ltd
Priority to CN202011386421.4A priority Critical patent/CN112434515A/en
Publication of CN112434515A publication Critical patent/CN112434515A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/242 Dictionaries
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to data processing and discloses a sentence compression method comprising the following steps: performing spoken-language removal on the sentence to be compressed to obtain a target sentence set, and judging whether the number of sentences in the target sentence set is greater than a first threshold; when the number of sentences is judged to be greater than the first threshold, ranking the sentences in the target sentence set by importance, extracting a key sentence based on the ranking result, and judging whether the length of the key sentence is greater than a second threshold; and when the length of the key sentence is judged to be greater than the second threshold, extracting the stem words of the key sentence and splicing them to obtain the target sentence. The invention also provides a sentence compression apparatus, an electronic device, and a readable storage medium. The invention reduces labeling cost and ensures the semantic accuracy of the compressed sentence.

Description

Statement compression method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for compressing a statement, an electronic device, and a readable storage medium.
Background
Sentence compression is an important research direction in the field of natural language processing: it removes redundant information from sentences while retaining the main idea, which facilitates both human reading and machine recognition, and it can be applied in many fields such as abstract generation, question matching, and topic extraction.
At present, generative or extractive sentence compression methods are usually adopted. However, generative methods require a large amount of labeled corpora for supervised learning and are unsuitable when project scale and cost are limited, business data are scarce, or labeled data are lacking. Traditional extractive methods are sensitive to sentence length: when the sentence is long, their compression effect is not ideal and semantic information cannot be accurately preserved. Therefore, a sentence compression method is needed that reduces labeling cost while ensuring the semantic accuracy of the compressed sentence.
Disclosure of Invention
In view of the above, there is a need to provide a sentence compression method that reduces labeling cost and ensures the semantic accuracy of the compressed sentence.
The statement compression method provided by the invention comprises the following steps:
analyzing a statement compression request sent by a user based on a client, acquiring a to-be-compressed statement carried by the request, performing spoken language removal processing on the to-be-compressed statement to obtain a target statement set, and judging whether the number of sentences in the target statement set is greater than a first threshold value or not;
when the number of sentences in the target sentence set is judged to be larger than a first threshold value, sorting the importance of the sentences in the target sentence set, extracting key sentences based on a sorting result, and judging whether the sentence length of the key sentences is larger than a second threshold value;
and when the length of the key sentence is judged to be greater than the second threshold, extracting the stem words of the key sentence, and splicing the stem words to obtain the target sentence.
Optionally, the performing spoken language removal processing on the to-be-compressed statement includes:
acquiring a spoken-sentence dictionary from a first database, comparing each first clause in the sentence to be compressed with the spoken-sentence dictionary, and deleting any first clause that matches a sentence in the dictionary to obtain an initial sentence set;
performing word segmentation processing on the sentences in the initial sentence set to obtain a first word sequence;
recognizing the spoken words in the first word sequence based on a spoken word recognition model, and deleting the spoken words to obtain a second word sequence;
and splicing the words in the second word sequence according to the positions of the words in the sentence to be compressed to obtain a plurality of second clauses, and taking the set of the second clauses as a target sentence set.
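The four sub-steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the spoken-sentence dictionary, the whitespace tokenizer, and the spoken-word set are hypothetical stand-ins for the first database, the word-segmentation model, and the spoken-word recognition model.

```python
# Hypothetical stand-ins for the first database and the recognition model.
SPOKEN_SENTENCE_DICT = {"these i already know", "there is no way"}
SPOKEN_WORDS = {"um", "uh", "well", "like"}

def remove_spoken_language(sentence_to_compress: str) -> list[str]:
    # Step one: drop first clauses that match the spoken-sentence dictionary.
    clauses = [c.strip() for c in sentence_to_compress.split(".") if c.strip()]
    initial_set = [c for c in clauses if c.lower() not in SPOKEN_SENTENCE_DICT]
    target_set = []
    for clause in initial_set:
        # Step two: naive whitespace segmentation stands in for the real model.
        words = clause.split()
        # Step three: delete recognized spoken words.
        kept = [w for w in words if w.lower().strip(",") not in SPOKEN_WORDS]
        # Step four: re-splice the remaining words in their original order.
        if kept:
            target_set.append(" ".join(kept))
    return target_set

print(remove_spoken_language(
    "Um, I can borrow. These I already know. Wages come on the 15th."))
# → ['I can borrow', 'Wages come on the 15th']
```

A production version would replace the dictionary lookup and word filter with the database query and trained recognition model the claims describe.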
Optionally, the sorting the importance of the sentences in the target sentence set and extracting the key sentences based on the sorting result include:
combining each sentence in the target sentence set with other sentences pairwise to obtain a plurality of combination pairs;
calculating similarity values of two sentences of each combination pair in the plurality of combination pairs, and determining a similarity matrix corresponding to the target sentence set based on the similarity values;
and calculating the importance scores of the sentences in the target sentence set based on the similarity matrix, sequencing the sentences in the target sentence set according to the sequence of the importance scores from high to low, and taking the sentence with the highest sequence as a key sentence.
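The similarity matrix in the steps above can be built, for example, with cosine similarity over bag-of-words vectors. The tokenization below is a hypothetical simplification of the patent's word-segmentation step; the claims do not fix a particular similarity measure.

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    # Bag-of-words cosine similarity between two sentences.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(sentences: list[str]) -> list[list[float]]:
    # Pairwise similarities for every combination pair; symmetric with a
    # unit diagonal, matching the structure of Table 1 below.
    n = len(sentences)
    return [[cosine_sim(sentences[i], sentences[j]) for j in range(n)]
            for i in range(n)]

m = similarity_matrix(["i can borrow", "i wait for wages", "wages come monthly"])
print(round(m[0][0], 2))  # → 1.0 (each sentence is identical to itself)
```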
Optionally, after determining whether the number of sentences in the target sentence set is greater than a first threshold, the method further includes:
if the number of sentences in the target sentence set is judged to be less than or equal to a first threshold value, determining the sentence type of the sentence to be compressed, acquiring an extraction rule corresponding to the sentence type from a second database, extracting the sentences from the target sentence set based on the extraction rule, and splicing the extracted sentences to obtain a key sentence.
Optionally, the extracting the stem words of the key sentences includes:
performing word segmentation processing on the key sentence to obtain a third word sequence;
sequentially identifying the part of speech of each word in the third word sequence, determining a syntactic structure of the third word sequence based on the part of speech and a preset syntactic analysis strategy, and extracting the stem word in the third word sequence based on the syntactic structure.
Optionally, after determining whether the sentence length of the key sentence is greater than a second threshold, the method further includes:
and if the sentence length of the key sentence is judged to be less than or equal to a second threshold value, taking the key sentence as a target sentence.
In order to solve the above problem, the present invention also provides a sentence compression apparatus, comprising:
the analysis module is used for analyzing a statement compression request sent by a user based on a client, acquiring a statement to be compressed carried by the request, executing spoken language removal processing on the statement to be compressed to obtain a target statement set, and judging whether the number of the sentences in the target statement set is greater than a first threshold value or not;
the sorting module is used for sorting the importance of the sentences in the target sentence set when the number of the sentences in the target sentence set is judged to be larger than a first threshold value, extracting key sentences based on a sorting result, and judging whether the sentence length of the key sentences is larger than a second threshold value or not;
and the extraction module is used for extracting the stem words of the key sentence and splicing the stem words to obtain the target sentence when the length of the key sentence is judged to be greater than the second threshold.
Optionally, the sorting the importance of the sentences in the target sentence set and extracting the key sentences based on the sorting result include:
combining each sentence in the target sentence set with other sentences pairwise to obtain a plurality of combination pairs;
calculating similarity values of two sentences of each combination pair in the plurality of combination pairs, and determining a similarity matrix corresponding to the target sentence set based on the similarity values;
and calculating the importance scores of the sentences in the target sentence set based on the similarity matrix, sequencing the sentences in the target sentence set according to the sequence of the importance scores from high to low, and taking the sentence with the highest sequence as a key sentence.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a sentence compression program executable by the at least one processor, the sentence compression program being executed by the at least one processor to enable the at least one processor to perform the above sentence compression method.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having stored thereon a sentence compression program executable by one or more processors to implement the above sentence compression method.
Compared with the prior art, the invention first performs spoken-language removal on the sentence to be compressed to obtain a target sentence set; this removes spoken sentences and spoken words that carry no semantic information, achieving a preliminary compression. Second, when the number of sentences in the target sentence set is judged to be greater than a first threshold, the sentences are ranked by importance and a key sentence is extracted based on the ranking result; this further removes redundant information while retaining the semantic information of the sentence to be compressed. Finally, when the length of the key sentence is judged to be greater than a second threshold, the stem words of the key sentence are extracted and spliced to obtain the target sentence. The invention thus reduces labeling cost and ensures the semantic accuracy of the compressed sentence.
Drawings
Fig. 1 is a schematic flow chart of a sentence compression method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a sentence compressing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing a statement compression method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
The invention provides a statement compression method. Fig. 1 is a schematic flow chart of a sentence compression method according to an embodiment of the present invention. The method may be performed by an electronic device, which may be implemented by software and/or hardware.
In this embodiment, the statement compression method includes:
s1, analyzing a statement compression request sent by a user based on a client, acquiring a to-be-compressed statement carried by the request, executing spoken language removal processing on the to-be-compressed statement to obtain a target statement set, and judging whether the number of sentences in the target statement set is greater than a first threshold value.
The executing the spoken language removal processing on the statement to be compressed includes:
a11, acquiring a spoken-sentence dictionary from a first database, comparing each first clause in the sentence to be compressed with the spoken-sentence dictionary, and deleting any first clause that matches a sentence in the dictionary to obtain an initial sentence set;
a12, performing word segmentation processing on the sentences in the initial sentence set to obtain a first word sequence;
in this embodiment, the sentences in the initial sentence set may be segmented by using a statistical probability model or/and a segmentation method based on an N-gram language model.
A13, recognizing the spoken words in the first word sequence based on a spoken word recognition model, and deleting the spoken words to obtain a second word sequence;
in this embodiment, the spoken language word recognition model is a deep neural network model, and the deep neural network model recognizes part-of-speech tags of each word in the first word sequence, and rejects spoken languages (such as language, spoken words, and the like) based on the part-of-speech tags.
A14, splicing the words in the second word sequence according to the positions of the words in the sentence to be compressed to obtain a plurality of second clauses, and taking the set of the second clauses as a target sentence set.
In this embodiment, the sentence to be compressed is a long sentence composed of a plurality of sentences, and the spoken sentence dictionary stores a plurality of spoken sentences without semantic information.
For example, suppose the sentence to be compressed is: "Um, these things I already know. I can borrow. My wages are paid on the 15th of each month. Besides, I can only wait for the payroll on the 15th. Wages are issued on the 15th of every month. There is no way around it."
The set of sentences remaining after the filler sentence "these things I already know" and the modal word "um" are removed is taken as the target sentence set.
S2, when the number of sentences in the target sentence set is judged to be larger than a first threshold value, the sentences in the target sentence set are subjected to importance ranking, key sentences are extracted based on the ranking result, and whether the sentence length of the key sentences is larger than a second threshold value is judged.
In this embodiment, the first threshold may be 5.
The sorting the importance of the sentences in the target sentence set and extracting the key sentences based on the sorting result comprises the following steps:
b11, combining each sentence in the target sentence set with other sentences pairwise respectively to obtain a plurality of combination pairs;
b12, calculating similarity values of two sentences of each combination pair in the plurality of combination pairs, and determining a similarity matrix corresponding to the target sentence set based on the similarity values;
b13, calculating the importance scores of the sentences in the target sentence set based on the similarity matrix, sequencing the sentences in the target sentence set according to the sequence of the importance scores from high to low, and taking the sentence with the highest sequence as a key sentence.
Assume that the similarity values corresponding to each combination pair in the target sentence set are shown in Table 1 below:

                  Sentence 1   Sentence 2   Sentence 3
    Sentence 1    1            0.63         0.44
    Sentence 2    0.63         1            0.78
    Sentence 3    0.44         0.78         1

TABLE 1
Then the similarity matrix corresponding to the target sentence set is s = [[1, 0.63, 0.44], [0.63, 1, 0.78], [0.44, 0.78, 1]].
The importance score is calculated as:

w_i = (1 − d) + d · (s · w′)_i

where w_i is the importance score of the i-th sentence in the target sentence set, d is a damping coefficient (ranging from 0 to 1, typically 0.85), s is the similarity matrix corresponding to the target sentence set, and w′_i is the importance score of the i-th sentence from the previous iteration.
In this embodiment, the initial importance score of each sentence is 1; the final scores are computed by iterative propagation according to the formula above, and the iteration is considered converged when the score change of every sentence falls below a given limit (e.g., 0.0001).
In this embodiment, the similarity value of the two sentences in each combination pair can be calculated using algorithms such as cosine similarity, Euclidean distance, Manhattan distance, or Minkowski distance.
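The iterative scoring can be sketched as follows using the Table 1 similarities. One caveat: for the iteration to converge, the sketch zeroes the diagonal and column-normalizes the matrix, as is standard in TextRank-style ranking; this normalization is an assumption, not something the patent spells out.

```python
# Similarity matrix from Table 1.
S = [[1.0, 0.63, 0.44],
     [0.63, 1.0, 0.78],
     [0.44, 0.78, 1.0]]

def importance_scores(sim, d=0.85, tol=1e-4, max_iter=100):
    n = len(sim)
    # Zero the diagonal, then normalize each column to sum to 1 (assumed,
    # TextRank-style, so that w = (1-d) + d*(m @ w') converges).
    m = [[sim[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    col = [sum(m[i][j] for i in range(n)) for j in range(n)]
    m = [[m[i][j] / col[j] if col[j] else 0.0 for j in range(n)]
         for i in range(n)]
    w = [1.0] * n                      # initial importance score of 1 per sentence
    for _ in range(max_iter):
        w_new = [(1 - d) + d * sum(m[i][j] * w[j] for j in range(n))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(w_new, w)) < tol:
            return w_new               # converged: every change below the limit
        w = w_new
    return w

scores = importance_scores(S)
print(max(range(3), key=lambda i: scores[i]))  # → 1 (sentence 2 ranks highest)
```

Sentence 2 scores highest because it is the most similar to both other sentences, so it would be extracted as the key sentence.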
After determining whether the number of sentences in the target sentence set is greater than a first threshold, the method further comprises:
if the number of sentences in the target sentence set is judged to be less than or equal to a first threshold value, determining the sentence type of the sentence to be compressed, acquiring an extraction rule corresponding to the sentence type from a second database, extracting the sentences from the target sentence set based on the extraction rule, and splicing the extracted sentences to obtain a key sentence.
In this embodiment, when the number of sentences in the target sentence set is less than or equal to the first threshold (e.g., 5), the sentence type of the sentence to be compressed is determined. Sentence types include question, answer, and declarative patterns, and the second database stores extraction rules corresponding to each type in advance. For example, the rule for the question pattern may be to extract the two sentences of the target sentence set located at the end of the sentence to be compressed; the rule for the answer pattern may be to extract the two sentences located at the beginning and the end; and the rule for the declarative pattern may be to extract the three sentences located at the beginning, the middle, and the end.
In this embodiment, the extraction rule is not limited, and the user may set the corresponding extraction rule according to a specific scenario.
And splicing the extracted sentences according to the sequence of the sentences in the sentences to be compressed to obtain the key sentences.
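The rule-based path can be sketched as a small rule table. The rules mirror the examples given above; the sentence-type label is assumed to come from an upstream classifier that the sketch does not implement.

```python
# Hypothetical rule table mirroring the examples in the description:
# each rule picks sentences by position within the target sentence set.
EXTRACTION_RULES = {
    "question":  lambda s: s[-2:],                                   # last two
    "answer":    lambda s: [s[0], s[-1]] if len(s) > 1 else s,       # first + last
    "statement": lambda s: ([s[0], s[len(s) // 2], s[-1]]            # first, middle, last
                            if len(s) > 2 else s),
}

def extract_key_sentence(target_set: list[str], sentence_type: str) -> str:
    extracted = EXTRACTION_RULES[sentence_type](target_set)
    # Splice in the order the sentences appear in the sentence to be compressed.
    return " ".join(extracted)

print(extract_key_sentence(["a", "b", "c", "d"], "answer"))  # → a d
```

As the description notes, the rules are not fixed: a user could register a different callable per scenario without touching the splicing logic.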
And S3, when the sentence length of the key sentence is judged to be greater than the second threshold, extracting the stem words of the key sentence, and splicing the stem words to obtain the target sentence.
In this embodiment, the extracting the stem words of the key sentence includes:
c11, performing word segmentation processing on the key sentence to obtain a third word sequence;
and C12, sequentially identifying the part of speech of each word in the third word sequence, determining the syntactic structure of the third word sequence based on the part of speech and a preset syntactic analysis strategy, and extracting the stem words in the third word sequence based on the syntactic structure.
In this embodiment, the parts of speech include nouns, verbs, adjectives, prepositions, negative words, adverbs, auxiliary words, and the like.
The preset syntactic analysis strategy is dependency syntactic analysis, and the determining of the syntactic structure of the third word sequence based on the part of speech and the preset syntactic analysis strategy comprises:
d11, determining a core word in the third word sequence based on the part of speech;
usually, the verb is a core word (there is usually only one verb in a sentence).
D12, determining the membership among the words in the third word sequence;
for example, if the key statement is: i eat a big apple, then the third word sequence is { I, eat, one, big, apple }, and its core word is "eat".
When the dependency relationship is analyzed, "one" belongs to "apple" and "big" also belongs to "apple".
D13, determining the syntactic structure of the third word sequence according to the core words and the affiliations.
The syntactic structure types include: subject-verb relationship, verb-object relationship, indirect-object relationship, fronted-object relationship, attribute (modifier-head) relationship, adverbial-head structure, verb-complement structure, coordinate relationship, preposition-object relationship, left-adjunct relationship, and right-adjunct relationship.
In the third word sequence {I, eat, a, big, apple}, "I" depends on "eat" through a subject-verb relationship, "a" and "big" depend on "apple" through attribute relationships, and "apple" depends on "eat" through a verb-object relationship.
The stem words extracted according to the syntax structure are 'I', 'eat' and 'apple'.
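The worked example can be sketched as follows. The dependency parse is hard-coded here (a real system would obtain it from a parser such as LTP or spaCy), and the set of stem-bearing relations is an assumption modeled on the example's result.

```python
# (word, head index, relation) triples for "I eat a big apple";
# heads are 1-indexed, with 0 marking the root (the core word).
PARSE = [
    ("I",     2, "subject-verb"),
    ("eat",   0, "root"),
    ("a",     5, "attribute"),
    ("big",   5, "attribute"),
    ("apple", 2, "verb-object"),
]

# Relations whose words count as stem words -- an assumption chosen so
# that the example yields "I", "eat", "apple" as in the description.
STEM_RELATIONS = {"subject-verb", "root", "verb-object"}

def extract_stem_words(parse):
    # Keep words attached by stem-bearing relations, in sentence order.
    return [word for word, _head, rel in parse if rel in STEM_RELATIONS]

print(" ".join(extract_stem_words(PARSE)))  # → I eat apple
```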
In this embodiment, after determining whether the sentence length of the key sentence is greater than a second threshold, the method further includes:
and if the sentence length of the key sentence is judged to be less than or equal to a second threshold value, taking the key sentence as a target sentence.
As can be seen from the above embodiment, the sentence compression method provided by the invention first performs spoken-language removal on the sentence to be compressed to obtain a target sentence set; this removes spoken sentences and spoken words that carry no semantic information, achieving a preliminary compression. Second, when the number of sentences in the target sentence set is judged to be greater than the first threshold, the sentences are ranked by importance and a key sentence is extracted based on the ranking result; this further removes redundant information while retaining the semantic information of the sentence to be compressed. Finally, when the length of the key sentence is judged to be greater than the second threshold, the stem words of the key sentence are extracted and spliced to obtain the target sentence. The invention thus reduces labeling cost and ensures the semantic accuracy of the compressed sentence.
Fig. 2 is a block diagram of a sentence compressing apparatus according to an embodiment of the present invention.
The sentence compressing apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the sentence compressing apparatus 100 may include a parsing module 110, an ordering module 120, and an extracting module 130. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the parsing module 110 is configured to parse a statement compression request sent by a user based on a client, obtain a to-be-compressed statement carried by the request, perform a spoken language removal process on the to-be-compressed statement to obtain a target statement set, and determine whether the number of sentences in the target statement set is greater than a first threshold.
The executing the spoken language removal processing on the statement to be compressed includes:
a21, acquiring a spoken-sentence dictionary from a first database, comparing each first clause in the sentence to be compressed with the spoken-sentence dictionary, and deleting any first clause that matches a sentence in the dictionary to obtain an initial sentence set;
a22, performing word segmentation processing on the sentences in the initial sentence set to obtain a first word sequence;
in this embodiment, the sentences in the initial sentence set may be segmented by using a statistical probability model or/and a segmentation method based on an N-gram language model.
A23, recognizing the spoken words in the first word sequence based on a spoken word recognition model, and deleting the spoken words to obtain a second word sequence;
in this embodiment, the spoken language word recognition model is a deep neural network model, and the deep neural network model recognizes part-of-speech tags of each word in the first word sequence, and rejects spoken languages (such as language, spoken words, and the like) based on the part-of-speech tags.
A24, splicing the words in the second word sequence according to the positions of the words in the sentence to be compressed to obtain a plurality of second clauses, and taking the set of the second clauses as a target sentence set.
In this embodiment, the sentence to be compressed is a long sentence composed of a plurality of sentences, and the spoken sentence dictionary stores a plurality of spoken sentences without semantic information.
For example, suppose the sentence to be compressed is: "Um, these things I already know. I can borrow. My wages are paid on the 15th of each month. Besides, I can only wait for the payroll on the 15th. Wages are issued on the 15th of every month. There is no way around it."
The set of sentences remaining after the filler sentence "these things I already know" and the modal word "um" are removed is taken as the target sentence set.
The sorting module 120 is configured to, when it is determined that the number of sentences in the target sentence set is greater than a first threshold, sort the importance of the sentences in the target sentence set, extract key sentences based on a sorting result, and determine whether a sentence length of the key sentences is greater than a second threshold.
In this embodiment, the first threshold may be 5.
The sorting the importance of the sentences in the target sentence set and extracting the key sentences based on the sorting result comprises the following steps:
b21, combining each sentence in the target sentence set with other sentences pairwise respectively to obtain a plurality of combination pairs;
b22, calculating similarity values of two sentences of each combination pair in the plurality of combination pairs, and determining a similarity matrix corresponding to the target sentence set based on the similarity values;
b23, calculating the importance scores of the sentences in the target sentence set based on the similarity matrix, sequencing the sentences in the target sentence set according to the sequence of the importance scores from high to low, and taking the sentence with the highest sequence as a key sentence.
Assume that the similarity values corresponding to the respective combination pairs in the target sentence set are as shown in table 1 above.
Then the similarity matrix corresponding to the target sentence set is s = [[1, 0.63, 0.44], [0.63, 1, 0.78], [0.44, 0.78, 1]].
The importance score is calculated as:

w_i = (1 - d) + d * s * w_i'

where w_i is the importance score of the i-th sentence in the target sentence set, d is the damping coefficient (ranging from 0 to 1, typically 0.85), s is the similarity matrix corresponding to the target sentence set, and w_i' is the importance score of the i-th sentence obtained in the previous iteration.
In this embodiment, the initial importance score of each sentence is 1, the final importance score of each sentence is calculated by iterative propagation according to the above formula, and convergence is reached when the score change of every sentence is less than a given limit (e.g., 0.0001).
In this embodiment, the similarity value of the two sentences in each combination pair may be calculated using a cosine similarity, Euclidean distance, Manhattan distance, or Minkowski distance algorithm.
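The iterative scoring described by the formula above can be sketched as follows. The tiny similarity matrix is invented for illustration (the values of Table 1 are not reproduced here), and cosine similarity over bag-of-words counts stands in for whichever similarity measure is chosen:

```python
import math

def cosine_similarity(a, b):
    # a, b: bag-of-words count dictionaries for two sentences.
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_sentences(sim, d=0.85, tol=1e-4, max_iter=100):
    # sim: n x n similarity matrix; iterate w_i = (1-d) + d * sum_j sim[i][j] * w_j
    n = len(sim)
    w = [1.0] * n  # initial importance score of each sentence is 1
    for _ in range(max_iter):
        new_w = [(1 - d) + d * sum(sim[i][j] * w[j] for j in range(n))
                 for i in range(n)]
        # Converged when no sentence's score changes more than the limit.
        if max(abs(new_w[i] - w[i]) for i in range(n)) < tol:
            return new_w
        w = new_w
    return w

# Toy 3-sentence similarity matrix (self-similarity set to 0).
sim = [[0.0, 0.4, 0.1],
       [0.4, 0.0, 0.3],
       [0.1, 0.3, 0.0]]
scores = rank_sentences(sim)
# The top-ranked sentence becomes the key sentence.
key_index = max(range(len(scores)), key=scores.__getitem__)
```

The middle sentence has the strongest links to the others, so it receives the highest score; this mirrors the TextRank-style propagation the formula describes.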
After determining whether the number of sentences in the target sentence set is greater than the first threshold, the sorting module 120 is further configured to:
if the number of sentences in the target sentence set is judged to be less than or equal to a first threshold value, determining the sentence type of the sentence to be compressed, acquiring an extraction rule corresponding to the sentence type from a second database, extracting the sentences from the target sentence set based on the extraction rule, and splicing the extracted sentences to obtain a key sentence.
In this embodiment, when the number of sentences in the target sentence set is less than or equal to the first threshold (e.g., 5), the sentence type of the sentence to be compressed is determined. Sentence types include a question pattern, an answer pattern, and a statement pattern, and the second database stores an extraction rule for each sentence type in advance. For example, the extraction rule for the question pattern may be to extract the two sentences of the target sentence set located at the end of the sentence to be compressed; the rule for the answer pattern may be to extract the two sentences located at the beginning and the end; and the rule for the statement pattern may be to extract the three sentences located at the beginning, middle, and end of the sentence to be compressed.
In this embodiment, the extraction rule is not limited, and the user may set the corresponding extraction rule according to a specific scenario.
The extracted sentences are spliced in the order in which they appear in the sentence to be compressed to obtain the key sentence.
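The fallback path for short sentence sets can be sketched as a small rule table. The rules below mirror the examples in the text, but the rule encoding and the sample clauses are invented for illustration:

```python
# Hypothetical extraction rules keyed by sentence type (the patent stores
# these in a second database); each rule picks positions in the clause list.
EXTRACTION_RULES = {
    "question":  lambda s: s[-2:],                         # last two clauses
    "answer":    lambda s: [s[0], s[-1]],                  # first and last
    "statement": lambda s: [s[0], s[len(s) // 2], s[-1]],  # head, middle, tail
}

def extract_key_sentence(clauses, sentence_type):
    picked = EXTRACTION_RULES[sentence_type](clauses)
    # Splice in original order, dropping duplicates from very short inputs.
    seen, ordered = set(), []
    for c in clauses:
        if c in picked and c not in seen:
            seen.add(c)
            ordered.append(c)
    return " ".join(ordered)

key = extract_key_sentence(
    ["My card is blocked", "I tried yesterday", "Please unblock it"],
    "statement",
)
```

As the text notes, the rules are not fixed: a user can register a different rule per scenario simply by replacing the entry in the table.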
And the extracting module 130 is configured to, when it is determined that the sentence length of the key sentence is greater than a second threshold, extract the trunk words of the key sentence, and splice the trunk words to obtain the target sentence.
In this embodiment, the extracting the stem words of the key sentence includes:
c21, performing word segmentation processing on the key sentence to obtain a third word sequence;
and C22, sequentially identifying the part of speech of each word in the third word sequence, determining the syntactic structure of the third word sequence based on the part of speech and a preset syntactic analysis strategy, and extracting the stem words in the third word sequence based on the syntactic structure.
In this embodiment, the parts of speech include nouns, verbs, adjectives, prepositions, negative words, adverbs, auxiliary words, and the like.
The preset syntactic analysis strategy is dependency syntactic analysis, and the determining of the syntactic structure of the third word sequence based on the part of speech and the preset syntactic analysis strategy comprises:
d21, determining a core word in the third word sequence based on the part of speech;
Usually the verb is the core word (a sentence typically contains only one predicate verb).
D22, determining the membership among the words in the third word sequence;
For example, if the key sentence is "I eat one big apple", the third word sequence is {I, eat, one, big, apple} and its core word is "eat".
In the dependency analysis, "one" depends on "apple", and "big" also depends on "apple".
D23, determining the syntactic structure of the third word sequence according to the core words and the affiliations.
The syntactic structure includes: a subject-verb relation, a verb-object relation, an indirect-object relation, a fronted-object relation, an attributive relation, an adverbial relation, a verb-complement relation, a coordinate relation, a preposition-object relation, a left-adjunct relation, and a right-adjunct relation.
The syntactic structure corresponding to the third word sequence {I, eat, one, big, apple} is {2: subject-verb relation, 6: verb-object relation, 2: verb-complement relation, 6: attributive relation, 6: attributive relation, 2: verb-object relation}, where each number is the index of the word's head.
The stem words extracted according to the syntax structure are 'I', 'eat' and 'apple'.
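The stem-word extraction over a dependency parse can be sketched as below. A real system would obtain the head indices and relation labels from a dependency parser (e.g., LTP or spaCy), so the hard-coded parse here is only an illustration of the worked example:

```python
# Toy dependency parse for "I eat one big apple": each word carries the
# 1-based index of its head word and its dependency relation label.
words = ["I", "eat", "one", "big", "apple"]
parse = [
    (2, "SBV"),  # I     -> eat   (subject-verb)
    (0, "HED"),  # eat   -> root  (core word)
    (5, "ATT"),  # one   -> apple (attributive)
    (5, "ATT"),  # big   -> apple (attributive)
    (2, "VOB"),  # apple -> eat   (verb-object)
]

# Keep the core word plus its direct subject and object; drop modifiers
# such as attributives and adverbials.
STEM_RELATIONS = {"HED", "SBV", "VOB"}
stem = [w for w, (head, rel) in zip(words, parse) if rel in STEM_RELATIONS]
compressed = " ".join(stem)
```

The choice of which relations count as "stem" (here HED, SBV, VOB) is an assumption; the patent only specifies that the stem words are extracted according to the syntactic structure.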
In this embodiment, after determining whether the sentence length of the key sentence is greater than a second threshold, the extracting module 130 is further configured to:
and if the sentence length of the key sentence is judged to be less than or equal to a second threshold value, taking the key sentence as a target sentence.
As can be seen from the foregoing embodiment, the sentence compression apparatus 100 provided by the present invention first performs spoken-language removal on the sentence to be compressed to obtain the target sentence set; this removes spoken sentences and spoken words without semantic information, achieving a preliminary compression. Second, when the number of sentences in the target sentence set is greater than the first threshold, the sentences are ranked by importance and key sentences are extracted from the ranking result, which further removes redundant information while retaining the semantic information of the sentence to be compressed. Finally, when the sentence length of the key sentence is greater than the second threshold, the stem words of the key sentence are extracted and spliced to obtain the target sentence. The invention thereby reduces labeling cost while preserving the semantic accuracy of the compressed sentence.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a statement compression method according to an embodiment of the present invention.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In the embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicatively connected to each other through a system bus, where the memory 11 stores a statement compression program 10, and the statement compression program 10 is executable by the processor 12. Fig. 3 only shows the electronic device 1 with the components 11-13 and the sentence compression program 10, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
The memory 11 includes an internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; in other embodiments, it may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the application software installed in the electronic device 1, for example, the code of the statement compression program 10 in an embodiment of the present invention. The memory 11 may also be used to temporarily store data that has been output or is to be output.
Processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally configured to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, run the statement compression program 10.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is used for establishing a communication connection between the electronic device 1 and a client (not shown).
Optionally, the electronic device 1 may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The statement compression program 10 stored in the memory 11 of the electronic device 1 is a combination of a plurality of instructions, which when executed in the processor 12, can implement:
analyzing a statement compression request sent by a user based on a client, acquiring a to-be-compressed statement carried by the request, performing spoken language removal processing on the to-be-compressed statement to obtain a target statement set, and judging whether the number of sentences in the target statement set is greater than a first threshold value or not;
when the number of sentences in the target sentence set is judged to be larger than a first threshold value, sorting the importance of the sentences in the target sentence set, extracting key sentences based on a sorting result, and judging whether the sentence length of the key sentences is larger than a second threshold value;
and when the sentence length of the key sentence is judged to be larger than a second threshold value, extracting the trunk words of the key sentence, and splicing the trunk words to obtain the target sentence.
Specifically, the processor 12 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the statement compression program 10, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may be non-volatile or volatile. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U-disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The computer-readable storage medium stores a sentence compression program 10, and the sentence compression program 10 can be executed by one or more processors, and the specific implementation of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the sentence compression method, which is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of sentence compression, the method comprising:
analyzing a statement compression request sent by a user based on a client, acquiring a to-be-compressed statement carried by the request, performing spoken language removal processing on the to-be-compressed statement to obtain a target statement set, and judging whether the number of sentences in the target statement set is greater than a first threshold value or not;
when the number of sentences in the target sentence set is judged to be larger than a first threshold value, sorting the importance of the sentences in the target sentence set, extracting key sentences based on a sorting result, and judging whether the sentence length of the key sentences is larger than a second threshold value;
and when the sentence length of the key sentence is judged to be larger than a second threshold value, extracting the trunk words of the key sentence, and splicing the trunk words to obtain the target sentence.
2. The sentence compression method of claim 1, wherein the performing of the spoken language removal processing on the sentence to be compressed comprises:
acquiring a spoken sentence dictionary from a first database, comparing each first clause in the sentences to be compressed with the spoken sentence dictionary, and deleting a specified first clause if one specified first clause is matched with one sentence in the spoken sentence dictionary to obtain an initial sentence set;
performing word segmentation processing on the sentences in the initial sentence set to obtain a first word sequence;
recognizing the spoken words in the first word sequence based on a spoken word recognition model, and deleting the spoken words to obtain a second word sequence;
and splicing the words in the second word sequence according to the positions of the words in the sentence to be compressed to obtain a plurality of second clauses, and taking the set of the second clauses as a target sentence set.
3. The sentence compression method of claim 1, wherein the sorting of the importance of the sentences in the target sentence set and the extracting of the key sentences based on the sorting result comprises:
combining each sentence in the target sentence set with other sentences pairwise to obtain a plurality of combination pairs;
calculating similarity values of two sentences of each combination pair in the plurality of combination pairs, and determining a similarity matrix corresponding to the target sentence set based on the similarity values;
and calculating the importance scores of the sentences in the target sentence set based on the similarity matrix, sequencing the sentences in the target sentence set according to the sequence of the importance scores from high to low, and taking the sentence with the highest sequence as a key sentence.
4. The sentence compression method of claim 1 wherein after determining whether the number of sentences in the target sentence set is greater than a first threshold, the method further comprises:
if the number of sentences in the target sentence set is judged to be less than or equal to a first threshold value, determining the sentence type of the sentence to be compressed, acquiring an extraction rule corresponding to the sentence type from a second database, extracting the sentences from the target sentence set based on the extraction rule, and splicing the extracted sentences to obtain a key sentence.
5. The sentence compression method of claim 1, wherein the extracting stem words of the key sentence comprises:
performing word segmentation processing on the key sentence to obtain a third word sequence;
sequentially identifying the part of speech of each word in the third word sequence, determining a syntactic structure of the third word sequence based on the part of speech and a preset syntactic analysis strategy, and extracting the stem word in the third word sequence based on the syntactic structure.
6. The sentence compression method of any one of claims 1-5, wherein after determining whether the sentence length of the key sentence is greater than a second threshold, the method further comprises:
and if the sentence length of the key sentence is judged to be less than or equal to a second threshold value, taking the key sentence as a target sentence.
7. A sentence compression apparatus, the apparatus comprising:
the analysis module is used for analyzing a statement compression request sent by a user based on a client, acquiring a statement to be compressed carried by the request, executing spoken language removal processing on the statement to be compressed to obtain a target statement set, and judging whether the number of the sentences in the target statement set is greater than a first threshold value or not;
the sorting module is used for sorting the importance of the sentences in the target sentence set when the number of the sentences in the target sentence set is judged to be larger than a first threshold value, extracting key sentences based on a sorting result, and judging whether the sentence length of the key sentences is larger than a second threshold value or not;
and the extraction module is used for extracting the trunk words of the key sentences and splicing the trunk words to obtain the target sentences when the sentence length of the key sentences is judged to be larger than a second threshold value.
8. The sentence compression apparatus of claim 7 wherein the ranking of the importance of the sentences in the target sentence set and the extracting of the key sentences based on the ranking result comprises:
combining each sentence in the target sentence set with other sentences pairwise to obtain a plurality of combination pairs;
calculating similarity values of two sentences of each combination pair in the plurality of combination pairs, and determining a similarity matrix corresponding to the target sentence set based on the similarity values;
and calculating the importance scores of the sentences in the target sentence set based on the similarity matrix, sequencing the sentences in the target sentence set according to the sequence of the importance scores from high to low, and taking the sentence with the highest sequence as a key sentence.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a sentence compression program executable by the at least one processor, the sentence compression program being executable by the at least one processor to enable the at least one processor to perform the sentence compression method of any of claims 1-6.
10. A computer-readable storage medium having stored thereon a sentence compression program executable by one or more processors to implement the sentence compression method of any of claims 1-6.
CN202011386421.4A 2020-12-01 2020-12-01 Statement compression method and device, electronic equipment and readable storage medium Pending CN112434515A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011386421.4A CN112434515A (en) 2020-12-01 2020-12-01 Statement compression method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011386421.4A CN112434515A (en) 2020-12-01 2020-12-01 Statement compression method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112434515A true CN112434515A (en) 2021-03-02

Family

ID=74697605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011386421.4A Pending CN112434515A (en) 2020-12-01 2020-12-01 Statement compression method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112434515A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989058A (en) * 2015-02-06 2016-10-05 北京中搜网络技术股份有限公司 Chinese news brief generating system and method
CN107451139A (en) * 2016-05-30 2017-12-08 北京三星通信技术研究有限公司 File resource methods of exhibiting, device and corresponding smart machine
CN108470026A (en) * 2018-03-23 2018-08-31 北京奇虎科技有限公司 The sentence trunk method for extracting content and device of headline
CN108829894A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Spoken word identification and method for recognizing semantics and its device
CN110413986A (en) * 2019-04-12 2019-11-05 上海晏鼠计算机技术股份有限公司 A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN111444703A (en) * 2020-03-04 2020-07-24 中国平安人寿保险股份有限公司 Statement compression method, device, equipment and computer readable storage medium
US20200312297A1 (en) * 2019-03-28 2020-10-01 Wipro Limited Method and device for extracting factoid associated words from natural language sentences


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘文锋: "Research on Automatic Text Summarization Methods Based on Representation Learning and Dependency Syntax", China Doctoral Dissertations Full-text Database, Information Science and Technology, 15 August 2020 (2020-08-15) *
吴仁守: "Research on Short-Text Summarization Generation Based on Text Structure Information", China Masters' Theses Full-text Database, Information Science and Technology, 15 April 2020 (2020-04-15), page 2 *
吴玉林: "Research and Implementation of Topic-Model-Based Multi-Document Automatic Summarization", China Masters' Theses Full-text Database, Information Science and Technology, 15 June 2020 (2020-06-15), pages 9-11 *

Similar Documents

Publication Publication Date Title
CN111581976B (en) Medical term standardization method, device, computer equipment and storage medium
CN111460787B (en) Topic extraction method, topic extraction device, terminal equipment and storage medium
JP5936698B2 (en) Word semantic relation extraction device
WO2022121171A1 (en) Similar text matching method and apparatus, and electronic device and computer storage medium
CN112541056B (en) Medical term standardization method, device, electronic equipment and storage medium
WO2022078308A1 (en) Method and apparatus for generating judgment document abstract, and electronic device and readable storage medium
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN108427702B (en) Target document acquisition method and application server
JP2002215619A (en) Translation sentence extracting method from translated document
US11170169B2 (en) System and method for language-independent contextual embedding
CN114330335B (en) Keyword extraction method, device, equipment and storage medium
CN112329460A (en) Text topic clustering method, device, equipment and storage medium
CN107577663A (en) A kind of key-phrase extraction method and apparatus
CN111177375A (en) Electronic document classification method and device
US20130024403A1 (en) Automatically induced class based shrinkage features for text classification
CN114220505A (en) Information extraction method of medical record data, terminal equipment and readable storage medium
WO2019085118A1 (en) Topic model-based associated word analysis method, and electronic apparatus and storage medium
EP3425531A1 (en) System, method, electronic device, and storage medium for identifying risk event based on social information
CN114118072A (en) Document structuring method and device, electronic equipment and computer readable storage medium
CN109241281B (en) Software failure reason generation method, device and equipment
CN114969385B (en) Knowledge graph optimization method and device based on document attribute assignment entity weight
WO2022141860A1 (en) Text deduplication method and apparatus, electronic device, and computer readable storage medium
CN114398877A (en) Theme extraction method and device based on artificial intelligence, electronic equipment and medium
CN112434515A (en) Statement compression method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination