WO2022262266A1 - Method and apparatus for generating a text summary, computer device and storage medium - Google Patents

Method and apparatus for generating a text summary, computer device and storage medium Download PDF

Info

Publication number
WO2022262266A1
WO2022262266A1 (PCT/CN2022/071791)
Authority
WO
WIPO (PCT)
Prior art keywords
clauses
clause
sentence
similarity
recommended
Prior art date
Application number
PCT/CN2022/071791
Other languages
English (en)
Chinese (zh)
Inventor
Li Xiaxin (李夏昕)
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2022262266A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a text abstract generation method, device, computer equipment and storage medium.
  • Text summarization technology is an important technology in the field of artificial intelligence. For humans, reading a long text and extracting its core summary is an innate ability; for computers, it remains one of the most challenging problems in artificial intelligence.
  • the Internet in today's world carries a large amount of text information, including a large number of medium and long texts. Understanding these texts by machine and extracting core summaries can support various applications beneficial to human society, such as: media monitoring, search engine marketing and optimization, financial and legal text analysis, social media marketing, book and document content indexing, video conference summaries, automatic content creation, and more.
  • the unsupervised extraction scheme is most commonly used in the implementation of text summarization technology in the industry.
  • the specific methods include graph-based, topic-model-based, centrality-based and information-redundancy-based methods.
  • the graph-based TextRank algorithm is the most classic and widely used method.
  • the TextRank algorithm has good versatility and is suitable for texts in various fields as well as medium and long texts, but it also has some defects: (1) in the TextRank algorithm, the edge connecting two graph nodes is a single undirected edge with only a single weight; from the perspective of this single undirected edge, the sentences at its two end nodes are weighted equally.
  • however, when any two sentences in the article are compared, their importance should differ, some higher and some lower;
  • (2) any two nodes in the graph are connected by an edge, which effectively mixes all the sentences of the article together for modelling, without considering the neighbour relationships of sentences or their positions in the original article.
  • the position of a sentence and its context play an important role in judging whether it is a summary sentence.
  • sentences at the beginning or end of an article or paragraph, as well as concluding sentences, are likely to be summary sentences;
  • (3) when the TextRank algorithm calculates the weights of edges in the graph, it considers only the plain-text similarity between two sentences and not their semantic similarity, that is, it does not handle text that is written differently but has similar semantics;
  • (4) when calculating plain-text similarity, the TextRank algorithm does not distinguish the importance of different terms and does not filter out unimportant words by part of speech, so the accuracy of plain-text similarity calculation needs to be improved.
  • Embodiments of the present application provide a text abstract generation method, device, computer equipment, and storage medium, which can realize more accurate text abstract generation based on artificial intelligence means.
  • the embodiment of the present application provides a method for generating a text abstract, which includes:
  • the graph adjacency matrix is input into the TextRank algorithm to calculate the importance of each clause;
  • the embodiment of the present application provides a text abstract generating device, which includes:
  • An acquisition unit configured to respond to a text summary generation instruction, and obtain data to be processed according to the text summary generation instruction
  • a segmentation unit configured to segment the data to be processed according to the task scene acquisition dictionary to obtain multiple clauses
  • a calculation unit configured to calculate the mutual recommendation degree between every two clauses in the plurality of clauses
  • the calculation unit is also used to calculate the semantic similarity between every two clauses in the plurality of clauses;
  • the calculation unit is also used to calculate the positional similarity between every two clauses in the plurality of clauses;
  • the fusion unit is used to fuse the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses to obtain a graph adjacency matrix;
  • the calculation unit is also used to input the graph adjacency matrix to the TextRank algorithm to calculate the importance of each clause;
  • a screening unit is used to screen according to the importance of each clause to obtain alternative clauses
  • a post-processing unit configured to post-process the candidate clauses to obtain a summary sentence.
  • the embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the following steps are implemented:
  • the graph adjacency matrix is input into the TextRank algorithm to calculate the importance of each clause;
  • the embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor performs the following steps:
  • the graph adjacency matrix is input into the TextRank algorithm to calculate the importance of each clause;
  • the embodiment of the present application provides a text summary generation method, device, computer equipment and storage medium, which effectively overcome the defect of the traditional TextRank algorithm that, when calculating plain-text similarity, it neither distinguishes the importance of different terms nor filters out unimportant words by part of speech, thereby improving the likelihood that sentences with strong business relevance are selected for the summary.
  • the contextual relationships between sentences and their positions in the original article are fully considered, which effectively overcomes the corresponding problems in traditional methods.
  • post-processing is added on top of traditional text summarization, and the summary obtained by the graph algorithm is corrected to improve the quality of the final output summary, thereby achieving more accurate text summary generation based on artificial intelligence.
  • FIG. 1 is a schematic flow diagram of a method for generating a text abstract provided in an embodiment of the present application
  • FIG. 2 is a schematic block diagram of a text abstract generation device provided by an embodiment of the present application.
  • Fig. 3 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a method for generating a text abstract provided in an embodiment of the present application.
  • the text summary generation instruction may be triggered by relevant staff, such as media monitors, online educators, and the like.
  • said obtaining the data to be processed according to said text summary generation instruction includes:
  • Link to the target address, and acquire the data stored at the target address as the data to be processed.
  • the target address may include, but not limited to: a web page address, a folder address, a database address, and the like.
  • the data to be processed is directly extracted.
  • the data to be processed can be obtained directly from the text summary generation instruction.
  • the dictionary obtained according to the task scene performs segmentation processing on the data to be processed, and obtains a plurality of clauses including:
  • the data to be processed can be segmented into sentences according to sentence-ending punctuation marks, such as periods, question marks and exclamation marks.
  • a word segmentation tool (such as a Chinese word segmentation tool) can be used to load the target dictionary, so as to segment business-related entries well.
  • sentences can be segmented according to specific dictionaries associated with specific task scenarios, so as to better segment business-related entries.
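As a rough illustration of the segmentation step described above, splitting the text into clauses at sentence-ending punctuation could be sketched as follows; the regular expression and function name are illustrative, not taken from the application:

```python
import re

def split_clauses(text):
    # Split at sentence-ending punctuation (period, question mark,
    # exclamation mark, plus their full-width Chinese equivalents),
    # keeping each mark attached to its clause.
    parts = re.split(r'(?<=[.!?。！？])\s*', text)
    return [p for p in parts if p.strip()]

clauses = split_clauses("First sentence. Second one! A question? Last.")
```

A task-specific dictionary would then be loaded into a word segmentation tool (e.g. a Chinese tokenizer) before word-level segmentation, as the text describes.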
  • the calculating the mutual recommendation degree between every two clauses in the plurality of clauses includes:
  • L2 regularization is performed on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
  • the configuration requirement may be uploaded by the user.
  • mat_t(S_i, S_j) represents the mutual recommendation degree between any two clauses S_i and S_j;
  • W_k represents a word that appears in both S_i and S_j;
  • TermWeight represents the weight of the word;
  • T_k represents the part of speech of W_k;
  • valid_postags represents the set of valid parts of speech.
  • the effective parts of speech include nouns, verbs, adjectives, and adverbs, which are closely related to sentence semantics.
  • weights are assigned to business words with different importance.
  • the weight of a product name entry can be twice that of a general entry
  • the weight of a disease name or competing product company name entry can be 1.5 times that of a general entry.
  • the specific weight values can be obtained by performing a parameter search based on regression-test results.
  • L2 is a regularization term, also called a penalty term: a term added to the loss function to constrain the model's parameters and prevent overfitting.
  • the L2 norm corresponds to a Gaussian prior and is differentiable everywhere.
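Putting the pieces above together, a minimal sketch of the mutual recommendation computation might look like this; the part-of-speech tags, the example term weights, and the choice of whole-matrix L2 normalisation are assumptions made for illustration:

```python
import numpy as np

# Assumed part-of-speech tags kept for scoring; nouns, verbs,
# adjectives and adverbs carry most of the sentence semantics.
VALID_POSTAGS = {"n", "v", "a", "d"}

# Illustrative business-term weights (e.g. a product name doubled);
# real values would come from a regression-test parameter search.
TERM_WEIGHT = {"ProductX": 2.0}

def mutual_recommendation(clauses):
    """clauses: one list of (word, postag) pairs per clause.
    Returns an L2-normalised mutual-recommendation matrix."""
    n = len(clauses)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wi = {w for w, t in clauses[i] if t in VALID_POSTAGS}
            wj = {w for w, t in clauses[j] if t in VALID_POSTAGS}
            # Sum the weights of valid-POS words shared by both clauses.
            mat[i, j] = sum(TERM_WEIGHT.get(w, 1.0) for w in wi & wj)
    norm = np.linalg.norm(mat)
    return mat / norm if norm > 0 else mat
```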
  • the calculating the semantic similarity between every two clauses in the plurality of clauses includes:
  • the cosine similarity between every two clauses is determined as the semantic similarity between every two clauses.
  • mat_s(S_i, S_j) represents the semantic similarity between any two clauses S_i and S_j;
  • s_i_embed represents the embedding vector of clause S_i;
  • s_j_embed represents the embedding vector of clause S_j;
  • cosine_similarity denotes computing the cosine similarity.
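The semantic-similarity step follows directly from these definitions; the clause embeddings are assumed to come from some pretrained sentence encoder, which is outside this sketch:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two clause embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_similarity_matrix(embeddings):
    # embeddings: one fixed-length vector per clause, e.g. produced by
    # a pretrained sentence encoder (the encoder itself is assumed).
    n = len(embeddings)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            mat[i, j] = cosine_similarity(embeddings[i], embeddings[j])
    return mat
```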
  • the calculating the position similarity between every two clauses in the plurality of clauses includes:
  • the corresponding matrix cell value is the third value
  • Matrix transformation is performed according to the matrix cell value to obtain the positional similarity between every two clauses.
  • the first numerical value, the second numerical value, the third numerical value and the fourth numerical value can be customized.
  • for example, the first value can be configured as 2, the second value as 1.5, the third value as 2.5, and the fourth value as 1.
  • the front preset position and the rear preset position can also be customized; for example, the front preset position can be configured as the first 5% and, correspondingly, the rear preset position as the last 5%.
  • the specified attribute may be a summary attribute, that is, the sentence with the specified attribute is a summary sentence.
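The extract does not fully specify which condition maps to which of the four values, so the following is only a plausible sketch of the positional-similarity matrix: positional scores are assigned per clause and combined pairwise, with all names and the condition-to-value mapping assumed:

```python
import numpy as np

def positional_similarity(n_clauses, summary_flags,
                          first=2.0, second=1.5, third=2.5, default=1.0,
                          head_frac=0.05, tail_frac=0.05):
    """Assign each clause a positional score, then form a symmetric
    matrix from pairwise products and normalise it to [0, 1].
    The condition-to-value mapping here is an assumption."""
    head = max(1, int(n_clauses * head_frac))
    tail = max(1, int(n_clauses * tail_frac))
    scores = np.full(n_clauses, default)
    for i in range(n_clauses):
        if i < head:
            scores[i] = first            # clause near the start
        elif i >= n_clauses - tail:
            scores[i] = second           # clause near the end
        if summary_flags[i]:
            scores[i] = third            # clause with the summary attribute
    mat = np.outer(scores, scores)
    return mat / mat.max()
```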
  • the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses are fused using the following formula , to get the graph adjacency matrix:
  • mat_adjc represents the graph adjacency matrix;
  • mat_t represents the mutual recommendation degree between each two clauses;
  • mat_s represents the semantic similarity between each two clauses;
  • mat_o represents the positional similarity between each two clauses;
  • α represents the weight of the mutual recommendation degree;
  • β represents the weight of the semantic similarity;
  • α > 0, β > 0, and α + β < 1.
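Under the reading that the positional similarity receives the remaining weight 1 − α − β (an assumption consistent with α + β < 1), the fusion is a single weighted sum of the three matrices:

```python
import numpy as np

def fuse(mat_t, mat_s, mat_o, alpha=0.4, beta=0.4):
    # Weighted fusion of the three pairwise matrices into one graph
    # adjacency matrix; (1 - alpha - beta) as the positional weight
    # is an assumption, and the example alpha/beta are placeholders.
    assert alpha > 0 and beta > 0 and alpha + beta < 1
    return alpha * mat_t + beta * mat_s + (1 - alpha - beta) * mat_o
```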
  • in the traditional scheme, the edge connecting two graph nodes is a single undirected edge with only a single weight; from the perspective of that edge, the sentences at its two end nodes carry equal weight. However, when any two sentences in the article are compared, their importance should differ, so treating the two sentences as equally important is clearly wrong.
  • in the resulting graph adjacency matrix, the connection between two graph nodes is modelled as two directed edges rather than a single undirected edge, which overcomes the single-undirected-edge defect of the traditional scheme.
  • the TextRank value of each node is iteratively calculated as the importance of each corresponding clause, which will not be repeated here.
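The iteration itself is the standard TextRank/PageRank power method; a generic version over a weighted adjacency matrix could look like this (the damping factor and tolerances are conventional defaults, not values from the application):

```python
import numpy as np

def textrank(adj, d=0.85, tol=1e-6, max_iter=100):
    """Power iteration over a weighted adjacency matrix;
    adj[i, j] is the weight of the edge from node i to node j."""
    n = adj.shape[0]
    # Row-normalise so each node distributes its score over its
    # outgoing edges; guard against nodes with no outgoing weight.
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0
    trans = adj / out
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = (1 - d) / n + d * trans.T @ scores
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new
    return scores
```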
  • the screening is performed according to the importance of each clause, and the alternative clauses obtained include:
  • a clause whose importance is greater than or equal to the preset threshold is acquired as the candidate clause.
  • the preset threshold can be customized, such as 95%.
  • the screening according to the importance of each clause to obtain the alternative clauses also includes:
  • the clauses arranged before the preset position are determined as the candidate clauses.
  • the preset position can be customized, for example the top 20 clauses.
  • the preset position is equivalent to a hyperparameter, which can be obtained through experiments or debugging.
  • the ROUGE score of the summary is used as the metric for a hyperparameter search over the preset position, and the value corresponding to the best ROUGE score is used as the preset position.
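Both screening variants described above (importance threshold and top-k by rank) can be sketched in one helper; the function name and signature are illustrative:

```python
def screen(clauses, scores, threshold=None, top_k=None):
    """Select candidate clauses either by an importance threshold or
    by keeping the top_k highest-ranked clauses (both hyperparameters
    would be tuned, e.g. via a ROUGE-based search)."""
    ranked = sorted(zip(clauses, scores), key=lambda p: -p[1])
    if threshold is not None:
        ranked = [(c, s) for c, s in ranked if s >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [c for c, _ in ranked]
```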
  • the post-processing of the candidate clauses to obtain a summary sentence includes:
  • such clauses may include, but are not limited to: interrogative sentences and sentences formed with paired conjunctions.
  • the type of each clause in the candidate clauses may be judged according to keywords or symbols obtained through text recognition. For example: when "?" is recognized, it is judged as an interrogative sentence.
  • when a summary sentence is a question, the immediately following sentence should usually also be judged a summary sentence; when a summary sentence is one half of a paired-conjunction construction such as "although ... (yet)" or "because ... (so)", the other constituent half should usually also be judged a summary sentence.
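The question-sentence rule above can be sketched as a simple post-processing pass; only the interrogative rule is implemented here, and the paired-conjunction rule would follow the same pattern:

```python
def postprocess(all_clauses, selected_idx):
    """If a selected clause ends in a question mark, also include the
    clause that follows it. A simplified sketch; names are illustrative."""
    keep = set(selected_idx)
    for i in sorted(selected_idx):
        clause = all_clauses[i]
        if clause.rstrip().endswith(("?", "？")) and i + 1 < len(all_clauses):
            keep.add(i + 1)  # pull in the likely answer to the question
    return [all_clauses[i] for i in sorted(keep)]
```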
  • post-processing is added on the basis of traditional text summarization, and the summarization result obtained by the graph algorithm is corrected to improve the quality of the final output summarization.
  • the summary sentences can be stored on the blockchain nodes.
  • the present application can respond to the text summary generation instruction, obtain the data to be processed according to the text summary generation instruction, and perform segmentation processing on the data to be processed to obtain multiple clauses.
  • sentence segmentation uses the specific dictionary associated with the task scenario so that business-related terms are segmented well; the mutual recommendation degree between each two of the multiple clauses is calculated while retaining only nouns, verbs, adjectives and adverbs, the four parts of speech closely related to sentence semantics; and when calculating the common-word score, business words of different importance are given different weights.
  • FIG. 2 is a schematic block diagram of an apparatus for generating a text summary provided by an embodiment of the present application.
  • the device 100 for generating a text summary includes: an acquisition unit 101 , a segmentation unit 102 , a calculation unit 103 , a fusion unit 104 , a screening unit 105 , and a post-processing unit 106 .
  • the obtaining unit 101 obtains the data to be processed according to the text summary generation instruction.
  • the text summary generation instruction may be triggered by relevant staff, such as media monitors, online educators, and the like.
  • the acquiring unit 101 acquiring the data to be processed according to the text summary generation instruction includes:
  • Link to the target address, and acquire the data stored at the target address as the data to be processed.
  • the target address may include, but not limited to: a web page address, a folder address, a database address, and the like.
  • the data to be processed is directly extracted.
  • the data to be processed can be obtained directly from the text summary generation instruction.
  • the segmentation unit 102 performs segmentation processing on the data to be processed according to the task scene acquisition dictionary to obtain multiple clauses.
  • the segmentation unit 102 performs segmentation processing on the data to be processed according to the task scene acquisition dictionary, and obtains a plurality of clauses including:
  • the data to be processed can be segmented into sentences according to sentence-ending punctuation marks, such as periods, question marks and exclamation marks.
  • a word segmentation tool (such as a Chinese word segmentation tool) can be used to load the target dictionary, so as to segment business-related entries well.
  • sentences can be segmented according to specific dictionaries associated with specific task scenarios, so as to better segment business-related entries.
  • the calculation unit 103 calculates the degree of mutual recommendation between every two clauses in the plurality of clauses.
  • the calculating unit 103 calculating the mutual recommendation degree between every two clauses in the plurality of clauses includes:
  • L2 regularization is performed on the recommendation degree matrix to obtain the mutual recommendation degree between every two clauses.
  • the configuration requirement may be uploaded by the user.
  • mat_t(S_i, S_j) represents the mutual recommendation degree between any two clauses S_i and S_j;
  • W_k represents a word that appears in both S_i and S_j;
  • TermWeight represents the weight of the word;
  • T_k represents the part of speech of W_k;
  • valid_postags represents the set of valid parts of speech.
  • the effective parts of speech include nouns, verbs, adjectives, and adverbs, which are closely related to sentence semantics.
  • weights are assigned to business words with different importance.
  • the weight of a product name entry can be twice that of a general entry
  • the weight of a disease name or competing product company name entry can be 1.5 times that of a general entry.
  • the specific weight values can be obtained by performing a parameter search based on regression-test results.
  • L2 is a regularization term, also called a penalty term: a term added to the loss function to constrain the model's parameters and prevent overfitting.
  • the L2 norm corresponds to a Gaussian prior and is differentiable everywhere.
  • the calculation unit 103 calculates the semantic similarity between every two clauses in the plurality of clauses.
  • the calculating unit 103 calculating the semantic similarity between every two clauses in the plurality of clauses includes:
  • the cosine similarity between every two clauses is determined as the semantic similarity between every two clauses.
  • mat_s(S_i, S_j) represents the semantic similarity between any two clauses S_i and S_j;
  • s_i_embed represents the embedding vector of clause S_i;
  • s_j_embed represents the embedding vector of clause S_j;
  • cosine_similarity denotes computing the cosine similarity.
  • the calculation unit 103 calculates the position similarity between every two clauses in the plurality of clauses.
  • the computing unit 103 calculating the position similarity between every two clauses in the plurality of clauses includes:
  • the corresponding matrix cell value is the third value
  • Matrix transformation is performed according to the matrix cell value to obtain the positional similarity between every two clauses.
  • the first numerical value, the second numerical value, the third numerical value and the fourth numerical value can be customized.
  • for example, the first value can be configured as 2, the second value as 1.5, the third value as 2.5, and the fourth value as 1.
  • the front preset position and the rear preset position can also be customized; for example, the front preset position can be configured as the first 5% and, correspondingly, the rear preset position as the last 5%.
  • the specified attribute may be a summary attribute, that is, the sentence with the specified attribute is a summary sentence.
  • the fusion unit 104 performs fusion processing on the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the position similarity between each two clauses to obtain a graph adjacency matrix.
  • the mutual recommendation degree between each two clauses, the semantic similarity between each two clauses, and the positional similarity between each two clauses are fused using the following formula , to get the graph adjacency matrix:
  • mat_adjc represents the graph adjacency matrix;
  • mat_t represents the mutual recommendation degree between each two clauses;
  • mat_s represents the semantic similarity between each two clauses;
  • mat_o represents the positional similarity between each two clauses;
  • α represents the weight of the mutual recommendation degree;
  • β represents the weight of the semantic similarity;
  • α > 0, β > 0, and α + β < 1.
  • in the traditional scheme, the edge connecting two graph nodes is a single undirected edge with only a single weight; from the perspective of that edge, the sentences at its two end nodes carry equal weight. However, when any two sentences in the article are compared, their importance should differ, so treating the two sentences as equally important is clearly wrong.
  • in the resulting graph adjacency matrix, the connection between two graph nodes is modelled as two directed edges rather than a single undirected edge, which overcomes the single-undirected-edge defect of the traditional scheme.
  • the calculation unit 103 inputs the graph adjacency matrix to the TextRank algorithm to calculate the importance of each clause.
  • the TextRank value of each node is iteratively calculated as the importance of each corresponding clause, which will not be repeated here.
  • the screening unit 105 screens according to the importance of each clause to obtain candidate clauses.
  • the screening unit 105 screens according to the importance of each clause, and obtains alternative clauses including:
  • a clause whose importance is greater than or equal to the preset threshold is acquired as the candidate clause.
  • the preset threshold can be customized, such as 95%.
  • the screening unit 105 performs screening according to the importance of each clause, and the obtained alternative clauses also include:
  • the clauses arranged before the preset position are determined as the candidate clauses.
  • the preset position can be customized, for example the top 20 clauses.
  • the preset position is equivalent to a hyperparameter, which can be obtained through experiments or debugging.
  • the ROUGE score of the summary is used as the metric for a hyperparameter search over the preset position, and the value corresponding to the best ROUGE score is used as the preset position.
  • the post-processing unit 106 performs post-processing on the candidate clauses to obtain a summary sentence.
  • the post-processing unit 106 performs post-processing on the candidate clauses to obtain a summary sentence including:
  • such clauses may include, but are not limited to: interrogative sentences and sentences formed with paired conjunctions.
  • the type of each clause in the candidate clauses may be judged according to keywords or symbols obtained through text recognition. For example: when "?" is recognized, it is judged as an interrogative sentence.
  • when a summary sentence is a question, the immediately following sentence should usually also be judged a summary sentence; when a summary sentence is one half of a paired-conjunction construction such as "although ... (yet)" or "because ... (so)", the other constituent half should usually also be judged a summary sentence.
  • post-processing is added on the basis of traditional text summarization, and the summarization result obtained by the graph algorithm is corrected to improve the quality of the final output summarization.
  • the summary sentences can be stored on the blockchain nodes.
  • the present application can respond to the text summary generation instruction, obtain the data to be processed according to the text summary generation instruction, and perform segmentation processing on the data to be processed to obtain multiple clauses.
  • sentence segmentation uses the specific dictionary associated with the task scenario so that business-related terms are segmented well; the mutual recommendation degree between each two of the multiple clauses is calculated while retaining only nouns, verbs, adjectives and adverbs, the four parts of speech closely related to sentence semantics; and when calculating the common-word score, business words of different importance are given different weights.
  • the above-mentioned apparatus for generating a text summary can be realized in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 3 .
  • FIG. 3 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502 connected through a system bus 501 , a memory and a network interface 505 , wherein the memory may include a storage medium 503 and an internal memory 504 .
  • the storage medium 503 can store an operating system 5031 and a computer program 5032 .
  • when executed, the computer program 5032 can cause the processor 502 to execute the method for generating a text summary.
  • the processor 502 is used to provide calculation and control capabilities and support the operation of the entire computer device 500 .
  • the internal memory 504 provides an environment for the running of the computer program 5032 in the storage medium 503.
  • the processor 502 can execute the method for generating a text summary.
  • the network interface 505 is used for network communication, such as providing data transmission and the like.
  • the structure shown in FIG. 3 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation to the computer device 500 on which the solution of this application is applied.
  • the specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run a computer program 5032 stored in the memory, so as to implement the method for generating a text abstract disclosed in the embodiment of the present application.
  • the embodiment of the computer device shown in FIG. 3 does not constitute a limitation on the specific composition of the computer device.
  • the computer device may include more or fewer components than those shown in the figure, combine certain components, or use a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in FIG. 3 , and will not be repeated here.
  • the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • A computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • The computer-readable storage medium stores a computer program which, when executed by a processor, implements the text abstract generation method disclosed in the embodiments of the present application.
  • The computer-readable storage medium may be non-volatile or volatile.
  • The disclosed apparatuses, devices and methods may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division of the units is only a division by logical function.
  • In actual implementation there may be other ways of division: units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
  • The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
  • Each functional unit in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • The above integrated units may be implemented in the form of hardware or in the form of software functional units.
  • If an integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, and the computer software product is stored in a storage medium.
  • The computer device may be a personal computer, a server, a network device or the like.
  • The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application, which relates to the field of artificial intelligence, concerns a text abstract generation method and apparatus, a computer device and a storage medium. The defects whereby a classic TextRank algorithm, when calculating plain-text similarity, neither distinguishes the importance of different terms nor filters out unimportant words according to the parts of speech to which those words belong can be effectively overcome, which improves the likelihood of selecting sentences of high business relevance for an abstract. The front-to-back adjacency relationship between sentences and the positions of the sentences in the original article are fully taken into account in the modelling process, which effectively alleviates the problem of imprecise text abstract generation caused by the classic approach ignoring the importance of the sequence of sentence positions in an article. Post-processing is added on the basis of classic text abstract generation, and the abstract result obtained by means of a graph algorithm is corrected, which improves the quality of the final abstract and thereby achieves more precise text abstract generation based on artificial intelligence. The present application further relates to blockchain technology, and an abstract sentence can be stored in a blockchain node.
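The graph-based ranking pipeline summarized in the abstract (sentence similarity, iterative graph scoring, position weighting, and reordering as post-processing) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the claimed method: the sentence splitter, the word-overlap similarity, the `position_bonus` weighting and all function names below are demonstration choices, and the application's part-of-speech filtering and term-importance weighting are omitted.

```python
import math
import re

def split_sentences(text):
    # Naive splitter on terminal punctuation (an assumption; the application's
    # own clause segmentation rules are not specified in the abstract).
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def similarity(a, b):
    # Classic TextRank similarity: shared words normalised by log sentence lengths.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_scores(sentences, d=0.85, iters=50, position_bonus=0.15):
    # Power iteration over the sentence-similarity graph (the PageRank-style
    # update used by TextRank), followed by an illustrative position weighting
    # that boosts earlier sentences, reflecting the abstract's point that a
    # sentence's position in the article matters.
    n = len(sentences)
    sim = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0] * n
    for _ in range(iters):
        nxt = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(sim[j])
                if sim[j][i] > 0 and out > 0:
                    rank += sim[j][i] / out * scores[j]
            nxt.append((1 - d) + d * rank)
        scores = nxt
    return [s * (1 + position_bonus * (n - i) / n) for i, s in enumerate(scores)]

def summarize(text, k=2):
    # Pick the k top-scoring sentences, then restore document order as a
    # simple post-processing step so the summary reads naturally.
    sents = split_sentences(text)
    scores = textrank_scores(sents)
    top = sorted(range(len(sents)), key=lambda i: scores[i], reverse=True)[:k]
    return [sents[i] for i in sorted(top)]
```

Sentences that share vocabulary with many other sentences accumulate rank, while the final reordering step plays the role of the post-processing correction the abstract describes.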
PCT/CN2022/071791 2021-06-18 2022-01-13 Text abstract generation method and apparatus, computer device and storage medium WO2022262266A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110679639.7A CN113254593B (zh) 2021-06-18 Text abstract generation method and apparatus, computer device and storage medium
CN202110679639.7 2021-06-18

Publications (1)

Publication Number Publication Date
WO2022262266A1 true WO2022262266A1 (fr) 2022-12-22

Family

ID=77188647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/071791 WO2022262266A1 (fr) Text abstract generation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN113254593B (fr)
WO (1) WO2022262266A1 (fr)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254593B (zh) 2021-06-18 2021-10-19 平安科技(深圳)有限公司 Text abstract generation method and apparatus, computer device and storage medium
CN113590811A (zh) * 2021-08-19 2021-11-02 平安国际智慧城市科技股份有限公司 Text abstract generation method and apparatus, electronic device and storage medium
CN113779978B (zh) * 2021-09-26 2024-05-24 上海一者信息科技有限公司 Unsupervised cross-language sentence alignment implementation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016125949A1 (fr) * 2015-02-02 2016-08-11 숭실대학교 산학협력단 Automatic document summarization method and server
CN107133213A (zh) * 2017-05-06 2017-09-05 广东药科大学 Algorithm-based automatic text abstract extraction method and system
CN110781291A (zh) * 2019-10-25 2020-02-11 北京市计算中心 Text abstract extraction method and apparatus, server and readable storage medium
CN111858912A (zh) * 2020-07-03 2020-10-30 黑龙江阳光惠远知识产权运营有限公司 Abstract generation method based on a single long text
CN112347241A (zh) * 2020-11-10 2021-02-09 华夏幸福产业投资有限公司 Abstract extraction method and apparatus, device and storage medium
CN113254593A (zh) 2021-06-18 2021-08-13 平安科技(深圳)有限公司 Text abstract generation method and apparatus, computer device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955772B (zh) * 2011-08-17 2015-11-25 北京百度网讯科技有限公司 Semantics-based similarity calculation method and device
CN112347240A (zh) * 2020-10-16 2021-02-09 小牛思拓(北京)科技有限公司 Text abstract extraction method and apparatus, readable storage medium and electronic device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI NANA, LIU PEIYU; LIU WENFENG; LIU WEITONG: "Automatic digest optimization algorithm based on TextRank", APPLICATION RESEARCH OF COMPUTERS, CHENGDU, CN, vol. 36, no. 4, 30 April 2019 (2019-04-30), CN , pages 1045 - 1050, XP093015759, ISSN: 1001-3695, DOI: 10.19734/j.issn.1001-3695.2017.11.0786 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188125B (zh) * 2023-03-10 2024-05-31 深圳市伙伴行网络科技有限公司 Investment attraction management method and apparatus for office buildings, electronic device and storage medium
CN116628186A (zh) * 2023-07-17 2023-08-22 乐麦信息技术(杭州)有限公司 Text abstract generation method and system
CN116628186B (zh) * 2023-07-17 2023-10-24 乐麦信息技术(杭州)有限公司 Text abstract generation method and system

Also Published As

Publication number Publication date
CN113254593A (zh) 2021-08-13
CN113254593B (zh) 2021-10-19

Similar Documents

Publication Publication Date Title
WO2022262266A1 (fr) Text abstract generation method and apparatus, computer device and storage medium
CN110993081B (zh) 一种医生在线推荐方法及系统
WO2019153551A1 (fr) Procédé et appareil de classification d'articles, dispositif informatique et support de stockage
CN107193959B (zh) 一种面向纯文本的企业实体分类方法
CN105824922B (zh) 一种融合深层特征和浅层特征的情感分类方法
WO2019200806A1 (fr) Dispositif de génération d'un modèle de classification de texte, procédé et support d'informations lisible par ordinateur
WO2017167067A1 (fr) Procédé et dispositif pour une classification de texte de page internet, procédé et dispositif pour une reconnaissance de texte de page internet
CN109670039B (zh) 基于三部图和聚类分析的半监督电商评论情感分析方法
RU2686000C1 (ru) Извлечение информационных объектов с использованием комбинации классификаторов, анализирующих локальные и нелокальные признаки
WO2022126810A1 (fr) Procédé de groupement de textes
RU2679988C1 (ru) Извлечение информационных объектов с помощью комбинации классификаторов
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN112347778A (zh) 关键词抽取方法、装置、终端设备及存储介质
CN110083832B (zh) 文章转载关系的识别方法、装置、设备及可读存储介质
US11593557B2 (en) Domain-specific grammar correction system, server and method for academic text
JP7281905B2 (ja) 文書評価装置、文書評価方法及びプログラム
CN112989208B (zh) 一种信息推荐方法、装置、电子设备及存储介质
CN112966508B (zh) 一种通用自动术语提取方法
CN109325122A (zh) 词表生成方法、文本分类方法、装置、设备及存储介质
CN113127607A (zh) 文本数据标注方法、装置、电子设备及可读存储介质
CN112307336A (zh) 热点资讯挖掘与预览方法、装置、计算机设备及存储介质
CN115600605A (zh) 一种中文实体关系联合抽取方法、系统、设备及存储介质
CN115146062A (zh) 融合专家推荐与文本聚类的智能事件分析方法和系统
US11650996B1 (en) Determining query intent and complexity using machine learning
CN115248890B (zh) 用户兴趣画像的生成方法、装置、电子设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22823762

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE