WO2022078308A1 - Method and apparatus for generating judgment document abstract, and electronic device and readable storage medium

Publication number
WO2022078308A1
WO2022078308A1 PCT/CN2021/123175 CN2021123175W WO2022078308A1 WO 2022078308 A1 WO2022078308 A1 WO 2022078308A1 CN 2021123175 W CN2021123175 W CN 2021123175W WO 2022078308 A1 WO2022078308 A1 WO 2022078308A1
Authority
WO
WIPO (PCT)
Prior art keywords
paragraph
template
abstract
short sentence
category
Application number
PCT/CN2021/123175
Other languages
French (fr)
Chinese (zh)
Inventor
曹辰捷
徐国强
陈家豪
Original Assignee
深圳壹账通智能科技有限公司
Application filed by 深圳壹账通智能科技有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, device, electronic device, and readable storage medium for generating a summary of a judgment document.
  • the method for generating the abstract of the judgment document provided in this application includes:
  • the present application also provides a device for generating a summary of a judgment document, the device comprising:
  • a parsing module configured to parse a request, sent by a user through a client, for generating a judgment document summary, and to obtain the judgment document carried by the request;
  • an input module configured to input the judgment document into the trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, where the paragraph categories include a first category and a second category, and to take the set of first-category paragraphs in the judgment document as the paragraph set;
  • a matching module configured to perform similarity matching between each paragraph in the paragraph set and each short sentence template in the preconfigured summary template, to obtain a target short sentence template corresponding to each paragraph in the paragraph set;
  • a splicing module configured to input each paragraph in the paragraph set and its corresponding target short sentence template into the trained summary generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and to splice the target abstract short sentences according to the position order, in the abstract template, of the target short sentence template corresponding to each paragraph, to obtain the abstract text corresponding to the judgment document.
  • the present application also provides an electronic device, the electronic device comprising:
  • the memory stores a judgment document summary generation program executable by the at least one processor, and the judgment document summary generation program is executed by the at least one processor, so that the at least one processor can perform the following steps:
  • the present application also provides a computer-readable storage medium, where a program for generating a summary of a judgment document is stored thereon, and the program for generating a summary of a judgment document can be executed by one or more processors to implement the following steps:
  • FIG. 1 is a schematic flowchart of a method for generating a judgment document abstract according to an embodiment of the present application
  • FIG. 2 is a schematic block diagram of an apparatus for generating a summary of a judgment document provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of an electronic device for implementing a method for generating a judgment document summary provided by an embodiment of the present application;
  • the embodiments of the present application may acquire and process related data based on artificial intelligence technology.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the present application provides a method for generating an abstract of a judgment document.
  • FIG. 1 shows a schematic flowchart of a method for generating a summary of a judgment document provided by an embodiment of the present application.
  • the method may be performed by an electronic device, which may be implemented by software and/or hardware.
  • the method for generating a summary of a judgment document includes:
  • the paragraph categories include a first category and a second category, and the set of first-category paragraphs in the judgment document is taken as the paragraph set.
  • the length of judgment documents is mainly distributed between 2000 and 8000 words, and the length of abstracts is mainly distributed between 200 and 600 words.
  • current Chinese generation models cannot accommodate input and output of this size, so the important paragraphs are first extracted into a paragraph set to compress the scale of the information input to the summary generation model.
  • the paragraph category recognition model is a roberta-large-wwm model, which is used to determine whether each paragraph in the input judgment document belongs to the first category or the second category, where the first category is an important paragraph and the second category is an ordinary paragraph.
  • the roberta-large-wwm model is a derivative of the BERT-large model and contains 24 transformer layers, 16 attention heads, and 1024 hidden layer units.
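The architecture figures quoted above can be kept in one place as a small configuration object. This is only an illustrative sketch: the class name `EncoderConfig` and its field names are hypothetical and do not come from the application or from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the figures quoted in the text."""
    num_layers: int = 24      # transformer layers
    num_heads: int = 16       # attention heads per layer
    hidden_size: int = 1024   # hidden layer units

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


config = EncoderConfig()
print(config.head_dim)  # 64
```

With these values each attention head operates on 1024 / 16 = 64 dimensions.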
  • the training process of the paragraph category recognition model includes:
  • A1. Obtain multiple preset indicators corresponding to the paragraph categories of judgment documents, and label the paragraph category of each paragraph in the first judgment document sample in the first database based on the multiple preset indicators;
  • the preset indicators include: the relationship between the plaintiff and the defendant, the plaintiff's claim, the court's opinion, the focus of the dispute, the statement and opinion of the legal facts, and the trial result.
  • the paragraphs associated with the above six preset indicators in the first judgment document sample are marked as the first category (important paragraphs), and all other paragraphs are marked as the second category (ordinary paragraphs).
  • A3. Determine the true paragraph category of each paragraph in the first judgment document sample based on the annotation information, and determine the structural parameters of the paragraph category recognition model by minimizing the loss value between the predicted paragraph category and the true paragraph category, to obtain the trained paragraph category recognition model.
  • $q_i$ is the predicted paragraph category of the $i$-th paragraph in the first judgment document sample
  • $p_i$ is the true paragraph category of the $i$-th paragraph in the first judgment document sample
  • $c$ is the total number of paragraphs in the first judgment document sample
  • $\mathrm{loss}(q_i, p_i)$ is the loss value between the predicted paragraph category and the true paragraph category of the $i$-th paragraph in the first judgment document sample.
  • a preset threshold (for example, 0.7)
  • the paragraph category recognition model extracts the important paragraphs from the judgment document, which compresses the information scale, prevents the input to the summary generation model from being too long and overflowing, and preserves the integrity of that input, so that the summary produced by the summary generation model is more accurate.
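The training objective and the thresholding step can be sketched as follows. The source text does not preserve the loss formula or say how the preset threshold is applied, so several points here are assumptions: that $q_i$ is a predicted probability of the first category, that the per-paragraph loss is binary cross-entropy averaged over the $c$ paragraphs, and that a paragraph is kept when its probability reaches the threshold. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_bce_loss(predicted: Sequence[float], actual: Sequence[int]) -> float:
    """Average binary cross-entropy over the c paragraphs of one sample.

    Assumption: predicted[i] is the probability that paragraph i is in the
    first category; actual[i] is 1 (first category) or 0 (second category).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and aligned")
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for q_i, p_i in zip(predicted, actual):
        q_i = min(max(q_i, eps), 1.0 - eps)
        total += -(p_i * math.log(q_i) + (1 - p_i) * math.log(1.0 - q_i))
    return total / len(predicted)


def select_important(paragraphs: Sequence[str],
                     predicted: Sequence[float],
                     threshold: float = 0.7) -> List[str]:
    """Keep paragraphs whose first-category probability reaches the threshold."""
    return [p for p, q in zip(paragraphs, predicted) if q >= threshold]
```

For example, `select_important(["a", "b", "c"], [0.9, 0.2, 0.7])` keeps `"a"` and `"c"` under the example threshold of 0.7.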
  • paragraphs in the paragraph set may still contain redundant information (some paragraphs may have more than 500 words), and they do not necessarily read coherently one after another, so they cannot be spliced directly into an abstract.
  • a summary template is preconfigured (the summary template covers the above six preset indicators). An example of the summary template is as follows: The plaintiff and the defendant have a relationship of XXXX. The plaintiff filed a claim requesting that the defendant be ordered to pay.... The defendant argued that the plaintiff's claim had no factual or legal basis. Upon finding out... This court supports the plaintiff's above request. According to Article X of the "Contract Law of the People's Republic of China" ... it is adjudged: 1. The defendant shall pay the plaintiff XX fees. 2. The plaintiff's other claims are rejected. If the obligation to pay money is not fulfilled within the period specified in this judgment, double interest on the debt shall be paid for the period of delayed performance.
  • the similarity matching is performed between each paragraph in the paragraph set and each short sentence template in the pre-configured summary template, and the target short sentence template corresponding to each paragraph in the paragraph set is obtained, including:
  • $p_i$ is the $i$-th paragraph in the paragraph set
  • $a_j$ is the $j$-th short sentence template in the abstract template
  • $\mathrm{LCS}(p_i, a_j)$ is the length of the longest common subsequence of the $i$-th paragraph in the paragraph set and the $j$-th short sentence template in the abstract template
  • $\mathrm{len}(a_j)$ is the length of the $j$-th short sentence template in the abstract template
  • $\mathrm{len}(p_i)$ is the length of the $i$-th paragraph in the paragraph set
  • $\mathrm{LCSR}(p_i, a_j)$ is the upper limit of the length ratio of the longest common subsequence between the $i$-th paragraph in the paragraph set and the $j$-th short sentence template in the abstract template
  • $\mathrm{LCSP}(p_i, a_j)$ relates the $i$-th paragraph in the paragraph set to the $j$-th short sentence template in the abstract template
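The quantities defined above can be illustrated with a short longest-common-subsequence sketch. The formulas that combine LCS, len(a_j) and len(p_i) into LCSR and LCSP are not preserved in this text, so the code does not claim to reproduce them. It simply computes the LCS length and its ratio to each of the two lengths, then, as an assumption made only for illustration, ranks templates by the ratio to the template length and breaks ties with the ratio to the paragraph length. All function names are hypothetical.

```python
from typing import Sequence, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratios(paragraph: str, template: str) -> Tuple[float, float]:
    """Return (LCS / len(template), LCS / len(paragraph))."""
    common = lcs_length(paragraph, template)
    ratio_to_template = common / len(template) if template else 0.0
    ratio_to_paragraph = common / len(paragraph) if paragraph else 0.0
    return ratio_to_template, ratio_to_paragraph


def best_template(paragraph: str, templates: Sequence[str]) -> int:
    """Index of the best-matching template (illustrative ranking rule only)."""
    if not templates:
        raise ValueError("templates must not be empty")
    scores = [lcs_ratios(paragraph, t) for t in templates]
    # Tuples compare element by element: template ratio first, then paragraph ratio.
    return max(range(len(templates)), key=lambda j: scores[j])
```

A character-level comparison is used here because the source measures lengths of Chinese text; a token-level variant would only change what the two sequences contain.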
  • the method further includes:
  • the short sentence template is used as the target short sentence template corresponding to the specified paragraph.
  • the method further includes:
  • the method further includes:
  • the summary generation model is also a roberta-large-wwm model, which is used to generate summary text according to paragraph information.
  • the paragraph category recognition model and the summary generation model in this scheme use different input samples and different training targets, so the model parameters obtained by training also differ.
  • the training process of the abstract generation model includes:
  • C3. Determine the structural parameters of the summary generation model by minimizing the loss value between the real content corresponding to the mask and the predicted content, so as to obtain a trained summary generation model.
  • the abstract generation model predicts the probability distribution of the next token from all the preceding tokens (words) in each second judgment document sample.
  • in this training task, to suit abstract generation, one part of the text content is kept as known text (25% to 75% of the content of each second judgment document sample), and the remaining part (75% to 25% of the content of each second judgment document sample) is covered by mask characters.
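The known-text and masked-text split described above might be prepared along these lines. This is a sketch under stated assumptions: the known text is taken to be a contiguous prefix, the kept fraction is drawn uniformly between 25% and 75%, and `[MASK]` stands in for whatever mask character the model actually uses. None of these details is confirmed by the source, and the function name is hypothetical.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # placeholder mask symbol (assumption)


def split_known_and_masked(tokens: List[str],
                           rng: random.Random,
                           min_known: float = 0.25,
                           max_known: float = 0.75) -> Tuple[List[str], List[str]]:
    """Keep a prefix of the sample as known text and mask the remainder.

    Returns (model_input, targets): model_input has the same length as tokens,
    with the tail replaced by MASK; targets are the original masked tokens.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to split")
    known_fraction = rng.uniform(min_known, max_known)
    cut = round(len(tokens) * known_fraction)
    cut = max(1, min(len(tokens) - 1, cut))  # keep both parts non-empty
    model_input = tokens[:cut] + [MASK] * (len(tokens) - cut)
    targets = tokens[cut:]
    return model_input, targets
```

During training, the loss in step C3 would then compare the model's predictions at the masked positions with `targets`.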
  • the judgment document is input into the trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, where the paragraph categories include the first category (that is, important paragraphs) and the second category (that is, ordinary paragraphs), and the set of first-category paragraphs in the judgment document is used as the paragraph set.
  • in this step, the important paragraphs in the judgment document are extracted into the paragraph set by the paragraph category recognition model, which compresses the information scale and avoids the situation in which the information later input to the summary generation model is too long and overflows, leaving the generated summary incomplete and inaccurate;
  • each paragraph in the paragraph set is then matched for similarity against each short sentence template in the abstract template to obtain the target short sentence template corresponding to each paragraph in the paragraph set. This step further compresses the information scale by matching the similarity between the paragraphs in the paragraph set and the short sentence templates in the abstract template;
  • finally, each paragraph in the paragraph set and its corresponding target short sentence template are input into the trained summary generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and the target abstract short sentences are spliced according to the position order, in the abstract template, of the target short sentence template corresponding to each paragraph, which ensures the coherence of the abstract. Therefore, this application ensures the coherence and accuracy of the abstract of the judgment document.
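The splicing step can be sketched as a sort on template position followed by a join. The input shape used here, a list of (template position, generated sentence) pairs, and the function name are assumptions made for illustration.

```python
from typing import List, Tuple


def splice_abstract(generated: List[Tuple[int, str]], sep: str = "") -> str:
    """Join generated short sentences in the order of their template positions.

    sorted() is stable, so sentences that share a template position keep the
    order in which their paragraphs appeared in the judgment document.
    """
    ordered = sorted(generated, key=lambda item: item[0])
    return sep.join(sentence for _, sentence in ordered)


pairs = [(2, "The court orders payment."),
         (0, "The parties signed a contract."),
         (1, "The plaintiff claims the fee.")]
print(splice_abstract(pairs, sep=" "))
```

The default separator is empty because Chinese sentences are written without spaces between them; the example passes a space for English text.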
  • FIG. 2 is a schematic block diagram of an apparatus for generating a summary of a judgment document provided by an embodiment of the present application.
  • the apparatus 100 for generating a summary of a judgment document described in this application may be installed in an electronic device. According to the functions implemented, the apparatus 100 for generating a summary of a judgment document may include a parsing module 110 , an input module 120 , a matching module 130 and a splicing module 140 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • the functions of each module/unit are as follows:
  • the parsing module 110 is configured to parse a request, sent by a user through a client, for generating a judgment document summary, and to obtain the judgment document carried by the request;
  • the input module 120 is used to input the judgment document into the trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, where the paragraph categories include the first category and the second category, and the set of first-category paragraphs in the judgment document is taken as the paragraph set.
  • the matching module 130 is configured to perform similarity matching between each paragraph in the paragraph set and each short sentence template in the preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
  • the matching module 130 is also used for:
  • the short sentence template is used as the target short sentence template corresponding to the specified paragraph.
  • the matching module 130 is further configured to:
  • the matching module 130 is further configured to:
  • the splicing module 140 is used to input each paragraph in the paragraph set and its corresponding target short sentence template into the trained summary generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and to splice the target abstract short sentences according to the position order, in the abstract template, of the target short sentence template corresponding to each paragraph, so as to obtain the abstract text corresponding to the judgment document.
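The way the four modules hand data to one another can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the class name, the request shape (a dictionary with a `document` key), splitting paragraphs on newlines, and the three injected callables that stand in for the trained models and the matcher.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SummaryApparatus:
    """Hypothetical wiring of the parsing, input, matching and splicing modules."""
    classify: Callable[[str], bool]       # True for a first-category paragraph
    match: Callable[[str], int]           # position of the target template
    generate: Callable[[str, int], str]   # paragraph + template position -> sentence

    def parse(self, request: Dict[str, str]) -> str:
        # Parsing module: pull the judgment document out of the request.
        return request["document"]

    def run(self, request: Dict[str, str], sep: str = " ") -> str:
        document = self.parse(request)
        paragraphs = [p for p in document.split("\n") if p.strip()]
        # Input module: keep only first-category (important) paragraphs.
        paragraph_set = [p for p in paragraphs if self.classify(p)]
        # Matching module: find the target template position for each paragraph.
        matched: List[Tuple[int, str]] = [(self.match(p), p) for p in paragraph_set]
        # Splicing module: generate a short sentence per paragraph, then order them.
        generated = [(pos, self.generate(p, pos)) for pos, p in matched]
        generated.sort(key=lambda item: item[0])
        return sep.join(sentence for _, sentence in generated)


# Toy stand-ins: "!" marks an important paragraph, the digit is its template position.
apparatus = SummaryApparatus(
    classify=lambda p: p.startswith("!"),
    match=lambda p: int(p[1]),
    generate=lambda p, pos: p[2:].strip().upper(),
)
print(apparatus.run({"document": "!1 second\nnoise\n!0 first"}))  # FIRST SECOND
```

Injecting the models as callables keeps the sketch independent of any particular model library.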
  • FIG. 3 is a schematic structural diagram of an electronic device for implementing a method for generating a summary of a judgment document provided by an embodiment of the present application.
  • the electronic device 1 is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions.
  • the electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud based on cloud computing composed of a large number of hosts or network servers, where cloud computing is a kind of distributed computing in which a collection of loosely coupled computers forms a super virtual computer.
  • the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can be communicatively connected to each other through a system bus.
  • the abstract generation program 10 is executable by the processor 12 .
  • FIG. 3 only shows the electronic device 1 having the components 11-13 and the judgment document abstract generating program 10. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, combine some components, or use a different arrangement of components.
  • the memory 11 includes main memory and at least one type of readable storage medium.
  • the main memory provides a cache for the operation of the electronic device 1;
  • the readable storage medium can be, for example, flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disk, optical disk, or another non-volatile storage medium.
  • in some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a pluggable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
  • the readable storage medium of the memory 11 is generally used to store the operating system and various application software installed in the electronic device 1 , for example, to store the code of the judgment document abstract generating program 10 in an embodiment of the present application.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 12 is generally used to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices.
  • the processor 12 is configured to run the program code or process data stored in the memory 11 , for example, run the judgment document summary generation program 10 and the like.
  • the network interface 13 may include a wireless network interface or a wired network interface, and the network interface 13 is used to establish a communication connection between the electronic device 1 and a client (not shown in the figure).
  • the electronic device 1 may further include a user interface, which may include a display and an input unit such as a keyboard; optionally, the user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the judgment document summary generation program 10 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions, and when running in the processor 12, can realize:
  • for the specific implementation of the above-mentioned judgment document abstract generating program 10 by the processor 12, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, and details are not repeated here. It should be emphasized that, to further ensure the privacy and security of the above-mentioned judgment documents, the judgment documents can also be stored in a node of a blockchain.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
  • the computer-readable storage medium stores a judgment document summary generation program 10, and the judgment document summary generation program 10 can be executed by one or more processors to realize the following steps:
  • modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
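The idea of data blocks linked by cryptographic methods can be illustrated with a minimal hash chain. This is a generic teaching sketch, not the storage scheme of the application: the block layout, the use of SHA-256, the all-zero starting hash and the function names are all assumptions, and a real blockchain adds consensus and peer-to-peer distribution on top.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_PREV = "0" * 64  # conventional placeholder for "no previous block"


def make_block(prev_hash: str, records: List[str]) -> Dict[str, object]:
    """Build a block whose hash covers both its records and the previous hash."""
    payload = json.dumps({"prev_hash": prev_hash, "records": records}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"prev_hash": prev_hash, "records": records, "hash": digest}


def verify_chain(chain: List[Dict[str, object]]) -> bool:
    """Check every stored hash and every link to the preceding block."""
    expected_prev = GENESIS_PREV
    for block in chain:
        rebuilt = make_block(block["prev_hash"], block["records"])
        if block["prev_hash"] != expected_prev or block["hash"] != rebuilt["hash"]:
            return False
        expected_prev = block["hash"]
    return True


genesis = make_block(GENESIS_PREV, ["judgment document A"])
second = make_block(genesis["hash"], ["judgment document B"])
chain = [genesis, second]
print(verify_chain(chain))  # True
```

Changing the records of any stored block without recomputing every later hash makes `verify_chain` return `False`, which is the anti-counterfeiting property the text refers to.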

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed is a method for generating a judgment document abstract. The method comprises: inputting a judgment document into a trained paragraph category identification model, so as to obtain paragraph categories of paragraphs in the judgment document, and taking a set of paragraphs of a first category as a paragraph set; respectively performing similarity matching on each paragraph in the paragraph set and a short sentence template in an abstract template, so as to obtain target short sentence templates corresponding to the paragraphs in the paragraph set; and inputting the paragraphs and the target short sentence templates corresponding thereto into a trained abstract generation model, so as to obtain target abstract short sentences corresponding to the paragraphs in the paragraph set, and combining the target abstract short sentences according to the position order, in the abstract template, of the target short sentence templates corresponding to the paragraphs, so as to obtain abstract text corresponding to the judgment document. Further provided are an apparatus for generating a judgment document abstract, and an electronic device and a readable storage medium. By means of the present application, the consistency and accuracy of a judgment document abstract are guaranteed.

Description

Method, Apparatus, Electronic Device and Readable Storage Medium for Generating a Judgment Document Abstract
This application claims priority to Chinese patent application No. CN202011087426.7, filed with the China Patent Office on October 12, 2020 and entitled "Method, Apparatus, Electronic Device and Readable Storage Medium for Generating Judgment Document Abstracts", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a method, an apparatus, an electronic device and a readable storage medium for generating a judgment document abstract.
Background
With the development of the information age, abstract generation is used ever more widely in daily life. One example is abstract generation for judgment documents: by browsing the abstract, a reader can quickly grasp the outline and key information of the judgment document and save reading time.
The inventors realized that judgment documents are written in a standardized form but are detailed and lengthy. At present, an abstract is usually generated by extracting heavily weighted words, phrases and sentences from the judgment document and combining them. Abstracts generated in this way have poor semantic coherence and do not effectively integrate legal and adjudication knowledge, so the resulting abstract is incoherent and inaccurate. A method for generating judgment document abstracts is therefore urgently needed to ensure the coherence and accuracy of the abstract.
Summary of the Invention
The method for generating a judgment document abstract provided by the present application includes:
parsing a judgment document abstract generation request sent by a user through a client, and obtaining the judgment document carried in the request;
inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories including a first category and a second category, and taking the set of first-category paragraphs in the judgment document as a paragraph set;
performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
inputting each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences according to the position order, in the abstract template, of the target short sentence templates corresponding to the paragraphs, to obtain the abstract text corresponding to the judgment document.
The present application further provides an apparatus for generating a judgment document abstract, the apparatus including:
a parsing module, configured to parse a judgment document abstract generation request sent by a user through a client, and obtain the judgment document carried in the request;
an input module, configured to input the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories including a first category and a second category, and to take the set of first-category paragraphs in the judgment document as a paragraph set;
a matching module, configured to perform similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
a splicing module, configured to input each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and to splice the target abstract short sentences according to the position order, in the abstract template, of the target short sentence templates corresponding to the paragraphs, to obtain the abstract text corresponding to the judgment document.
The present application further provides an electronic device, the electronic device including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a judgment document abstract generation program executable by the at least one processor, and the program is executed by the at least one processor so that the at least one processor can perform the following steps:
parsing a judgment document abstract generation request sent by a user through a client, and obtaining the judgment document carried in the request;
inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories including a first category and a second category, and taking the set of first-category paragraphs in the judgment document as a paragraph set;
performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
inputting each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences according to the position order, in the abstract template, of the target short sentence templates corresponding to the paragraphs, to obtain the abstract text corresponding to the judgment document.
The present application further provides a computer-readable storage medium on which a judgment document abstract generation program is stored, the program being executable by one or more processors to implement the following steps:
parsing a judgment document abstract generation request sent by a user through a client, and obtaining the judgment document carried in the request;
inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories including a first category and a second category, and taking the set of first-category paragraphs in the judgment document as a paragraph set;
performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
inputting each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences according to the position order, in the abstract template, of the target short sentence templates corresponding to the paragraphs, to obtain the abstract text corresponding to the judgment document.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for generating a judgment document abstract provided by an embodiment of the present application;
FIG. 2 is a schematic block diagram of an apparatus for generating a judgment document abstract provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device that implements the method for generating a judgment document abstract provided by an embodiment of the present application;
The realization of the objectives, the functional features and the advantages of the present application are further described below with reference to the embodiments and the accompanying drawings.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and do not limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application, without creative effort, fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second" and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that a person of ordinary skill in the art is able to implement the combination. Where a combination of technical solutions is contradictory or cannot be implemented, the combination is deemed not to exist and is outside the scope of protection claimed by the present application.
The embodiments of the present application may acquire and process relevant data on the basis of artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, technology and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best result.
Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems and mechatronics. Artificial intelligence software technologies mainly cover computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The present application provides a method for generating a judgment document abstract. FIG. 1 is a schematic flowchart of the method provided by an embodiment of the present application. The method may be performed by an electronic device, and the electronic device may be implemented in software and/or hardware.
In this embodiment, the method for generating a judgment document abstract includes:
S1. Parse a judgment document abstract generation request sent by a user through a client, and obtain the judgment document carried in the request;
S2. Input the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories including a first category and a second category, and take the set of first-category paragraphs in the judgment document as a paragraph set.
Judgment documents today are mostly 2000 to 8000 characters long, and their abstracts are mostly 200 to 600 characters long. Current Chinese generation models cannot accommodate inputs and outputs of this size. This embodiment therefore uses the paragraph category recognition model to extract the important paragraphs of the judgment document as the paragraph set, which reduces the amount of information fed into the abstract generation model.
The paragraph category recognition model is a roberta-large-wwm model. It determines whether each paragraph of the input judgment document belongs to the first category or the second category, where the first category is important paragraphs and the second category is ordinary paragraphs. The roberta-large-wwm model is derived from the BERT-large model and contains 24 transformer layers, 16 attention heads and 1024 hidden units.
The training process of the paragraph category recognition model includes:
A1. Obtain multiple preset indicators corresponding to the paragraph categories of judgment documents, and label the paragraph categories of the first judgment document samples in a first database on the basis of the preset indicators;
In this embodiment, the preset indicators are: the relationship between plaintiff and defendant, the plaintiff's claims, the defendant's opinion, the focus of the dispute, the statement of and opinion on the legal facts, and the judgment result. Paragraphs in a first judgment document sample that are associated with these six preset indicators are labeled as the first category (important paragraphs), and all other paragraphs are labeled as the second category (ordinary paragraphs).
A2. Input the first judgment document sample carrying the label information into the paragraph category recognition model to obtain the predicted paragraph category of each paragraph in the first judgment document sample;
A3. Determine the true paragraph category of each paragraph in the first judgment document sample from the label information, and determine the structural parameters of the paragraph category recognition model by minimizing the loss value between the predicted and true paragraph categories, to obtain the trained paragraph category recognition model.
The loss value is calculated as:
Figure PCTCN2021123175-appb-000001
where q_i is the predicted paragraph category of the i-th paragraph in the first judgment document sample, p_i is the true paragraph category of the i-th paragraph in the first judgment document sample, c is the total number of paragraphs in the first judgment document sample, and loss(q_i, p_i) is the loss value between the predicted and true paragraph categories of the i-th paragraph in the first judgment document sample.
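The loss formula itself is contained in a figure that is not reproduced in this text. As an illustrative, non-limiting sketch only, the following assumes that the loss is a binary cross-entropy averaged over the c paragraphs of a sample, with q_i taken as the predicted probability of the first category and p_i as a 0/1 label. The function name `paragraph_loss` and the clipping constant `eps` are hypothetical and do not come from the application.

```python
import math


def paragraph_loss(predicted, actual, eps=1e-12):
    """Mean binary cross-entropy over the paragraphs of one sample.

    ASSUMPTION: the application's figure is not available here, so this
    standard form is a stand-in, not the formula of the application.
    predicted: q_i, probability that paragraph i is first category.
    actual:    p_i, 1 for first category, 0 for second category.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    c = len(predicted)  # total number of paragraphs in the sample
    total = 0.0
    for q, p in zip(predicted, actual):
        q = min(max(q, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(p * math.log(q) + (1 - p) * math.log(1.0 - q))
    return total / c
```

Under this assumed form, a confident correct prediction yields a smaller loss than an uncertain one, which is the behavior step A3 relies on when minimizing the loss.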
Inputting a judgment document for which an abstract is to be generated into the trained paragraph category recognition model yields, for each paragraph, the probability that the paragraph belongs to the first category. When the probability for a paragraph is greater than a preset threshold (for example, 0.7), the paragraph is classified as first category. The set of first-category paragraphs in the judgment document is taken as the paragraph set, and the abstract is subsequently generated from the information in the paragraph set.
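The threshold step can be sketched as follows. This is a minimal illustration that assumes the per-paragraph probabilities have already been produced by the model; the function name `select_important_paragraphs` is hypothetical. The original paragraph index is kept alongside the text so that document order remains available to later steps.

```python
def select_important_paragraphs(paragraphs, probabilities, threshold=0.7):
    """Return (index, text) pairs for paragraphs classified as first category.

    A paragraph is kept only when its first-category probability is strictly
    greater than the threshold, matching the "greater than" wording above.
    """
    if len(paragraphs) != len(probabilities):
        raise ValueError("one probability is required per paragraph")
    return [
        (index, text)
        for index, (text, prob) in enumerate(zip(paragraphs, probabilities))
        if prob > threshold
    ]
```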
This step uses the paragraph category recognition model to extract the important paragraphs of the judgment document. This compresses the amount of information, prevents the input to the abstract generation model from overflowing because it is too long, and keeps the model's input complete, so that the abstract it generates is more accurate.
S3. Perform similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
Paragraphs in the paragraph set may still contain redundant information (some paragraphs may exceed 500 characters), and consecutive paragraphs are not necessarily coherent, so they cannot simply be spliced together as the abstract.
This embodiment preconfigures an abstract template that covers the six preset indicators above. An example abstract template is: The plaintiff and the defendant are in a XXXX relationship. The plaintiff requests that the defendant be ordered to pay…; the defendant argues that the plaintiff's claims have no factual or legal basis; it has been ascertained that…, and liability for breach of contract shall be borne. This court upholds the plaintiff's above request. Pursuant to Article X of the Contract Law of the People's Republic of China…, it is adjudged: 1. The defendant shall pay the plaintiff XX fees. 2. The plaintiff's other claims are dismissed. If the obligation to pay money is not performed within the period specified in this judgment, the interest on the debt for the period of delayed performance shall be doubled.
Performing similarity matching between each paragraph in the paragraph set and each short sentence template in the preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set, includes:
B1. Calculate the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template;
B2. When the longest common subsequence similarity values between a given paragraph and multiple short sentence templates are greater than a similarity threshold, take the short sentence template with the highest similarity value as the target short sentence template corresponding to that paragraph.
The longest common subsequence similarity value is calculated as:
Figure PCTCN2021123175-appb-000002
Figure PCTCN2021123175-appb-000003
where p_i is the i-th paragraph in the paragraph set, a_j is the j-th short sentence template in the abstract template, LCS(p_i, a_j) is the length of the longest common subsequence of the i-th paragraph and the j-th short sentence template, len(a_j) is the length of the j-th short sentence template, len(p_i) is the length of the i-th paragraph, LCSR(p_i, a_j) is the upper bound of the longest common subsequence length ratio between the i-th paragraph and the j-th short sentence template, LCSP(p_i, a_j) is the lower bound of that ratio, and LCSFscore(p_i, a_j) is the longest common subsequence similarity value between the i-th paragraph and the j-th short sentence template.
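The formulas are contained in figures that are not reproduced in this text. As an illustrative sketch only, the following assumes a ROUGE-L style definition: LCSR = LCS / len(a_j), LCSP = LCS / len(p_i), and LCSFscore as the harmonic mean of the two. That reading is consistent with the symbol definitions above but is an assumption; the function names `lcs_length` and `lcs_f_score` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous element of a
    for item_a in a:
        cur = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_f_score(paragraph, template):
    """ASSUMED form of LCSFscore: harmonic mean of LCSR and LCSP."""
    if not paragraph or not template:
        return 0.0
    lcs = lcs_length(paragraph, template)
    lcsr = lcs / len(template)   # ratio relative to the template length
    lcsp = lcs / len(paragraph)  # ratio relative to the paragraph length
    if lcsr + lcsp == 0:
        return 0.0
    return 2 * lcsr * lcsp / (lcsr + lcsp)
```

Because a paragraph is usually longer than a short sentence template, the ratio over len(a_j) is at least as large as the ratio over len(p_i) in that case, which fits the "upper bound" and "lower bound" wording.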
In this embodiment, after the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template is calculated, the method further includes:
if the longest common subsequence similarity value between a given paragraph and only one short sentence template is greater than the similarity threshold, taking that short sentence template as the target short sentence template corresponding to the paragraph.
In this embodiment, after the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template is calculated, the method further includes:
if the longest common subsequence similarity values between a given paragraph and every short sentence template in the abstract template are all less than the similarity threshold, deleting that paragraph from the paragraph set.
In another embodiment of the present application, after the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template is calculated, the method further includes:
if multiple paragraphs in the paragraph set correspond to the same short sentence template, merging those paragraphs, in the order in which they appear in the judgment document, into one new paragraph of the paragraph set.
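The selection, deletion and merging rules above can be sketched together as follows. This is an illustration under stated assumptions: the similarity values are supplied as a precomputed matrix, a value exactly equal to the threshold is treated as not matching (the text does not specify this case), and merged paragraphs are joined with a configurable separator. The function name `assign_templates` is hypothetical.

```python
def assign_templates(paragraphs, scores, threshold, separator=""):
    """Map each template index to the merged text of its matching paragraphs.

    paragraphs: paragraph texts in judgment-document order.
    scores:     scores[i][j] is the similarity of paragraph i to template j.
    Paragraphs whose best score does not exceed the threshold are dropped.
    """
    grouped = {}
    for text, row in zip(paragraphs, scores):
        if not row:
            continue
        best = max(range(len(row)), key=row.__getitem__)  # highest similarity
        if row[best] <= threshold:
            continue  # no template exceeds the threshold: delete paragraph
        grouped.setdefault(best, []).append(text)  # document order preserved
    return {j: separator.join(parts) for j, parts in grouped.items()}
```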
This step performs similarity matching between each paragraph in the paragraph set and each short sentence template in the abstract template, which further compresses the information.
S4. Input each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and splice the target abstract short sentences according to the position order, in the abstract template, of the target short sentence templates corresponding to the paragraphs, to obtain the abstract text corresponding to the judgment document.
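The splicing part of step S4 can be sketched as follows. This is a minimal illustration that assumes the target abstract short sentences have already been generated and are keyed by the position of their target short sentence template in the abstract template. The function name `splice_abstract` and the `separator` parameter are hypothetical.

```python
def splice_abstract(short_sentences, separator=""):
    """Join generated short sentences in abstract-template position order.

    short_sentences: mapping from template position (int) to the target
    abstract short sentence generated for that template.
    """
    return separator.join(short_sentences[pos] for pos in sorted(short_sentences))
```

Ordering by template position rather than by document position is what lets the abstract follow the fixed structure of the template even when the judgment document presents the same information in a different order.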
In this embodiment, the abstract generation model is also a roberta-large-wwm model, and it generates abstract text from paragraph information. The paragraph category recognition model and the abstract generation model in this solution take different input samples and have different training objectives, so the model parameters obtained by training also differ.
The training process of the abstract generation model includes:
C1. Cover a preset proportion of the text content of a second judgment document sample in a second database with mask symbols, to obtain a third judgment document;
C2. Input the third judgment document into the abstract generation model to obtain the predicted content of the masked text;
C3. Determine the structural parameters of the abstract generation model by minimizing the loss value between the true content corresponding to the mask symbols and the predicted content, to obtain the trained abstract generation model.
In this embodiment, the abstract generation model predicts the probability distribution of the next token (word) from all preceding tokens in each second judgment document sample. To fit abstract generation, this training task keeps one part of the text as known text (25% to 75% of each second judgment document sample) and covers the remaining part (75% to 25% of each second judgment document sample) with mask symbols.
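Step C1 can be sketched as follows. This is an illustration under assumptions that the text does not state explicitly: the known text is taken to be the leading part of the sample (consistent with predicting the next token from preceding tokens), the kept proportion is drawn uniformly between 25% and 75%, and the mask symbol is written as `[MASK]`. The function name `mask_suffix` is hypothetical.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the application only says "mask symbol"


def mask_suffix(tokens, rng=None, min_keep=0.25, max_keep=0.75):
    """Keep a leading share of the tokens and mask the remainder.

    Returns (masked_tokens, targets), where targets are the true tokens
    hidden behind the mask symbols, in order.
    """
    rng = rng or random.Random()
    keep_ratio = rng.uniform(min_keep, max_keep)  # share of known text
    keep = int(len(tokens) * keep_ratio)
    masked = list(tokens[:keep]) + [MASK] * (len(tokens) - keep)
    targets = list(tokens[keep:])
    return masked, targets
```

The returned `targets` are what the loss in step C3 would compare against the model's predictions for the masked positions.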
As the above embodiments show, the method for generating a judgment document abstract proposed by the present application proceeds in three stages. First, the judgment document is input into the trained paragraph category recognition model to obtain the paragraph category of each paragraph, the paragraph categories including a first category (important paragraphs) and a second category (ordinary paragraphs), and the set of first-category paragraphs is taken as the paragraph set. By extracting the important paragraphs into the paragraph set, this step compresses the amount of information and prevents the input to the abstract generation model from overflowing because it is too long, which would otherwise make the generated abstract incomplete and inaccurate. Next, similarity matching is performed between each paragraph in the paragraph set and each short sentence template in the preconfigured abstract template to obtain the target short sentence template corresponding to each paragraph, which further compresses the amount of information. Finally, each paragraph in the paragraph set and its corresponding target short sentence template are input into the trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph, and the target abstract short sentences are spliced according to the position order, in the abstract template, of the target short sentence templates corresponding to the paragraphs, to obtain the abstract text corresponding to the judgment document. Splicing in template position order ensures the coherence of the abstract. The present application therefore ensures the coherence and accuracy of the judgment document abstract.
FIG. 2 is a schematic block diagram of an apparatus for generating a judgment document abstract provided by an embodiment of the present application.
The apparatus 100 for generating a judgment document abstract described in the present application may be installed in an electronic device. According to the functions implemented, the apparatus 100 may include a parsing module 110, an input module 120, a matching module 130 and a splicing module 140. A module in the present application, which may also be called a unit, refers to a series of computer program segments that can be executed by the processor of the electronic device to perform a fixed function, and that are stored in the memory of the electronic device.
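The four-module decomposition can be sketched as a simple composition of callables. This is an illustrative, non-limiting sketch: the class name `AbstractGenerator`, its field names and the call signatures are hypothetical, and each module is represented by a placeholder callable rather than by the trained models described in the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AbstractGenerator:
    """Hypothetical wiring of the four modules 110 to 140."""

    parse: Callable[[dict], str]                      # parsing module 110
    classify: Callable[[str], List[str]]              # input module 120
    match: Callable[[List[str]], Dict[int, str]]      # matching module 130
    generate_and_splice: Callable[[Dict[int, str]], str]  # splicing module 140

    def run(self, request: dict) -> str:
        document = self.parse(request)          # judgment document from request
        paragraph_set = self.classify(document)  # first-category paragraphs
        matched = self.match(paragraph_set)      # template position -> paragraph
        return self.generate_and_splice(matched)  # spliced abstract text
```

Passing the modules in as callables keeps each stage replaceable, for example to swap a stub classifier for a trained model in testing.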
In this embodiment, the functions of the modules/units are as follows:
the parsing module 110 is configured to parse a judgment document abstract generation request sent by a user through a client, and obtain the judgment document carried in the request;
the input module 120 is configured to input the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories including a first category and a second category, and to take the set of first-category paragraphs in the judgment document as a paragraph set.
现今裁判文书的长度主要分布在2000~8000字,摘要长度主要分布在200~600字,当前的中文生成模型无法容纳如此巨大的输入输出,本实施例通过段落类别识别模型抽取裁判文书中的重要段落得到段落集,以压缩输入至摘要生成模型的信息规模。At present, the length of judgment documents is mainly distributed between 2000 and 8000 words, and the length of abstracts is mainly distributed between 200 and 600 words. The current Chinese generation model cannot accommodate such a huge input and output. Paragraphs get a set of passages to compress the scale of information input to the summary generation model.
The paragraph category recognition model is a roberta-large-wwm model, which determines whether each paragraph of the input judgment document belongs to the first category (important paragraphs) or the second category (ordinary paragraphs). The roberta-large-wwm model is derived from the BERT-large model and contains 24 transformer layers, 16 attention heads and 1024 hidden units.
The training process of the paragraph category recognition model includes:
A1. Obtaining a plurality of preset indicators corresponding to the paragraph categories of judgment documents, and labeling the paragraph categories of a first judgment document sample in a first database based on the plurality of preset indicators;
In this embodiment, the preset indicators include: the relationship between the plaintiff and the defendant, the plaintiff's claims, the defendant's opinions, the focus of the dispute, the statement of and opinions on the legal facts, and the trial result. Paragraphs in the first judgment document sample that are associated with any of these six preset indicators are labeled as the first category (important paragraphs), and all other paragraphs are labeled as the second category (ordinary paragraphs).
A2. Inputting the first judgment document sample carrying the label information into the paragraph category recognition model to obtain the predicted paragraph category of each paragraph in the first judgment document sample;
A3. Determining the true paragraph category of each paragraph in the first judgment document sample based on the label information, and determining the structural parameters of the paragraph category recognition model by minimizing the loss value between the predicted paragraph categories and the true paragraph categories, to obtain the trained paragraph category recognition model.
The loss value is calculated by the following formula:
Figure PCTCN2021123175-appb-000004
where q_i is the predicted paragraph category of the i-th paragraph in the first judgment document sample, p_i is the true paragraph category of the i-th paragraph in the first judgment document sample, c is the total number of paragraphs in the first judgment document sample, and loss(q_i, p_i) is the loss value between the predicted paragraph category and the true paragraph category of the i-th paragraph in the first judgment document sample.
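The loss formula itself survives in this text only as a figure placeholder, so its exact form cannot be recovered here. As a minimal sketch, assuming the loss is a binary cross-entropy averaged over the c paragraphs of one sample (a common choice for two-class labeling, not confirmed by the source), the computation could look as follows. The function name `mean_cross_entropy` and the clamping constant `eps` are illustrative assumptions.

```python
import math


def mean_cross_entropy(predicted, actual, eps=1e-12):
    """Average binary cross-entropy over the paragraphs of one sample.

    predicted: q_i, the predicted probability that paragraph i is first-category.
    actual:    p_i, the labeled category of paragraph i (1 = first, 0 = second).
    ASSUMPTION: the cross-entropy form is a guess; the source formula is an image.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    c = len(predicted)  # total number of paragraphs in the sample
    if c == 0:
        raise ValueError("the sample must contain at least one paragraph")
    total = 0.0
    for q, p in zip(predicted, actual):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(p * math.log(q) + (1 - p) * math.log(1.0 - q))
    return total / c
```

Minimizing this quantity over the training samples pushes each q_i toward its label p_i, which matches the role the text assigns to the loss value in step A3.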
Inputting the judgment document for which an abstract is to be generated into the trained paragraph category recognition model yields, for each paragraph, the probability that it belongs to the first category. When the probability for a paragraph is greater than a preset threshold (for example, 0.7), the paragraph is assigned to the first category. The set of first-category paragraphs in the judgment document is taken as the paragraph set, and the abstract is subsequently generated from the information in the paragraph set.
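The thresholding step described above can be sketched in a few lines. The function name `select_important_paragraphs` is hypothetical, and the probabilities are assumed to come from the trained paragraph category recognition model, one per paragraph in document order.

```python
def select_important_paragraphs(paragraphs, probabilities, threshold=0.7):
    """Return the paragraph set: paragraphs whose first-category probability
    is strictly greater than the preset threshold (0.7 is the example value
    given in the text)."""
    return [
        paragraph
        for paragraph, probability in zip(paragraphs, probabilities)
        if probability > threshold
    ]
```

Because the comparison is strict, a paragraph whose probability equals the threshold exactly is left out, which follows the "greater than" wording of the text.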
By extracting the important paragraphs of the judgment document with the paragraph category recognition model, this step reduces the amount of information, prevents the input to the abstract generation model from being so long that it overflows, and keeps the input information of the abstract generation model complete, so that the abstract it generates is more accurate.
The matching module 130 is configured to perform similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
The paragraphs in the paragraph set may still contain redundant information (some paragraphs may exceed 500 characters), and they are not necessarily coherent with one another, so they cannot simply be spliced together as the abstract.
This embodiment preconfigures an abstract template (covering the six preset indicators above). An example of the abstract template is as follows: The plaintiff and the defendant are in a XXXX relationship. The plaintiff requests that the defendant be ordered to pay…; the defendant argues that the plaintiff's claims have no factual or legal basis. Upon investigation…, liability for breach of contract shall be borne. This court supports the plaintiff's above request. In accordance with Article X of the Contract Law of the People's Republic of China…, it is adjudged as follows: 1. The defendant shall pay the plaintiff the XX fee. 2. The plaintiff's other claims are rejected. If the obligation to pay money is not performed within the period specified in this judgment, interest on the debt for the period of delayed performance shall be doubled.
Performing similarity matching between each paragraph in the paragraph set and each short sentence template in the preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set, includes:
B1. Calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template;
B2. When the longest common subsequence similarity values between a specified paragraph and a plurality of short sentence templates are greater than a similarity threshold, taking the short sentence template with the highest similarity value as the target short sentence template corresponding to the specified paragraph.
The longest common subsequence similarity value is calculated by the following formulas:
Figure PCTCN2021123175-appb-000005
Figure PCTCN2021123175-appb-000006
where p_i is the i-th paragraph in the paragraph set, a_j is the j-th short sentence template in the abstract template, LCS(p_i, a_j) is the length of the longest common subsequence of the i-th paragraph and the j-th short sentence template, len(a_j) is the length of the j-th short sentence template, len(p_i) is the length of the i-th paragraph, LCSR(p_i, a_j) is the upper bound of the longest common subsequence length ratio between the i-th paragraph and the j-th short sentence template, LCSP(p_i, a_j) is the lower bound of that ratio, and LCSFscore(p_i, a_j) is the longest common subsequence similarity value between the i-th paragraph and the j-th short sentence template.
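The formulas are available here only as figure placeholders. The symbol names suggest a recall/precision/F-score construction, so the following sketch assumes LCSR = LCS / len(a_j), LCSP = LCS / len(p_i), and LCSFscore equal to the harmonic mean of the two. These three definitions are assumptions, not a reconstruction confirmed by the source, and any weighting factor the original F-score may use is unknown. The comparison is done character by character, which is also an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b,
    computed with the standard dynamic program using two rolling rows."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_fscore(paragraph, template):
    """ASSUMED form of LCSFscore(p_i, a_j): harmonic mean of
    LCSR = LCS / len(template) and LCSP = LCS / len(paragraph)."""
    if not paragraph or not template:
        return 0.0
    lcs = lcs_length(paragraph, template)
    lcsr = lcs / len(template)   # share of the template covered
    lcsp = lcs / len(paragraph)  # share of the paragraph covered
    if lcsr + lcsp == 0:
        return 0.0
    return 2 * lcsr * lcsp / (lcsr + lcsp)
```

Under these assumptions the score lies between 0 and 1 and equals 1 only when the paragraph and the template are identical, so a single similarity threshold can be applied across templates of different lengths.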
In this embodiment, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the matching module 130 is further configured to:
if the longest common subsequence similarity value between a specified paragraph and only one short sentence template is greater than the similarity threshold, take that short sentence template as the target short sentence template corresponding to the specified paragraph.
In this embodiment, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the matching module 130 is further configured to:
if the longest common subsequence similarity values between a specified paragraph and every short sentence template in the abstract template are all less than the similarity threshold, delete the specified paragraph from the paragraph set.
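The three cases described for a specified paragraph (several templates above the threshold, exactly one above it, none above it) reduce to one selection rule. The sketch below assumes the similarity values for one paragraph are already available as a list indexed by template position; the function name `choose_target_template` is hypothetical, and the text gives no concrete threshold value, so the caller must supply one.

```python
def choose_target_template(scores, threshold):
    """Return the index of the short sentence template with the highest
    similarity value above the threshold, or None if no value exceeds it
    (in which case the paragraph is dropped from the paragraph set)."""
    best_index = None
    best_score = threshold
    for index, score in enumerate(scores):
        if score > best_score:
            best_index, best_score = index, score
    return best_index
```

The text does not say what happens when a value equals the threshold exactly; this sketch treats that case as no match, which is a design choice rather than something the source specifies.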
In another embodiment of the present application, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the matching module 130 is further configured to:
if a plurality of paragraphs in the paragraph set correspond to the same short sentence template, merge the plurality of paragraphs, in the order in which they appear in the judgment document, into one new paragraph in the paragraph set.
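The merging rule can be sketched as a grouping step. The input shape, a list of (position in the judgment document, template index, paragraph text) tuples, and the function name `merge_by_template` are illustrative assumptions, as is joining the texts without a separator.

```python
def merge_by_template(matches):
    """Group matched paragraphs by target template and concatenate each
    group in the order the paragraphs appear in the judgment document.

    matches: iterable of (document_position, template_index, text) tuples.
    Returns a dict mapping template_index to the merged paragraph text.
    """
    grouped = {}
    for _, template_index, text in sorted(matches, key=lambda match: match[0]):
        grouped.setdefault(template_index, []).append(text)
    return {index: "".join(texts) for index, texts in grouped.items()}
```

After this step each short sentence template has at most one paragraph, so the abstract generation model is called once per template.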
By performing similarity matching between each paragraph in the paragraph set and each short sentence template in the abstract template, this step compresses the information further.
The splicing module 140 is configured to input each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and to splice the target abstract short sentences according to the positional order, in the abstract template, of the target short sentence template corresponding to each paragraph, to obtain the abstract text corresponding to the judgment document.
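The splicing step orders the generated short sentences by the position of their templates in the abstract template, not by the position of the source paragraphs in the judgment document. A minimal sketch, assuming the generated sentences are held in a dictionary keyed by template position (a hypothetical representation):

```python
def splice_abstract(sentences_by_template):
    """Concatenate the target abstract short sentences in ascending order
    of their template's position in the abstract template."""
    return "".join(
        sentences_by_template[position]
        for position in sorted(sentences_by_template)
    )
```

Templates for which no paragraph was matched simply have no entry in the dictionary and contribute nothing to the abstract text.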
In this embodiment, the abstract generation model is also a roberta-large-wwm model and is used to generate abstract text from paragraph information. The paragraph category recognition model and the abstract generation model in this solution are trained on different input samples and with different training objectives, so the resulting model parameters also differ.
The training process of the abstract generation model includes:
C1. Masking a preset proportion of the text content of a second judgment document sample in a second database with mask tokens to obtain a third judgment document;
C2. Inputting the third judgment document into the abstract generation model to obtain the predicted content of the masked text;
C3. Determining the structural parameters of the abstract generation model by minimizing the loss value between the true content and the predicted content corresponding to the mask tokens, to obtain the trained abstract generation model.
In this embodiment, the abstract generation model predicts the probability distribution of the next token from all preceding tokens (words) in each second judgment document sample. To suit abstract generation, this training task keeps one part of the text as known text (25% to 75% of the content of each second judgment document sample) and masks the remaining part (the other 75% to 25% of the content) with mask tokens.
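Step C1 can be illustrated with a small masking helper. The text states only that 25% to 75% of each sample is kept as known text and the rest is masked; the sketch below additionally assumes that the known part is a prefix and the masked part is the remaining suffix, which fits next-token prediction but is not stated in the source. The names `MASK` and `mask_suffix` and the uniform sampling of the kept ratio are likewise assumptions.

```python
import random

MASK = "[MASK]"  # placeholder mask token; the actual symbol is not given in the text


def mask_suffix(tokens, rng, low=0.25, high=0.75):
    """Keep a prefix covering between `low` and `high` of the tokens as known
    text and replace every remaining token with MASK.

    Returns (masked_tokens, targets), where targets are the original tokens
    hidden behind the mask symbols.
    """
    keep_ratio = rng.uniform(low, high)
    keep = int(len(tokens) * keep_ratio)
    masked_tokens = list(tokens[:keep]) + [MASK] * (len(tokens) - keep)
    targets = list(tokens[keep:])
    return masked_tokens, targets
```

Passing a seeded `random.Random` instance as `rng` makes the masking reproducible, which helps when comparing training runs.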
FIG. 3 is a schematic structural diagram of an electronic device implementing the judgment document abstract generation method provided by an embodiment of the present application.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In this embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12 and a network interface 13 that are communicatively connected to one another through a system bus. The memory 11 stores a judgment document abstract generation program 10, which can be executed by the processor 12. FIG. 3 shows only the electronic device 1 with components 11-13 and the judgment document abstract generation program 10. Those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1. The readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk or an optical disc. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the non-volatile storage medium may be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card fitted to the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed on the electronic device 1, for example the code of the judgment document abstract generation program 10 in an embodiment of the present application. The memory 11 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the electronic device 1, for example to perform control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or to process data, for example to run the judgment document abstract generation program 10.
The network interface 13 may include a wireless network interface or a wired network interface, and is used to establish a communication connection between the electronic device 1 and a client (not shown in the figure).
Optionally, the electronic device 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like. The display, which may also be called a display screen or display unit as appropriate, is used to display information processed in the electronic device 1 and to display a visual user interface.
It should be understood that the embodiments are for illustration only, and that the scope of the patent application is not limited by this structure.
The judgment document abstract generation program 10 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run by the processor 12, can implement:
parsing a judgment document abstract generation request sent by a user through a client, and obtaining the judgment document carried in the request;
inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, where the paragraph categories include a first category and a second category, and taking the set of first-category paragraphs in the judgment document as a paragraph set;
performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
inputting each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences according to the positional order, in the abstract template, of the target short sentence template corresponding to each paragraph, to obtain the abstract text corresponding to the judgment document.
Specifically, for the way in which the processor 12 implements the judgment document abstract generation program 10, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here. It should be emphasized that, to further ensure the privacy and security of the judgment document, the judgment document may also be stored in a node of a blockchain.
Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
The computer-readable storage medium stores the judgment document abstract generation program 10, which can be executed by one or more processors to implement the following steps:
parsing a judgment document abstract generation request sent by a user through a client, and obtaining the judgment document carried in the request;
inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, where the paragraph categories include a first category and a second category, and taking the set of first-category paragraphs in the judgment document as a paragraph set;
performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
inputting each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences according to the positional order, in the abstract template, of the target short sentence template corresponding to each paragraph, to obtain the abstract text corresponding to the judgment document.
In the several embodiments provided in the present application, it should be understood that the disclosed device, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is apparent to those skilled in the art that the present application is not limited to the details of the exemplary embodiments above, and that it can be implemented in other specific forms without departing from its spirit or essential characteristics.
The embodiments should therefore be regarded in all respects as illustrative and not restrictive. The scope of the present application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the present application. No reference sign in the claims should be construed as limiting the claim concerned.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
In addition, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses recited in a system claim may also be implemented by a single unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application may be modified or equivalently replaced without departing from their spirit and scope.

Claims (20)

  1. A judgment document abstract generation method, wherein the method comprises:
    parsing a judgment document abstract generation request sent by a user through a client, and obtaining the judgment document carried in the request;
    inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, wherein the paragraph categories include a first category and a second category, and taking the set of first-category paragraphs in the judgment document as a paragraph set;
    performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set;
    inputting each paragraph in the paragraph set and its corresponding target short sentence template into a trained abstract generation model to obtain the target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences according to the positional order, in the abstract template, of the target short sentence template corresponding to each paragraph, to obtain the abstract text corresponding to the judgment document.
  2. The judgment document abstract generation method according to claim 1, wherein performing similarity matching between each paragraph in the paragraph set and each short sentence template in the preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set, comprises:
    calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template;
    when the longest common subsequence similarity values between a specified paragraph and a plurality of short sentence templates are greater than a similarity threshold, taking the short sentence template with the highest similarity value as the target short sentence template corresponding to the specified paragraph.
  3. The judgment document abstract generation method according to claim 2, wherein, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the method further comprises:
    if the longest common subsequence similarity values between a specified paragraph and every short sentence template in the abstract template are all less than the similarity threshold, deleting the specified paragraph from the paragraph set.
  4. The method for generating a judgment document abstract according to claim 2, wherein, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the method further comprises:
    if a plurality of paragraphs in the paragraph set correspond to the same short sentence template, merging the plurality of paragraphs, in the order in which they appear in the judgment document, into one new paragraph in the paragraph set.
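The merging step of claim 4 can be sketched as below. The tuple layout `(position_in_document, text, template_index)` and the function name are assumptions for illustration, and plain concatenation is used because the claim does not name a separator.

```python
from typing import Dict, List, Tuple


def merge_by_template(matched: List[Tuple[int, str, int]]) -> Dict[int, str]:
    """Merge paragraphs that share a template, keeping their document order.

    matched: (position_in_document, paragraph_text, template_index) triples.
    Returns template_index -> merged paragraph text.
    """
    merged: Dict[int, str] = {}
    # Sorting by document position ensures merged text follows the original order.
    for _, text, template_index in sorted(matched, key=lambda item: item[0]):
        merged[template_index] = merged.get(template_index, "") + text
    return merged
```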
  5. The method for generating a judgment document abstract according to claim 2, wherein the longest common subsequence similarity value is calculated by the following formulas:
    Figure PCTCN2021123175-appb-100001
    Figure PCTCN2021123175-appb-100002
    where p_i is the i-th paragraph in the paragraph set; a_j is the j-th short sentence template in the abstract template; LCS(p_i, a_j) is the length of the longest common subsequence of the i-th paragraph and the j-th short sentence template; len(a_j) is the length of the j-th short sentence template; len(p_i) is the length of the i-th paragraph; LCSR(p_i, a_j) is the upper bound of the longest common subsequence length ratio between the i-th paragraph and the j-th short sentence template; LCSP(p_i, a_j) is the lower bound of that ratio; and LCSFscore(p_i, a_j) is the longest common subsequence similarity value between the i-th paragraph and the j-th short sentence template.
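The formula images referenced in claim 5 are not reproduced in this text, so the exact expressions cannot be confirmed here. The sketch below is therefore an assumption based only on the symbol definitions: it takes LCSR as LCS / len(a_j), LCSP as LCS / len(p_i), and combines them with a ROUGE-L-style F-score governed by a hypothetical weight `beta`. Treat it as one plausible reading, not as the claimed formula.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for k, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[k - 1] + 1)
            else:
                curr.append(max(prev[k], curr[k - 1]))
        prev = curr
    return prev[-1]


def lcs_fscore(paragraph: str, template: str, beta: float = 1.0) -> float:
    """Assumed LCSFscore: F-measure of the two LCS length ratios (beta is hypothetical)."""
    if not paragraph or not template:
        return 0.0
    lcs = lcs_length(paragraph, template)
    lcsr = lcs / len(template)   # assumed meaning of LCSR(p_i, a_j)
    lcsp = lcs / len(paragraph)  # assumed meaning of LCSP(p_i, a_j)
    if lcsr == 0.0 or lcsp == 0.0:
        return 0.0
    return (1 + beta ** 2) * lcsr * lcsp / (lcsr + beta ** 2 * lcsp)
```

Under this reading, a paragraph identical to the template scores 1.0 and a paragraph sharing no characters with it scores 0.0, which makes the value directly comparable against a similarity threshold.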
  6. The method for generating a judgment document abstract according to claim 1, wherein the training process of the paragraph category recognition model comprises:
    obtaining a plurality of preset indicators corresponding to the paragraph categories of judgment documents, and labeling first judgment document samples in a first database with paragraph categories based on the plurality of preset indicators;
    inputting the first judgment document samples carrying the labeling information into the paragraph category recognition model, to obtain a predicted paragraph category for each paragraph in the first judgment document samples; and
    determining the true paragraph category of each paragraph in the first judgment document samples based on the labeling information, and determining the structural parameters of the paragraph category recognition model by minimizing a loss value between the predicted paragraph categories and the true paragraph categories, to obtain the trained paragraph category recognition model.
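Claim 6 does not name the loss function that is minimized. As an illustration only, the sketch below assumes a mean cross-entropy between predicted category probabilities and the labeled categories, which is a common choice for classification; the function name and the probability-list input format are hypothetical.

```python
import math
from typing import Sequence


def mean_cross_entropy(
    predicted: Sequence[Sequence[float]], true_labels: Sequence[int]
) -> float:
    """Average negative log-probability assigned to each paragraph's true category.

    predicted[n][c] is the model's probability that paragraph n belongs to category c.
    true_labels[n] is the labeled category index of paragraph n.
    """
    if len(predicted) != len(true_labels) or not true_labels:
        raise ValueError("predicted and true_labels must be non-empty and equally long")
    total = 0.0
    for probs, label in zip(predicted, true_labels):
        total -= math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)
    return total / len(true_labels)
```

A training loop would adjust the model parameters to drive this value down; the optimizer and model architecture are outside what the claim specifies.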
  7. The method for generating a judgment document abstract according to claim 1, wherein the training process of the abstract generation model comprises:
    masking a preset proportion of the text content of second judgment document samples in a second database with a mask symbol, to obtain third judgment documents;
    inputting the third judgment documents into the abstract generation model, to obtain predicted content for the masked text; and
    determining the structural parameters of the abstract generation model by minimizing a loss value between the real content corresponding to the mask symbols and the predicted content, to obtain the trained abstract generation model.
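The masking step of claim 7 can be sketched as follows. The claim does not say how the masked positions are chosen or what the mask symbol looks like, so uniform random selection, the `"[MASK]"` string, and the `seed` parameter are all assumptions for illustration.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def mask_tokens(
    tokens: Sequence[str],
    ratio: float,
    mask_token: str = "[MASK]",
    seed: Optional[int] = None,
) -> Tuple[List[str], Dict[int, str]]:
    """Replace a preset proportion of tokens with a mask symbol.

    Returns the masked sequence and a mapping position -> original token,
    which serves as the "real content" the model is trained to predict.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    n_mask = round(len(tokens) * ratio)
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets
```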
  8. An apparatus for generating a judgment document abstract, wherein the apparatus comprises:
    a parsing module, configured to parse a judgment document abstract generation request sent by a user via a client, and obtain the judgment document carried in the request;
    an input module, configured to input the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories comprising a first category and a second category, and to take the set of paragraphs of the first category in the judgment document as a paragraph set;
    a matching module, configured to perform similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain a target short sentence template corresponding to each paragraph in the paragraph set; and
    a splicing module, configured to input each paragraph in the paragraph set, together with its corresponding target short sentence template, into a trained abstract generation model to obtain a target abstract short sentence corresponding to each paragraph in the paragraph set, and to splice the target abstract short sentences in the order in which the corresponding target short sentence templates are positioned in the abstract template, to obtain the abstract text corresponding to the judgment document.
  9. An electronic device, wherein the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a judgment document abstract generation program executable by the at least one processor, and the judgment document abstract generation program, when executed by the at least one processor, enables the at least one processor to perform the following steps:
    parsing a judgment document abstract generation request sent by a user via a client, and obtaining the judgment document carried in the request;
    inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories comprising a first category and a second category, and taking the set of paragraphs of the first category in the judgment document as a paragraph set;
    performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain a target short sentence template corresponding to each paragraph in the paragraph set; and
    inputting each paragraph in the paragraph set, together with its corresponding target short sentence template, into a trained abstract generation model to obtain a target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences in the order in which the corresponding target short sentence templates are positioned in the abstract template, to obtain the abstract text corresponding to the judgment document.
  10. The electronic device according to claim 9, wherein performing similarity matching between each paragraph in the paragraph set and each short sentence template in the preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set, comprises:
    calculating a longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template; and
    when the longest common subsequence similarity values between a specified paragraph and a plurality of short sentence templates are greater than a similarity threshold, taking the short sentence template with the highest similarity value as the target short sentence template corresponding to the specified paragraph.
  11. The electronic device according to claim 10, wherein, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the judgment document abstract generation program, when executed by the processor, further implements the following step:
    if the longest common subsequence similarity values between a specified paragraph and every short sentence template in the abstract template are all smaller than the similarity threshold, deleting the specified paragraph from the paragraph set.
  12. The electronic device according to claim 10, wherein, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the judgment document abstract generation program, when executed by the processor, further implements the following step:
    if a plurality of paragraphs in the paragraph set correspond to the same short sentence template, merging the plurality of paragraphs, in the order in which they appear in the judgment document, into one new paragraph in the paragraph set.
  13. The electronic device according to claim 10, wherein the longest common subsequence similarity value is calculated by the following formulas:
    Figure PCTCN2021123175-appb-100003
    Figure PCTCN2021123175-appb-100004
    where p_i is the i-th paragraph in the paragraph set; a_j is the j-th short sentence template in the abstract template; LCS(p_i, a_j) is the length of the longest common subsequence of the i-th paragraph and the j-th short sentence template; len(a_j) is the length of the j-th short sentence template; len(p_i) is the length of the i-th paragraph; LCSR(p_i, a_j) is the upper bound of the longest common subsequence length ratio between the i-th paragraph and the j-th short sentence template; LCSP(p_i, a_j) is the lower bound of that ratio; and LCSFscore(p_i, a_j) is the longest common subsequence similarity value between the i-th paragraph and the j-th short sentence template.
  14. The electronic device according to claim 9, wherein the training process of the paragraph category recognition model comprises:
    obtaining a plurality of preset indicators corresponding to the paragraph categories of judgment documents, and labeling first judgment document samples in a first database with paragraph categories based on the plurality of preset indicators;
    inputting the first judgment document samples carrying the labeling information into the paragraph category recognition model, to obtain a predicted paragraph category for each paragraph in the first judgment document samples; and
    determining the true paragraph category of each paragraph in the first judgment document samples based on the labeling information, and determining the structural parameters of the paragraph category recognition model by minimizing a loss value between the predicted paragraph categories and the true paragraph categories, to obtain the trained paragraph category recognition model.
  15. The electronic device according to claim 9, wherein the training process of the abstract generation model comprises:
    masking a preset proportion of the text content of second judgment document samples in a second database with a mask symbol, to obtain third judgment documents;
    inputting the third judgment documents into the abstract generation model, to obtain predicted content for the masked text; and
    determining the structural parameters of the abstract generation model by minimizing a loss value between the real content corresponding to the mask symbols and the predicted content, to obtain the trained abstract generation model.
  16. A computer-readable storage medium, wherein a judgment document abstract generation program is stored on the computer-readable storage medium, and the judgment document abstract generation program is executable by one or more processors to implement the following steps:
    parsing a judgment document abstract generation request sent by a user via a client, and obtaining the judgment document carried in the request;
    inputting the judgment document into a trained paragraph category recognition model to obtain the paragraph category of each paragraph in the judgment document, the paragraph categories comprising a first category and a second category, and taking the set of paragraphs of the first category in the judgment document as a paragraph set;
    performing similarity matching between each paragraph in the paragraph set and each short sentence template in a preconfigured abstract template, to obtain a target short sentence template corresponding to each paragraph in the paragraph set; and
    inputting each paragraph in the paragraph set, together with its corresponding target short sentence template, into a trained abstract generation model to obtain a target abstract short sentence corresponding to each paragraph in the paragraph set, and splicing the target abstract short sentences in the order in which the corresponding target short sentence templates are positioned in the abstract template, to obtain the abstract text corresponding to the judgment document.
  17. The computer-readable storage medium according to claim 16, wherein performing similarity matching between each paragraph in the paragraph set and each short sentence template in the preconfigured abstract template, to obtain the target short sentence template corresponding to each paragraph in the paragraph set, comprises:
    calculating a longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template; and
    when the longest common subsequence similarity values between a specified paragraph and a plurality of short sentence templates are greater than a similarity threshold, taking the short sentence template with the highest similarity value as the target short sentence template corresponding to the specified paragraph.
  18. The computer-readable storage medium according to claim 17, wherein, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the judgment document abstract generation program, when executed by the processor, further implements the following step:
    if the longest common subsequence similarity values between a specified paragraph and every short sentence template in the abstract template are all smaller than the similarity threshold, deleting the specified paragraph from the paragraph set.
  19. The computer-readable storage medium according to claim 17, wherein, after calculating the longest common subsequence similarity value between each paragraph in the paragraph set and each short sentence template in the abstract template, the judgment document abstract generation program, when executed by the processor, further implements the following step:
    if a plurality of paragraphs in the paragraph set correspond to the same short sentence template, merging the plurality of paragraphs, in the order in which they appear in the judgment document, into one new paragraph in the paragraph set.
  20. The computer-readable storage medium according to claim 17, wherein the longest common subsequence similarity value is calculated by the following formulas:
    Figure PCTCN2021123175-appb-100005
    Figure PCTCN2021123175-appb-100006
    where p_i is the i-th paragraph in the paragraph set; a_j is the j-th short sentence template in the abstract template; LCS(p_i, a_j) is the length of the longest common subsequence of the i-th paragraph and the j-th short sentence template; len(a_j) is the length of the j-th short sentence template; len(p_i) is the length of the i-th paragraph; LCSR(p_i, a_j) is the upper bound of the longest common subsequence length ratio between the i-th paragraph and the j-th short sentence template; LCSP(p_i, a_j) is the lower bound of that ratio; and LCSFscore(p_i, a_j) is the longest common subsequence similarity value between the i-th paragraph and the j-th short sentence template.
PCT/CN2021/123175 2020-10-12 2021-10-12 Method and apparatus for generating judgment document abstract, and electronic device and readable storage medium WO2022078308A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011087426.7A CN112182224A (en) 2020-10-12 2020-10-12 Referee document abstract generation method and device, electronic equipment and readable storage medium
CN202011087426.7 2020-10-12

Publications (1)

Publication Number Publication Date
WO2022078308A1 true WO2022078308A1 (en) 2022-04-21

Family

ID=73949353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123175 WO2022078308A1 (en) 2020-10-12 2021-10-12 Method and apparatus for generating judgment document abstract, and electronic device and readable storage medium

Country Status (2)

Country Link
CN (1) CN112182224A (en)
WO (1) WO2022078308A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127977A (en) * 2023-02-08 2023-05-16 中国司法大数据研究院有限公司 Casualties extraction method for referee document
CN116188125A (en) * 2023-03-10 2023-05-30 深圳市伙伴行网络科技有限公司 Business invitation management method and device for office building, electronic equipment and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182224A (en) * 2020-10-12 2021-01-05 深圳壹账通智能科技有限公司 Referee document abstract generation method and device, electronic equipment and readable storage medium
CN113590809A (en) * 2021-07-02 2021-11-02 华南师范大学 Method and device for automatically generating referee document abstract
CN113255319B (en) * 2021-07-02 2021-10-26 深圳市北科瑞声科技股份有限公司 Model training method, text segmentation method, abstract extraction method and device
CN113704457B (en) * 2021-07-23 2024-03-01 北京搜狗科技发展有限公司 Method and device for generating abstract and storage medium
CN113609840B (en) * 2021-08-25 2023-06-16 西华大学 Chinese law judgment abstract generation method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126620A (en) * 2016-06-22 2016-11-16 北京鼎泰智源科技有限公司 Method of Chinese Text Automatic Abstraction based on machine learning
US20180213057A1 (en) * 2017-01-26 2018-07-26 Linkedin Corporation Customized profile summaries for online social networks
CN110069623A (en) * 2017-12-06 2019-07-30 腾讯科技(深圳)有限公司 Summary texts generation method, device, storage medium and computer equipment
CN112182224A (en) * 2020-10-12 2021-01-05 深圳壹账通智能科技有限公司 Referee document abstract generation method and device, electronic equipment and readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127977A (en) * 2023-02-08 2023-05-16 中国司法大数据研究院有限公司 Casualties extraction method for referee document
CN116127977B (en) * 2023-02-08 2023-10-03 中国司法大数据研究院有限公司 Casualties extraction method for referee document
CN116188125A (en) * 2023-03-10 2023-05-30 深圳市伙伴行网络科技有限公司 Business invitation management method and device for office building, electronic equipment and storage medium
CN116188125B (en) * 2023-03-10 2024-05-31 深圳市伙伴行网络科技有限公司 Business invitation management method and device for office building, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112182224A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
WO2022078308A1 (en) Method and apparatus for generating judgment document abstract, and electronic device and readable storage medium
WO2022105115A1 (en) Question and answer pair matching method and apparatus, electronic device and storage medium
WO2022105122A1 (en) Answer generation method and apparatus based on artificial intelligence, and computer device and medium
WO2022141861A1 (en) Emotion classification method and apparatus, electronic device, and storage medium
WO2021042521A1 (en) Contract automatic generation method, computer device and computer non-volatile storage medium
WO2022048211A1 (en) Document directory generation method and apparatus, electronic device and readable storage medium
US20180285326A1 (en) Classifying and ranking changes between document versions
KR20200094627A (en) Method, apparatus, device and medium for determining text relevance
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
WO2022048210A1 (en) Named entity recognition method and apparatus, and electronic device and readable storage medium
CN111984793A (en) Text emotion classification model training method and device, computer equipment and medium
WO2020259280A1 (en) Log management method and apparatus, network device and readable storage medium
CN112183091A (en) Question and answer pair generation method and device, electronic equipment and readable storage medium
WO2021196825A1 (en) Abstract generation method and apparatus, and electronic device and medium
WO2022105496A1 (en) Intelligent follow-up contact method and apparatus, and electronic device and readable storage medium
WO2022222943A1 (en) Department recommendation method and apparatus, electronic device and storage medium
CN109101489A (en) A kind of text automatic abstracting method, device and a kind of electronic equipment
CN111414122A (en) Intelligent text processing method and device, electronic equipment and storage medium
CN111797217B (en) Information query method based on FAQ matching model and related equipment thereof
WO2023178978A1 (en) Prescription review method and apparatus based on artificial intelligence, and device and medium
WO2019085118A1 (en) Topic model-based associated word analysis method, and electronic apparatus and storage medium
JP2023517518A (en) Vector embedding model for relational tables with null or equivalent values
CN112668281A (en) Automatic corpus expansion method, device, equipment and medium based on template
WO2023178979A1 (en) Question labeling method and apparatus, electronic device and storage medium
CN113486680B (en) Text translation method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21879352

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21879352

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 180723)

122 Ep: pct application non-entry in european phase

Ref document number: 21879352

Country of ref document: EP

Kind code of ref document: A1