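WO2021051867A1 - Asset information identification method, device, computer equipment, and storage medium - Google Patents
Asset information identification method, device, computer equipment, and storage medium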
- Publication number
- WO2021051867A1 (PCT/CN2020/093110)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- asset
- paragraph
- information
- litigation
- participant
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
Definitions
- This application relates to the field of information extraction in artificial intelligence, and in particular to an asset information identification method, device, computer equipment, and storage medium.
- This application provides an asset information identification method, device, equipment and storage medium to improve the accuracy of identifying asset information from legal documents.
- In a first aspect, this application provides an asset information identification method, the method including:
- obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph;
- extracting information from the litigation participant paragraph to obtain litigation participant information;
- performing reference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph;
- extracting information from the resolved fact paragraph using a pre-trained text classification model and entity recognition model to obtain asset information.
- In a second aspect, this application also provides an asset information identification device, which includes:
- a document analysis module for obtaining a legal document and parsing it to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph;
- a litigation information module for extracting information from the litigation participant paragraph to obtain litigation participant information;
- a reference resolution module for performing reference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph;
- an information extraction module for extracting information from the resolved fact paragraph using a pre-trained text classification model and an entity recognition model to obtain asset information.
- In a third aspect, this application also provides a computer device including a memory and a processor; the memory is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions and, when executing them, to implement an asset information identification method, wherein the asset information identification method includes:
- using the pre-trained text classification model and entity recognition model to extract information from the resolved fact paragraph to obtain asset information.
- In a fourth aspect, this application also provides a computer-readable storage medium that stores computer-readable instructions; when the computer-readable instructions are executed by a processor, the processor implements an asset information identification method, wherein the asset information identification method includes the following steps:
- using the pre-trained text classification model and entity recognition model to extract information from the resolved fact paragraph to obtain asset information.
- This application discloses an asset information identification method, device, equipment and storage medium.
- By training a text classification model and an entity recognition model, the identification and extraction of asset information in legal documents is completed. The approach is more general than the traditional rule traversal method, supports automatic recognition, and improves the accuracy of information recognition.
- FIG. 1 is a schematic flowchart of the steps of a method for training a text classification model provided by an embodiment of the present application;
- FIG. 2 is a schematic flowchart of the steps of a method for training an entity recognition model provided by an embodiment of the present application;
- FIG. 3 is a schematic flowchart of the steps of a method for identifying asset information provided by an embodiment of the present application;
- FIG. 4 is a schematic flowchart of sub-steps of the asset information identification method provided in FIG. 3;
- FIG. 5 is a schematic flowchart of the steps for reference resolution of fact paragraphs;
- FIG. 6 is a schematic flowchart of sub-steps of the asset information identification method provided in FIG. 3;
- FIG. 7 is a schematic block diagram of an asset information identification device provided by an embodiment of the present application;
- FIG. 8 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
- the embodiments of the present application provide an asset information identification method, device, computer equipment, and storage medium, which involve the field of information extraction in artificial intelligence.
- the asset information identification method can be used to identify and extract asset information in documents to improve the accuracy of information identification.
- a document here refers to a document with a specific format. The following takes a legal document as an example for detailed explanation.
- FIG. 1 is a schematic flowchart of a method for training a text classification model provided by an embodiment of the present application.
- the text classification model is obtained by model training based on a convolutional neural network; it can also be obtained by training with other networks.
- the text classification model is a TextCNN text classification model.
- TextCNN applies a convolutional neural network (CNN) to the text classification task and uses multiple convolution kernels of different sizes to extract local features of the text.
- the text is converted into a fixed-dimensional feature vector, and a classifier is trained on this feature vector. Because the wording patterns of legal documents are relatively regular, a shallow text classification model of this kind is suitable.
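The way kernels of different sizes turn a sentence of any length into a fixed-dimensional vector can be sketched as follows. This is a minimal pure-Python illustration, not the applicant's implementation: the embedding size, the kernel widths (2, 3, 4) and the random untrained weights are all assumptions made for the example.

```python
import random

def max_pooled_activation(embedded, kernel):
    """Slide one kernel (width x dim weights) over the sentence and keep the largest ReLU activation."""
    width, dim = len(kernel), len(kernel[0])
    # Zero-pad short sentences so that at least one window exists.
    padded = embedded + [[0.0] * dim] * max(0, width - len(embedded))
    best = 0.0  # a ReLU output is never below zero
    for start in range(len(padded) - width + 1):
        window = padded[start:start + width]
        activation = sum(w * x
                         for k_row, e_row in zip(kernel, window)
                         for w, x in zip(k_row, e_row))
        best = max(best, activation)
    return best

def textcnn_features(embedded, kernels):
    """Return one pooled value per kernel: a fixed-length vector for any sentence length."""
    return [max_pooled_activation(embedded, k) for k in kernels]

rng = random.Random(0)
DIM = 4  # hypothetical embedding size
# Hypothetical kernels: two each of widths 2, 3 and 4 (random, untrained weights).
kernels = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(width)]
           for width in (2, 3, 4) for _ in range(2)]
short_sentence = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
long_sentence = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(12)]

short_vec = textcnn_features(short_sentence, kernels)
long_vec = textcnn_features(long_sentence, kernels)
print(len(short_vec), len(long_vec))  # both vectors have one entry per kernel
```

In a trained model the pooled vector would feed a classifier layer; only the feature-extraction step is shown here.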
- the training method of the text classification model specifically includes: step S101 to step S103.
- the first asset key sentence refers to a sentence that includes asset keywords.
- the asset keywords can be buildings, real estate, houses, housing, guarantees, bonds, deposits, etc. Sentences in legal documents that include asset keywords are selected and used as first asset key sentences to train the text classification model.
- asset classification categories can include two major categories, namely asset categories and non-asset categories.
- asset categories include six sub-categories, specifically real estate, land, vehicles, deposits, loans, and guarantees.
- the number of sample data can be 20,000.
- the ratio of real estate : land : vehicle : deposit : loan : guarantee : non-asset category can be 2:1:1:1:2:2:1.
- each first asset key sentence is labeled with the category identifier of its asset classification category, and the labeled sentences are then drawn by category to construct the sample data.
- sentences can be drawn according to the proportion of asset and non-asset categories.
- S103: Based on the convolutional neural network, perform model training and verification on the sample data to obtain a text classification model, and use it as the pre-trained text classification model.
- the training set is used to perform model training based on the convolutional neural network to obtain a text classification model
- the verification set is used to verify the accuracy of the obtained text classification model.
- the ratio of the training set and the validation set can be 7:3, and the ratio of each asset category and non-asset category in the training set and the validation set is the same as the ratio in the sample data.
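The sample sizes and split described above can be worked through in a few lines. The sketch below assumes the 20,000 samples and the 2:1:1:1:2:2:1 ratio from the text; the label names, the placeholder sentences and the helper functions `quota` and `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict

# Category ratio from the text; the label strings are illustrative names.
RATIO = {"real_estate": 2, "land": 1, "vehicle": 1, "deposit": 1,
         "loan": 2, "guarantee": 2, "non_asset": 1}

def quota(total, ratio):
    """Number of samples per category (any remainder of the division is dropped)."""
    unit = total // sum(ratio.values())
    return {label: unit * share for label, share in ratio.items()}

def stratified_split(samples, train_parts=7, total_parts=10, seed=0):
    """Split per category so both sets keep the category ratio of the sample data."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sentence, label in samples:
        by_label[label].append((sentence, label))
    train, val = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = len(items) * train_parts // total_parts
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val

counts = quota(20000, RATIO)
# Placeholder sentences stand in for labeled first asset key sentences.
samples = [(f"{label} sentence {i}", label)
           for label, n in counts.items() for i in range(n)]
train, val = stratified_split(samples)
print(counts["real_estate"], len(train), len(val))
```

With these numbers each ratio unit is 2,000 sentences, so real estate, loans and guarantees get 4,000 each and the remaining categories 2,000 each.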
- the constructed sample data is used to train the text classification model with a convolutional neural network: the training set is used for training and the verification set is used to verify the trained model, finally yielding the text classification model.
- the training method provided in the above embodiment obtains the first asset key sentences, classifies them according to the asset classification categories to obtain sample data, and finally trains a model on the constructed sample data based on the convolutional neural network to obtain the text classification model.
- the text classification model can be applied to the asset information identification method, thereby improving the accuracy and versatility of asset information identification.
- FIG. 2 is a schematic flowchart of an entity recognition model training method provided by an embodiment of the present application.
- the entity recognition model is obtained by model training based on a long short-term memory (LSTM) network; it can also be trained with other networks.
- the entity recognition model is a BiLSTM+CRF entity recognition model.
- BiLSTM uses a bidirectional long short-term memory network to obtain the score of each character for each entity tag. The CRF learns the constraints between entity tags from the training data, and the entity tag of each character is finally obtained to realize entity recognition.
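How tag-to-tag constraints change the final tags can be illustrated with Viterbi decoding, the standard way to pick the best tag sequence in a linear-chain CRF. The sketch below is an assumption-laden toy: the per-character scores stand in for BiLSTM outputs, the tag name `OWNER` is hypothetical, and the single hand-written penalty stands in for transition scores that a real CRF would learn.

```python
def viterbi(emissions, transitions, tags):
    """Best tag path given per-character tag scores and tag-to-tag transition scores."""
    best = {t: emissions[0][t] for t in tags}
    back = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

TAGS = ["O", "B-OWNER", "I-OWNER"]
# Hypothetical scores for three characters (stand-ins for BiLSTM outputs).
emissions = [
    {"O": 1.0, "B-OWNER": 0.2, "I-OWNER": 0.1},
    {"O": 0.3, "B-OWNER": 0.5, "I-OWNER": 0.6},
    {"O": 0.2, "B-OWNER": 0.1, "I-OWNER": 0.9},
]
# One constraint: an I- tag must not directly follow O.
transitions = {("O", "I-OWNER"): -10.0}

greedy = [max(TAGS, key=lambda t: e[t]) for e in emissions]
decoded = viterbi(emissions, transitions, TAGS)
print(greedy)   # per-character argmax ignores the constraint
print(decoded)  # the constrained path starts the entity with a B- tag
```

Picking the highest-scoring tag per character yields an entity that starts with `I-OWNER`, which the BIO scheme forbids; decoding with the transition penalty repairs this.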
- the training method of the entity recognition model specifically includes: step S201 to step S203.
- the second asset key sentence refers to a sentence that includes asset keywords, asset attributes, and asset owners.
- the asset attribute can be movable property, real property, etc.
- the asset owner refers to the party that owns the asset.
- the asset attributes included in the second asset key sentence may be stated directly in the sentence or inferred from the asset keywords included in the sentence.
- for example, the second asset key sentence could be "The plaintiff spent 6 million yuan to purchase a property from the defendant; the property is real estate located in Songjiang District, Shanghai." It could also be "The plaintiff spent 6 million yuan to buy a property from the defendant."
- S202: Label the asset keywords, asset attributes, and asset owners respectively to construct sample data.
- the BIO tagging scheme can be used to label asset keywords, asset attributes, and asset owners:
- B-<entity tag name> indicates the first character of an entity;
- I-<entity tag name> indicates a non-first character of an entity;
- O indicates a non-entity part.
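A small example makes the BIO scheme concrete. The function below is a sketch that assumes entity spans are already known as (start, end, tag name) triples; the tag names `OWNER` and `ASSET` and the sample sentence are illustrative, not taken from the application.

```python
def bio_tags(text, spans):
    """Label each character: B- for the first character of an entity, I- for the rest, O elsewhere."""
    tags = ["O"] * len(text)
    for start, end, name in spans:  # end is exclusive
        tags[start] = f"B-{name}"
        for i in range(start + 1, end):
            tags[i] = f"I-{name}"
    return tags

# "原告购买房产" means "The plaintiff purchased real estate".
sentence = "原告购买房产"
spans = [(0, 2, "OWNER"), (4, 6, "ASSET")]  # hypothetical annotations
tags = bio_tags(sentence, spans)
for char, tag in zip(sentence, tags):
    print(char, tag)
```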
- S203: Based on the long short-term memory network, perform model training and verification on the sample data to obtain an entity recognition model, and use it as the pre-trained entity recognition model.
- before model training and verification are performed on the sample data, the method may further include: dividing the sample data according to a preset ratio to obtain a training set and a verification set.
- the training set is used to perform model training based on the long short-term memory network to obtain an entity recognition model
- the verification set is used to verify the accuracy of the obtained entity recognition model.
- the ratio of training set and validation set can be 7:3.
- the constructed sample data is used to train the entity recognition model with the long short-term memory network: the training set is used for training and the verification set is used to verify the trained model.
- the training method provided in the foregoing embodiment obtains the second asset key sentences, labels the asset keywords, asset attributes, and asset owners in them to construct sample data, and finally trains a model on the constructed sample data based on the long short-term memory network to obtain an entity recognition model. The entity recognition model can be applied to the asset information identification method, thereby improving the accuracy and generality of asset information identification.
- the asset information identification method can be applied to a terminal or a server, so the trained text classification model and entity recognition model need to be saved in the terminal or the server.
- the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device;
- the server can be an independent server or a server cluster.
- the compression processing specifically includes pruning, quantization, and Huffman encoding of the text classification model and entity recognition model, to reduce their size so that they can be saved in a terminal with small storage capacity.
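The three compression steps named above can be sketched on a toy weight list. This is only an illustration of the general techniques under assumed settings (the pruning threshold, the quantization step and the weights themselves are made up); the application does not specify how the steps are configured.

```python
import heapq
from collections import Counter

def prune(weights, threshold):
    """Set weights with small magnitude to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def quantize(weights, step):
    """Map each weight to an integer code (the nearest multiple of step)."""
    return [round(w / step) for w in weights]

def huffman_code_lengths(symbols):
    """Code length in bits per symbol; frequent symbols get shorter codes."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

weights = [0.01, -0.02, 0.5, 0.52, -0.49, 0.0, 0.03, 1.0]  # toy values
codes = quantize(prune(weights, threshold=0.05), step=0.5)
lengths = huffman_code_lengths(codes)
total_bits = sum(lengths[c] for c in codes)
print(codes, total_bits)
```

Here the four distinct codes would need 2 bits each with a fixed-width encoding (16 bits in total), while the Huffman code lengths add up to 14 bits because the zero code dominates after pruning.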
- FIG. 3 is a schematic flowchart of an asset information identification method provided by an embodiment of the present application.
- the asset information identification method can be applied to a terminal or a server to identify and extract asset information in legal documents.
- the asset information identification method specifically includes steps S301 to S304.
- the format of the obtained legal document may be PDF format, or DOC or DOCX format.
- the litigation participant paragraph refers to the paragraph that includes the plaintiff's basic information and the defendant's basic information.
- both the plaintiff's and the defendant's basic information can include one or more of: name, company name, domicile, registered place, place of business, gender, and date of birth.
- a factual paragraph refers to a paragraph that includes information such as the cause of the case and facts of the case as found by the trial court.
- parsing the legal document to obtain the target paragraphs specifically includes: matching writing keywords against the legal document to segment it and obtain the target paragraphs.
- writing keywords are keywords commonly used when writing the different paragraphs of various types of legal documents, such as: plaintiff, lawyer, authorized agent, legal representative, registered place, place of business, "ascertained by the trial", "this court holds", "end of trial", and so on.
- a writing keyword database can be established from the writing keywords, so that the writing keywords in the database can be used to parse legal documents.
- for example, when the writing keyword "case number" is matched at the beginning of the first paragraph of a legal document and the writing keyword "end of trial" is matched at the end of the sixth paragraph, the second to fifth paragraphs are determined to be litigation participant paragraphs and are taken as target paragraphs.
- likewise, when the writing keyword "ascertained by the trial" is matched at the beginning of the eighth paragraph and a closing writing keyword is matched in the tenth paragraph, the eighth to tenth paragraphs are determined to be fact paragraphs and are taken as target paragraphs.
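Keyword-based segmentation of this kind can be sketched as follows. The English keyword strings, the sample paragraphs and the choice of "This court holds" as the marker that closes the fact section are assumptions for illustration; the source does not name the closing keyword, and real documents would be matched against the writing keyword database.

```python
def first_index(paragraphs, keyword, where):
    """0-based index of the first paragraph that starts or ends with keyword, else None."""
    for i, paragraph in enumerate(paragraphs):
        text = paragraph.strip()
        if where == "start" and text.startswith(keyword):
            return i
        if where == "end" and text.endswith(keyword):
            return i
    return None

def segment(paragraphs):
    """Return (litigation participant paragraphs, fact paragraphs)."""
    head = first_index(paragraphs, "Case number", "start")
    tail = first_index(paragraphs, "end of trial.", "end")
    facts_start = first_index(paragraphs, "Ascertained by the trial", "start")
    facts_stop = first_index(paragraphs, "This court holds", "start")  # assumed marker
    participants, facts = [], []
    if head is not None and tail is not None:
        participants = paragraphs[head + 1:tail]  # strictly between the two markers
    if facts_start is not None and facts_stop is not None:
        facts = paragraphs[facts_start:facts_stop]
    return participants, facts

document = [
    "Case number: (2019) X 0001",
    "Plaintiff: Zhang San, male.",
    "Defendant: Li Si, male.",
    "The case was heard in public and has reached the end of trial.",
    "Ascertained by the trial: the plaintiff bought a house from the defendant.",
    "The defendant did not deliver the house.",
    "This court holds that the claim is supported.",
]
participants, facts = segment(document)
print(participants)
print(facts)
```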
- the litigation participant information includes the name of the litigation participant and the trial status of the litigation participant, where the trial status refers to whether the litigation participant is the defendant or the plaintiff in the case.
- step S302 specifically includes: step S302a and step S302b.
- S302a: Match name keywords in the litigation participant paragraph to obtain a target sentence that matches the name keywords.
- name keywords are pronoun-like designations used to refer to specific persons or entities.
- name keywords may include: plaintiff, authorized agent, legal representative, lawyer, and so on. After the litigation participant paragraph is obtained, it is matched against the name keywords, the sentence that matches a name keyword is identified, and that sentence is used as the target sentence.
- a regular expression can extract substrings from a character string using a predetermined composition rule, so that specific text in the document can be found.
- the target sentence containing name keywords is obtained from the litigation participant paragraph, and the name and trial status of the litigation participant are then extracted from the target sentence using a regular expression, which improves the speed and efficiency of extracting litigation participant information from the litigation participant paragraph.
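A regular expression for this extraction step might look like the sketch below. The pattern, the English role words and the function name `extract_participant` are assumptions; it expects sentences of the form "Role: Name, further details".

```python
import re

# Role word, a colon (ASCII or full-width), then the name up to the first comma.
ROLE_PATTERN = re.compile(r"^(?P<status>Plaintiff|Defendant)\s*[:：]\s*(?P<name>[^,，]+)")

def extract_participant(sentence):
    """Return the participant's name and trial status, or None if the sentence does not match."""
    match = ROLE_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return {"name": match.group("name").strip(), "status": match.group("status")}

target = "Plaintiff: Zhang San, male, born on May 12, 1970, living in xx Street xx Lane xx."
info = extract_participant(target)
print(info)
```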
- S303: Perform reference resolution on the fact paragraph according to the litigation participant information to obtain the resolved fact paragraph.
- reference resolution means determining which noun each pronoun used in a paragraph refers to and replacing the pronoun with that noun.
- the fact paragraph is resolved to obtain a complete fact paragraph.
- step S303 specifically includes step S303a and step S303b.
- the correspondence between the name of the litigation participant and the corresponding trial status is established.
- for example, the target sentence is: "Plaintiff: Zhang San, male, born on May 12, 1970, living in xx Street xx Lane xx."
- the trial status is: plaintiff
- the correspondence between the plaintiff and Zhang San is established.
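Once the correspondence between trial status and name exists, the replacement step reduces to a lookup. The sketch below works on English text and assumes a simple word-boundary pattern; the defendant's name "Li Si" and the function name are hypothetical.

```python
import re

def resolve_references(fact_paragraph, participants):
    """Replace trial-status terms (optionally preceded by 'the') with the participant's name."""
    resolved = fact_paragraph
    for status, name in participants.items():
        pattern = re.compile(rf"\b(?:the\s+)?{re.escape(status)}\b", re.IGNORECASE)
        resolved = pattern.sub(name, resolved)
    return resolved

# Correspondence between trial status and name.
participants = {"plaintiff": "Zhang San", "defendant": "Li Si"}
fact = "The plaintiff spent 6 million yuan to buy a property from the defendant."
resolved = resolve_references(fact, participants)
print(resolved)
```

After this step every sentence in the fact paragraph names the parties directly, so the later entity recognition step can read asset owners from the sentence itself.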
- asset information includes asset owners, related parties, and asset attributes.
- the asset attribute may be predefined, for example, movable property, real property, etc.
- the related party may be a third party that is related to the owner of the asset.
- step S304 specifically includes step S304a to step S304c.
- S304a: Match asset keywords in the resolved fact paragraph to obtain initial asset key sentences that match the asset keywords.
- asset keywords can be compiled manually.
- asset keywords can be, for example, building, real estate, house, housing, business building, commercial and residential building, commercial building, storefront, land use right, land, homestead, vehicles, guarantees, bonds, deposits, etc.
- a sentence that matches an asset keyword is used as an initial asset key sentence, and sentences that do not match any asset keyword can be filtered out directly.
- S304b: Use the pre-trained text classification model to filter the initial asset key sentences to obtain target asset key sentences.
- S304c: Perform asset information identification on the target asset key sentences based on the pre-trained entity recognition model to obtain asset owners, related parties, and asset attributes.
- the target asset key sentence is a sentence including asset keywords and asset attributes.
- the asset owner, related parties, and asset attributes mentioned in the target asset key sentence can be obtained.
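Steps S304a to S304c form a three-stage filter, sketched below. The two models are passed in as plain callables; the stand-ins used here (`fake_classifier`, `fake_recognizer`), the keyword list and the sample paragraph are all hypothetical and only show how the stages connect.

```python
import re

ASSET_KEYWORDS = ("house", "real estate", "land", "vehicle", "deposit", "guarantee")

def split_sentences(paragraph):
    """Split after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.;!?])\s+", paragraph) if s.strip()]

def match_initial(paragraph, keywords=ASSET_KEYWORDS):
    """S304a: keep sentences that contain at least one asset keyword."""
    return [s for s in split_sentences(paragraph)
            if any(k in s.lower() for k in keywords)]

def extract_assets(paragraph, classify, recognize):
    """S304b and S304c: drop sentences classified as non-asset, then run entity recognition."""
    results = []
    for sentence in match_initial(paragraph):
        if classify(sentence) == "non_asset":
            continue
        results.append(recognize(sentence))
    return results

# Stand-ins for the pre-trained models.
def fake_classifier(sentence):
    return "non_asset" if "rules" in sentence else "real_estate"

def fake_recognizer(sentence):
    return {"sentence": sentence}

paragraph = ("Zhang San bought a house from Li Si. "
             "Li Si said the weather was fine. "
             "The house rules were posted at the gate.")
assets = extract_assets(paragraph, fake_classifier, fake_recognizer)
print(assets)
```

The third sentence shows why the classifier is needed: it contains the keyword "house" but does not describe an asset, so keyword matching alone would keep it.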
- the above asset information identification method parses the obtained legal document to obtain the litigation participant paragraph and the fact paragraph; extracts information from the litigation participant paragraph to obtain the litigation participant information; uses the litigation participant information to perform reference resolution on the fact paragraph to obtain the resolved fact paragraph; and finally uses the pre-trained text classification model and entity recognition model to extract information from the resolved fact paragraph to obtain asset information.
- by training the text classification model and the entity recognition model and applying them on top of keyword matching, the method completes the identification and extraction of asset information in legal documents. It is more general than the traditional rule traversal method, supports automatic identification, and improves the accuracy of asset information identification.
- FIG. 7 is a schematic block diagram of an asset information identification device according to an embodiment of the present application.
- the asset information identification device is used to execute the aforementioned asset information identification method.
- the asset information identification device can be configured in a server or a terminal.
- the server can be an independent server or a server cluster.
- the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
- the asset information identification device 400 includes: a document analysis module 401, a litigation information module 402, a reference resolution module 403, and an information extraction module 404.
- the document analysis module 401 is configured to obtain a legal document and analyze the legal document to obtain a target paragraph.
- the target paragraph includes a litigation participant paragraph and a fact paragraph.
- the document analysis module 401 is specifically configured to match the legal document according to writing keywords to segment the legal document to obtain the target paragraph.
- the litigation information module 402 is used to extract information from the paragraph of the litigation participant to obtain the litigation participant information.
- the litigation information module 402 includes a name matching sub-module 4021 and a regular-expression acquisition sub-module 4022.
- the name matching sub-module 4021 is used to match name keywords in the litigation participant paragraph to obtain the target sentence that matches the name keywords;
- the regular-expression acquisition sub-module 4022 is used to obtain the name and trial status of the litigation participant from the target sentence using a regular expression, and to use the name and trial status as the litigation participant information.
- the reference resolution module 403 is used to perform reference resolution on the fact paragraph based on the litigation participant information to obtain the resolved fact paragraph.
- the reference resolution module 403 includes a correspondence establishment sub-module 4031 and a pronoun replacement sub-module 4032.
- the correspondence establishment sub-module 4031 is used to establish the correspondence between the name of the litigation participant and the trial status;
- the pronoun replacement sub-module 4032 is used to replace the trial status pronouns in the fact paragraph based on the correspondence, completing the reference resolution of the fact paragraph.
- the information extraction module 404 is configured to use a pre-trained text classification model and an entity recognition model to extract information from the resolved fact paragraph to obtain asset information.
- the information extraction module 404 includes an initial matching sub-module 4041, a sentence filtering sub-module 4042, and an information recognition sub-module 4043.
- the initial matching sub-module 4041 is used to match asset keywords in the resolved fact paragraph to obtain the initial asset key sentences that match the asset keywords;
- the sentence filtering sub-module 4042 is used to filter the initial asset key sentences with the pre-trained text classification model to obtain the target asset key sentences;
- the information recognition sub-module 4043 is used to perform asset information identification on the target asset key sentences based on the pre-trained entity recognition model to obtain asset owners, related parties, and asset attributes.
- the above asset information identification device may be implemented in a form of computer readable instructions, and the computer readable instructions may run on the computer equipment as shown in FIG. 8.
- FIG. 8 is a schematic block diagram of the structure of a computer device provided by an embodiment of the present application.
- the computer equipment can be a server or a terminal.
- the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
- the non-volatile storage medium can store an operating system and computer readable instructions.
- the computer-readable instructions include program instructions, and when the program instructions are executed, the processor can execute the asset information identification method shown in any of the foregoing embodiments.
- the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
- the internal memory provides an environment for the operation of computer-readable instructions in the non-volatile storage medium.
- the processor can execute the asset information identification method shown in any of the above embodiments.
- the network interface is used for network communication, such as sending assigned tasks.
- FIG. 8 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied.
- a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
- the processor may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- the general-purpose processor may be a microprocessor or any conventional processor.
- the processor is configured to run computer-readable instructions stored in the memory to implement the following steps:
- performing reference resolution on the fact paragraph according to the litigation participant information to obtain the resolved fact paragraph;
- using the pre-trained text classification model and entity recognition model to extract information from the resolved fact paragraph to obtain asset information.
- the embodiments of the present application also provide a computer-readable storage medium.
- the computer-readable storage medium stores computer-readable instructions.
- the computer-readable storage medium may be non-volatile or volatile. The computer-readable instructions include program instructions, and the processor executes the program instructions to implement the asset information identification method shown in any of the foregoing embodiments of the present application.
- the asset information identification method includes the following steps:
- using the pre-trained text classification model and entity recognition model to extract information from the resolved fact paragraph to obtain asset information.
- the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, for example, the hard disk or memory of the computer device.
- the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Tourism & Hospitality (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Economics (AREA)
- General Health & Medical Sciences (AREA)
- Technology Law (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
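This application relates to the field of data processing in artificial intelligence and discloses an asset information identification method, device, computer equipment, and storage medium. The method includes: obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph; extracting information from the litigation participant paragraph to obtain litigation participant information; performing reference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph; and extracting information from the resolved fact paragraph using a pre-trained text classification model and entity recognition model to obtain asset information. By training a text classification model and an entity recognition model, the identification and extraction of asset information in legal documents is completed, which is more general than the traditional rule traversal method, enables automatic identification, and improves the accuracy of information identification.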
Description
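This application claims priority to Chinese patent application No. 201910882814.5, filed with the China Patent Office on September 18, 2019 and entitled "资产信息识别方法、装置、计算机设备及存储介质" (Asset information identification method, apparatus, computer device, and storage medium), the entire contents of which are incorporated herein by reference.
This application relates to the field of information extraction in artificial intelligence, and in particular to an asset information identification method, apparatus, computer device, and storage medium.
Legal documents contain a large number of asset clues. Analyzing these clues reveals the past asset disputes of a company or an individual, as well as the court's findings on the disputed assets. Such clues are of great significance for tasks such as recovering non-performing assets from companies and rating the risk of listed companies. The existing approach applies a rule traversal method to search the full text of a legal document for asset clues, but the inventors realized that the rule traversal method has low accuracy when identifying clues.
Therefore, how to improve the accuracy of identifying asset information in legal documents has become an urgent problem to be solved.
This application provides an asset information identification method, apparatus, device, and storage medium to improve the accuracy of identifying asset information in legal documents.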
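To achieve the above objective, in a first aspect, this application provides an asset information identification method, the method including:
obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph;
extracting information from the litigation participant paragraph to obtain litigation participant information;
performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph;
extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.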
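In a second aspect, this application further provides an asset information identification apparatus, the apparatus including:
a document parsing module, configured to obtain a legal document and parse the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph;
a litigation information module, configured to extract information from the litigation participant paragraph to obtain litigation participant information;
a coreference resolution module, configured to perform coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph;
an information extraction module, configured to extract information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.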
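In a third aspect, this application further provides a computer device including a memory and a processor; the memory is configured to store computer-readable instructions; the processor is configured to execute the computer-readable instructions and, when executing them, implement an asset information identification method, where the asset information identification method includes:
obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph;
extracting information from the litigation participant paragraph to obtain litigation participant information;
performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph;
extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.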
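In a fourth aspect, this application further provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to implement an asset information identification method, where the asset information identification method includes the following steps:
obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph;
extracting information from the litigation participant paragraph to obtain litigation participant information;
performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph;
extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.
This application discloses an asset information identification method, apparatus, device, and storage medium. By training a text classification model and an entity recognition model to identify and extract asset information from legal documents, the method is more general than the traditional rule traversal method, supports automatic identification, and improves the accuracy of information identification.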
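FIG. 1 is a schematic flowchart of the steps of a training method for a text classification model provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the steps of a training method for an entity recognition model provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of the steps of an asset information identification method provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of sub-steps of the asset information identification method provided in FIG. 3;
FIG. 5 is a schematic flowchart of the steps of performing coreference resolution on a fact paragraph;
FIG. 6 is a schematic flowchart of sub-steps of the asset information identification method provided in FIG. 3;
FIG. 7 is a schematic block diagram of an asset information identification apparatus provided by an embodiment of this application;
FIG. 8 is a schematic structural block diagram of a computer device provided by an embodiment of this application.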
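To solve the above problem, embodiments of this application provide an asset information identification method, apparatus, computer device, and storage medium, which relate to the field of information extraction in artificial intelligence. The asset information identification method can be used to identify and extract asset information from documents and improves the accuracy of information identification. Here, "document" refers to a document with a specific format; the following description takes legal documents as an example.
Some implementations of this application are described in detail below with reference to the accompanying drawings. Where no conflict arises, the following embodiments and the features in them may be combined with one another.
Refer to FIG. 1, a schematic flowchart of a training method for a text classification model provided by an embodiment of this application. The text classification model is obtained by model training based on a convolutional neural network, although it may of course also be trained with other networks.
It should be noted that in this embodiment the text classification model is a TextCNN text classification model. TextCNN applies a convolutional neural network (CNN) to the text classification task: it uses multiple convolution kernels of different sizes to extract local features of the text, converts the text into a fixed-dimensional feature vector, and trains a classifier on that feature vector. Because legal documents follow fairly distinct patterns of expression, this kind of shallow text classification model is well suited to them.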
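The way kernels of different sizes yield a fixed-dimensional feature vector can be illustrated with a small pure-Python sketch. It is only an illustration under stated assumptions: the embeddings and kernel weights are made-up numbers rather than trained values, and the function names `conv_max_pool` and `textcnn_features` are hypothetical, not taken from the application.

```python
# Toy illustration only: embeddings and kernels are made-up, not trained.

def conv_max_pool(embeddings, kernel):
    """Slide one kernel (width = len(kernel)) over the token embeddings,
    apply ReLU, and keep the maximum response (max-over-time pooling)."""
    width = len(kernel)
    responses = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        score = sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        responses.append(max(score, 0.0))  # ReLU
    return max(responses) if responses else 0.0


def textcnn_features(embeddings, kernels):
    """One pooled value per kernel, so the output length equals len(kernels)
    no matter how many tokens the sentence has."""
    return [conv_max_pool(embeddings, k) for k in kernels]


# Hypothetical 2-dimensional embeddings for sentences of 3 and 5 tokens.
short_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
long_sentence = short_sentence + [[0.5, 0.5], [0.0, 2.0]]

# Two kernels of width 2 and one kernel of width 3 (different sizes).
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]

short_features = textcnn_features(short_sentence, kernels)
long_features = textcnn_features(long_sentence, kernels)
```

Both sentences map to a vector with one entry per kernel, which is what allows a single classifier layer to be trained on sentences of varying length.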
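As shown in FIG. 1, the training method for the text classification model specifically includes steps S101 to S103.
S101. Obtain first asset key sentences, a first asset key sentence being a sentence that includes an asset keyword.
A first asset key sentence is a sentence that includes an asset keyword. Asset keywords may be, for example, 楼房 (building), 房产 (real property), 房地产 (real estate), 房屋 (house), 住房 (housing), 保证 (guarantee), 债券 (bond), and 存款 (deposit). Sentences in legal documents that include an asset keyword are selected as first asset key sentences and used to train the text classification model.
S102. Label the first asset key sentences with the category identifiers corresponding to the asset classification categories to construct sample data.
Specifically, the asset classification categories may include two major classes, the asset class and the non-asset class, where the asset class further includes six subclasses: real property, land, vehicles, deposits, loans, and guarantees. In a specific implementation, the sample data may contain 20,000 items, and the ratio of real property : land : vehicles : deposits : loans : guarantees : non-asset may be 2:1:1:1:2:2:1.
Specifically, the first asset key sentences are labeled with the category identifiers corresponding to the asset classification categories, so that each sentence is assigned to a category, and first asset key sentences are then drawn per category to construct the sample data. In a specific implementation, the sentences may be drawn according to the ratio between the asset categories and the non-asset category.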
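The example ratio can be turned into per-category sample counts with simple arithmetic. The sketch below uses the 20,000-sample total and the 2:1:1:1:2:2:1 ratio from the text; the English category keys and the `allocate` helper are hypothetical names chosen for illustration.

```python
def allocate(total, ratio):
    """Split `total` samples across categories in proportion to `ratio`."""
    parts = sum(ratio.values())
    return {name: total * share // parts for name, share in ratio.items()}


# Ratio from the text; the dictionary keys are illustrative labels.
ratio = {
    "real_property": 2,
    "land": 1,
    "vehicle": 1,
    "deposit": 1,
    "loan": 2,
    "guarantee": 2,
    "non_asset": 1,
}

counts = allocate(20000, ratio)
```

With ten ratio parts in total, each part corresponds to 2,000 sentences, so the six asset subclasses together account for 18,000 samples and the non-asset class for 2,000.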
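S103. Based on a convolutional neural network, perform model training and validation on the sample data to obtain a text classification model, and use the text classification model as the pre-trained text classification model.
In a specific implementation, before model training and validation on the sample data, the method may further include: dividing the sample data according to a preset ratio to obtain a training set and a validation set. The training set is used for CNN-based model training to obtain the text classification model, and the validation set is used to verify the accuracy of the resulting model. The ratio of the training set to the validation set may be 7:3, and the proportions of the asset subclasses and the non-asset class in the training set and the validation set are the same as in the sample data.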
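A split that keeps the class proportions identical in both sets is usually called a stratified split. The following is a minimal sketch of a 7:3 stratified split using only the standard library; the function name `stratified_split`, the fixed seed, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_fraction=0.7, seed=0):
    """Split (text, label) pairs so each label keeps the same share
    in the training set and the validation set."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, valid = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


# Toy data: 10 "asset" sentences and 20 "non_asset" sentences.
samples = [(f"a{i}", "asset") for i in range(10)]
samples += [(f"n{i}", "non_asset") for i in range(20)]

train_set, valid_set = stratified_split(samples)
```

Splitting each label group separately, rather than shuffling the whole data set once, is what guarantees that the 1:2 label ratio of the toy data is preserved on both sides of the split.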
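Specifically, the constructed sample data is used to train the text classification model with a convolutional neural network: the model is trained on the training set and verified on the validation set, finally yielding the text classification model.
In the training method provided by the above embodiment, first asset key sentences are obtained and classified according to the asset classification categories to produce sample data; model training is then performed on the constructed sample data based on a convolutional neural network to obtain the text classification model. The text classification model can then be applied in the asset information identification method, improving the accuracy and generality of asset information identification.
Refer to FIG. 2, a schematic flowchart of a training method for an entity recognition model provided by an embodiment of this application. The entity recognition model is obtained by model training based on a long short-term memory network, although it may of course also be trained with other networks.
It should be noted that in this embodiment the entity recognition model is a BiLSTM+CRF entity recognition model. The BiLSTM uses a bidirectional long short-term memory network to obtain a score for each character on each entity label; the CRF learns the constraints among these entity labels from the training data; the entity label of each character is finally obtained, achieving entity recognition.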
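How per-character label scores and label constraints combine into a final label sequence can be sketched with Viterbi decoding, the standard decoding procedure for a linear-chain CRF. The application does not spell out the decoding step, so this is an assumption; the emission and transition scores below are hand-written stand-ins for BiLSTM outputs and learned CRF parameters, and the label names are illustrative.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list of {tag: score}, one dict per character.
    transitions: {(previous_tag, current_tag): score}; missing pairs score 0.
    """
    best = {t: emissions[0][t] for t in tags}
    back = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            step[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = step
        back.append(pointers)

    last = max(tags, key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-OWNER", "I-OWNER"]

# Hand-written scores standing in for BiLSTM outputs (assumption).
emissions = [
    {"O": 2.0, "B-OWNER": 0.5, "I-OWNER": 0.0},
    {"O": 0.1, "B-OWNER": 1.0, "I-OWNER": 1.2},
    {"O": 0.2, "B-OWNER": 0.0, "I-OWNER": 1.0},
]

# A constraint a CRF would typically learn: "I-" may not follow "O".
transitions = {("O", "I-OWNER"): -10.0}

greedy_path = [max(tags, key=lambda t: scores[t]) for scores in emissions]
decoded_path = viterbi(emissions, transitions, tags)
```

Picking the best label per character independently would produce an "I-" label directly after "O", whereas decoding with the transition penalty starts the entity with a "B-" label.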
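As shown in FIG. 2, the training method for the entity recognition model specifically includes steps S201 to S203.
S201. Obtain second asset key sentences, a second asset key sentence being a sentence that includes an asset keyword, an asset attribute, and an asset owner.
A second asset key sentence is a sentence that includes an asset keyword, an asset attribute, and an asset owner. The asset attribute may be movable property, immovable property, and so on; the asset owner is the person who holds ownership of the asset.
In a specific implementation, the asset attribute of a second asset key sentence may be stated directly in the sentence, or may be inferred from the asset keyword that the sentence contains.
For example, a second asset key sentence may be "原告花费600万元从被告处购买了一处不动产,该不动产为位于上海市松江区的一处房产。" ("The plaintiff spent 6 million yuan to purchase an immovable property from the defendant; the immovable property is a real property located in Songjiang District, Shanghai."). It may also be "原告花费600万元从被告处购买了一处房产。" ("The plaintiff spent 6 million yuan to purchase a real property from the defendant."), in which the asset attribute is inferred from the asset keyword.
S202. Annotate the asset keyword, asset attribute, and asset owner separately to construct sample data.
Specifically, the BIO tag set may be used to annotate the asset keyword, asset attribute, and asset owner: "B-" followed by the entity label name marks the first character of an entity, "I-" followed by the entity label name marks a non-initial character of an entity, and "O" marks non-entity text. After annotation, the sample data is constructed from the annotated asset keyword, asset attribute, and asset owner data.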
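The BIO scheme described above can be sketched as a character-level tagging function. The label names `OWNER` and `KEYWORD`, the helper `bio_tag`, and the toy sentence are illustrative assumptions; for simplicity only the first occurrence of each entity string is tagged.

```python
def bio_tag(sentence, entities):
    """Produce one BIO tag per character.

    entities: list of (entity_text, label) pairs. Only the first
    occurrence of each entity_text is tagged in this sketch.
    """
    tags = ["O"] * len(sentence)
    for text, label in entities:
        start = sentence.find(text)
        if start == -1:
            continue  # entity text not present in the sentence
        tags[start] = f"B-{label}"
        for i in range(start + 1, start + len(text)):
            tags[i] = f"I-{label}"
    return tags


sentence = "张三购买了一处房产"
tags = bio_tag(sentence, [("张三", "OWNER"), ("房产", "KEYWORD")])
```

Each character of the sentence receives exactly one tag, so the tagged sentence can be fed directly to a character-level sequence model.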
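S203. Based on a long short-term memory network, perform model training and validation on the sample data to obtain an entity recognition model, and use the entity recognition model as the pre-trained entity recognition model.
In a specific implementation, before model training and validation on the sample data, the method may further include: dividing the sample data according to a preset ratio to obtain a training set and a validation set. The training set is used for model training based on the long short-term memory network to obtain the entity recognition model, and the validation set is used to verify the accuracy of the resulting model. The ratio of the training set to the validation set may be 7:3.
Specifically, the constructed sample data is used to train the entity recognition model with a long short-term memory network: the model is trained on the training set and verified on the validation set, finally yielding the entity recognition model.
In the training method provided by the above embodiment, second asset key sentences are obtained, and the asset keyword, asset attribute, and asset owner in each sentence are annotated to construct sample data; model training is then performed on the constructed sample data based on a long short-term memory network to obtain the entity recognition model. The entity recognition model can then be applied in the asset information identification method, improving the accuracy and generality of asset information identification.
It should be noted that, because the asset information identification method may be applied in a terminal or a server, the trained text classification model and entity recognition model need to be saved in the terminal or server. The terminal may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device; the server may be an independent server or a server cluster.
If the method is applied in a terminal, then to ensure normal operation of the terminal and fast identification of asset information, the trained text classification model and entity recognition model also need to be compressed, and the compressed models are saved in the terminal.
The compression specifically includes pruning, quantization, and Huffman coding of the text classification model and the entity recognition model, which reduces their size and makes them easier to store on a terminal with limited capacity.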
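The three compression steps named above can be sketched on a toy weight list. This is an illustration under stated assumptions: the weights, the pruning threshold, and the quantization step are made-up values, magnitude pruning and uniform quantization are one common choice among several, and the helper names are hypothetical. The application does not specify how each step is carried out.

```python
import heapq
from collections import Counter


def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is small."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def quantize(weights, step):
    """Uniform quantization: map each weight to an integer level."""
    return [round(w / step) for w in weights]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, unique tie-breaker, {symbol: depth}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Made-up weights for illustration.
weights = [0.02, -0.51, 0.49, 0.01, -0.03, 0.98, 0.0, 0.52]

pruned = prune(weights, threshold=0.05)
levels = quantize(pruned, step=0.5)
lengths = huffman_code_lengths(levels)
total_bits = sum(lengths[level] for level in levels)
```

Pruning makes the zero level very frequent, so the Huffman code assigns it the shortest codeword; in this toy case the eight levels need 14 bits instead of the 16 bits of a fixed 2-bit code.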
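Refer to FIG. 3, a schematic flowchart of an asset information identification method provided by an embodiment of this application. The method can be applied in a terminal or a server to identify and extract asset information from legal documents.
As shown in FIG. 3, the asset information identification method specifically includes steps S301 to S304.
S301. Obtain a legal document and parse the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph.
Specifically, the obtained legal document may be in PDF format, or in DOC or DOCX format.
A litigation participant paragraph is a paragraph that includes the plaintiff's basic information or the defendant's basic information, where each may include one or more of: name, company name, domicile, place of registration, place of business, gender, and date of birth.
A fact paragraph is a paragraph that includes information such as the cause of action and the facts of the case as ascertained by the trial court.
In some embodiments, parsing the legal document to obtain target paragraphs specifically includes: matching the legal document against writing keywords to segment the legal document and obtain the target paragraphs.
Specifically, the legal document may be parsed using writing keywords. Writing keywords are keywords commonly used when writing the different paragraphs of each type of legal document, for example: 原告 (plaintiff), 被告 (defendant), 委托代理人 (authorized agent), 法定代表人 (legal representative), 注册地 (place of registration), 经营地 (place of business), 原告诉称 (the plaintiff claims), 经审理查明 (ascertained through trial), 本院认为 (this court holds), 审理终结 (trial concluded), and so on. In a specific implementation, a writing keyword library may be built from the writing keywords, and the legal document is then parsed using the keywords in that library.
For example, when the writing keyword "案号" (case number) is matched at the start of the first paragraph of the legal document and the writing keyword "审理终结" is matched at the end of the sixth paragraph, the second to fifth paragraphs are determined to be litigation participant paragraphs and are taken as target paragraphs; when the writing keyword "经审理查明" is matched at the start of the eighth paragraph and the writing keyword "本院认为" is matched at the start of the eleventh paragraph, the eighth to tenth paragraphs are determined to be fact paragraphs and are taken as target paragraphs.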
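The segmentation example above can be sketched as follows. The eleven-paragraph document is invented for illustration, and the helper names `find_index` and `locate_target_paragraphs` are hypothetical; the boundary rules mirror the example (participant paragraphs lie strictly between the "案号" and "审理终结" paragraphs, and fact paragraphs run from the "经审理查明" paragraph up to, but not including, the "本院认为" paragraph).

```python
def find_index(paragraphs, keyword, at_end=False):
    """Index of the first paragraph that starts (or ends) with keyword."""
    for i, paragraph in enumerate(paragraphs):
        text = paragraph.rstrip("。")  # ignore a trailing full stop
        if text.endswith(keyword) if at_end else text.startswith(keyword):
            return i
    return None


def locate_target_paragraphs(paragraphs):
    """Return (participant_paragraphs, fact_paragraphs)."""
    case_no = find_index(paragraphs, "案号")
    concluded = find_index(paragraphs, "审理终结", at_end=True)
    found = find_index(paragraphs, "经审理查明")
    holds = find_index(paragraphs, "本院认为")
    if None in (case_no, concluded, found, holds):
        raise ValueError("a required writing keyword was not matched")
    participants = paragraphs[case_no + 1:concluded]
    facts = paragraphs[found:holds]
    return participants, facts


# Invented toy document with 11 paragraphs.
document = [
    "案号:(2019)示例001号",
    "原告:张三,男。",
    "委托代理人:王五。",
    "被告:李四,男。",
    "委托代理人:赵六。",
    "原告张三与被告李四合同纠纷一案,现已审理终结。",
    "原告诉称:被告未付款。",
    "经审理查明:原告、被告签订协议一份。",
    "被告名下有房产一处。",
    "被告尚欠原告借款。",
    "本院认为:合同有效。",
]

participant_paragraphs, fact_paragraphs = locate_target_paragraphs(document)
```

On the toy document this selects paragraphs two to five as litigation participant paragraphs and paragraphs eight to ten as fact paragraphs, matching the example in the text.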
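S302. Extract information from the litigation participant paragraph to obtain litigation participant information.
Specifically, the litigation participant information includes the litigation participant's name and the corresponding trial status, where the trial status indicates whether the participant is the defendant or the plaintiff in the case.
In some embodiments, to improve the efficiency of obtaining litigation participant information, refer to FIG. 4: step S302 specifically includes steps S302a and S302b.
S302a. Match name keywords in the litigation participant paragraph to obtain target sentences that match the name keywords.
Specifically, a name keyword is a referring term that stands for a specific entity or person; for example, name keywords may include 原告 (plaintiff), 委托代理人 (authorized agent), 法定代表人 (legal representative), 被告 (defendant), and so on. After the litigation participant paragraph is obtained, the name keywords are matched against it to determine the sentences that contain a name keyword, and those sentences are taken as target sentences.
S302b. Use a regular expression to obtain the litigation participant name and trial status from the target sentence, and take the litigation participant name and trial status as the litigation participant information.
Specifically, a regular expression extracts substrings from a string according to a predefined pattern, which makes it possible to find specific text in a document.
For example, when the target sentence is "原告:张三,男,1970年5月12日生,住xx街xx巷xx。" ("Plaintiff: Zhang San, male, born May 12, 1970, residing at xx Street, xx Lane, xx."), a regular expression such as ^原告:.*$ matches the sentence, and the litigation participant name 张三 (Zhang San) and the trial status 原告 (plaintiff) are obtained from it.
Target sentences containing a name keyword are first matched in the litigation participant paragraph, and the regular expression then extracts the litigation participant name and trial status from those sentences, which improves the speed and efficiency of extracting litigation participant information from the litigation participant paragraph.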
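One possible regular expression for this extraction is sketched below with Python's `re` module. The pattern with capture groups is an assumption introduced for illustration (the application only gives a whole-line pattern); it accepts both the full-width and the ASCII colon and treats the text up to the first comma or full stop as the name.

```python
import re

# Group 1: trial status; group 2: participant name (text up to the
# first comma or full stop). The pattern itself is an illustrative assumption.
PARTY_PATTERN = re.compile(r"^(原告|被告)[::]\s*([^,,。]+)")


def extract_participant(sentence):
    """Return {"status": ..., "name": ...} or None if the sentence
    does not start with a plaintiff/defendant keyword."""
    match = PARTY_PATTERN.match(sentence)
    if match is None:
        return None
    return {"status": match.group(1), "name": match.group(2)}


plaintiff = extract_participant("原告:张三,男,1970年5月12日生,住xx街xx巷xx。")
agent = extract_participant("委托代理人:王五。")
```

Sentences that start with other name keywords, such as 委托代理人, are deliberately not matched here because they do not carry a plaintiff or defendant status.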
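S303. Perform coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph.
Specifically, coreference resolution means determining which noun each referring term in a paragraph points to and replacing the referring term with that noun. Coreference resolution is performed on the fact paragraph according to the litigation participant information to obtain a complete fact paragraph.
In some embodiments, referring to FIG. 5, coreference resolution on the fact paragraph specifically includes steps S303a and S303b.
S303a. Establish the correspondence between the litigation participant name and the trial status.
Specifically, after the litigation participant name and trial status are obtained from the target sentence, the correspondence between the name and its trial status is established. For example, when the target sentence is "原告:张三,男,1970年5月12日生,住xx街xx巷xx。", the litigation participant name obtained from it is 张三 (Zhang San) and the trial status is 原告 (plaintiff), so a correspondence between 原告 and 张三 is established.
S303b. Replace the trial-status terms in the fact paragraph based on the correspondence to complete the coreference resolution of the fact paragraph.
Specifically, after the fact paragraph is obtained, the trial-status terms in it are located and, based on the correspondence between litigation participant names and trial statuses, replaced with the corresponding names, which completes the coreference resolution of the fact paragraph.
For example, suppose the fact paragraph is "2012年7月,原告、被告签订医学影像打印系统经销合作协议一份,约定被告向原告购买医学影像打印系统及服务,仅限销售给淮安二院。" ("In July 2012, the plaintiff and the defendant signed a distribution cooperation agreement for a medical imaging printing system, under which the defendant would purchase the medical imaging printing system and services from the plaintiff for sale only to Huai'an Second Hospital."), where 原告 (plaintiff) corresponds to 张三 (Zhang San) and 被告 (defendant) corresponds to 李四 (Li Si).
After coreference resolution, the paragraph becomes "2012年7月,张三、李四签订医学影像打印系统经销合作协议一份,约定李四向张三购买医学影像打印系统及服务,仅限销售给淮安二院。"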
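The replacement step can be sketched as a single-pass substitution driven by the status-to-name mapping. The function name `resolve_coreferences` is hypothetical; the input paragraph and mapping are the ones from the example above. A single regular-expression pass is used (an implementation choice, not something the application prescribes) so that text already inserted is never replaced a second time.

```python
import re


def resolve_coreferences(fact_paragraph, status_to_name):
    """Replace every trial-status term with the participant's name."""
    # Longer keys first so a longer term is not shadowed by a shorter one.
    keys = sorted(status_to_name, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: status_to_name[m.group(0)], fact_paragraph)


status_to_name = {"原告": "张三", "被告": "李四"}
fact = (
    "2012年7月,原告、被告签订医学影像打印系统经销合作协议一份,"
    "约定被告向原告购买医学影像打印系统及服务,仅限销售给淮安二院。"
)

resolved = resolve_coreferences(fact, status_to_name)
```

Applied to the example paragraph, this reproduces the resolved sentence quoted in the text.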
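S304. Extract information from the resolved fact paragraph using the pre-trained text classification model and the pre-trained entity recognition model to obtain asset information.
Specifically, the asset information includes the asset owner, related parties, and the asset attribute. The asset attribute may be predefined, for example movable property or immovable property; a related party may be a third party associated with the asset owner.
In some embodiments, to improve the efficiency of obtaining asset information from the fact paragraph, refer to FIG. 6: step S304 specifically includes steps S304a to S304c.
S304a. Match asset keywords in the resolved fact paragraph to obtain initial asset key sentences that match the asset keywords.
The asset keywords may be compiled manually and may be, for example, 楼房 (building), 房产 (real property), 房地产 (real estate), 房屋 (house), 住房 (housing), 营业房 (business premises), 商住楼 (commercial-residential building), 商用楼 (commercial building), 大厦 (tower), 店面 (shopfront), 土地使用权 (land-use right), 土地 (land), 宅基地 (homestead land), 用地 (land for use), 车辆 (vehicle), 担保 (security), 保证 (guarantee), 债券 (bond), 存款 (deposit), and so on. The asset keywords are matched in the resolved fact paragraph; sentences that contain an asset keyword are taken as initial asset key sentences, and sentences in which no asset keyword is matched can be filtered out directly.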
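The keyword-matching step can be sketched as sentence splitting followed by a substring test. The shortened keyword list, the splitting rule (split after Chinese sentence-ending punctuation), the helper name, and the sample paragraph are assumptions made for illustration.

```python
import re

# A shortened keyword list for illustration; the text lists many more.
ASSET_KEYWORDS = ["房产", "土地", "车辆", "存款", "担保"]


def initial_asset_sentences(paragraph, keywords):
    """Keep only the sentences that contain at least one asset keyword."""
    # Split after Chinese sentence-ending punctuation, keeping the mark.
    sentences = [s for s in re.split(r"(?<=[。!?])", paragraph) if s.strip()]
    return [s for s in sentences if any(k in s for k in keywords)]


paragraph = "张三、李四签订协议一份。李四名下有房产一处。李四在银行有存款10万元。"
candidates = initial_asset_sentences(paragraph, ASSET_KEYWORDS)
```

The first sentence of the sample paragraph mentions no asset keyword and is dropped, so only the remaining two sentences would be passed on to the text classification model.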
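S304b. Filter the initial asset key sentences using the pre-trained text classification model to obtain target asset key sentences.
The pre-trained text classification model filters the initial asset key sentences, removing sentences that merely contain an asset keyword but no asset attribute; the initial asset key sentences that are not filtered out are taken as the target asset key sentences.
S304c. Identify asset information in the target asset key sentences based on the pre-trained entity recognition model to obtain the asset owner, related parties, and asset attribute.
A target asset key sentence is a sentence that includes an asset keyword and an asset attribute; the pre-trained entity recognition model yields the asset owner, related parties, and asset attribute mentioned in the target asset key sentence.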
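If the entity recognition model outputs one BIO tag per character, the final step is to group the tagged characters back into entities. The sketch below shows that grouping under stated assumptions: the tag sequence is written by hand rather than produced by a model, and the label names `OWNER` and `KEYWORD` and the function `decode_entities` are hypothetical.

```python
def decode_entities(chars, tags):
    """Group characters into (label, text) pairs from BIO tags."""
    entities = []
    label, buffer = None, []

    def flush():
        if label is not None:
            entities.append((label, "".join(buffer)))

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            label, buffer = tag[2:], [ch]
        elif tag.startswith("I-") and label == tag[2:]:
            buffer.append(ch)
        else:
            flush()
            label, buffer = None, []
    flush()
    return entities


sentence = "李四名下有房产一处"
# Hand-written tags standing in for model output (assumption).
tags = ["B-OWNER", "I-OWNER", "O", "O", "O", "B-KEYWORD", "I-KEYWORD", "O", "O"]

entities = decode_entities(sentence, tags)
```

An "I-" tag whose label does not continue the current entity is treated as non-entity text, which keeps malformed tag sequences from producing merged entities.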
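The asset information identification method described above parses the obtained legal document to obtain the litigation participant paragraph and the fact paragraph; extracts information from the litigation participant paragraph to obtain litigation participant information; performs coreference resolution on the fact paragraph using the litigation participant information to obtain the resolved fact paragraph; and finally extracts information from the resolved fact paragraph using the pre-trained text classification model and entity recognition model to obtain asset information. By training a text classification model and an entity recognition model and applying the sentence classification model and entity recognition model on top of keyword matching, the method identifies and extracts asset information from legal documents, is more general than the traditional rule traversal method, supports automatic identification, and improves the accuracy of asset information identification.
Refer to FIG. 7, a schematic block diagram of an asset information identification apparatus further provided by an embodiment of this application. The apparatus is used to execute the aforementioned asset information identification method and may be configured in a server or a terminal.
The server may be an independent server or a server cluster. The terminal may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device.
As shown in FIG. 7, the asset information identification apparatus 400 includes: a document parsing module 401, a litigation information module 402, a coreference resolution module 403, and an information extraction module 404.
The document parsing module 401 is configured to obtain a legal document and parse the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph.
The document parsing module 401 is specifically configured to match the legal document against writing keywords to segment the legal document and obtain the target paragraphs.
The litigation information module 402 is configured to extract information from the litigation participant paragraph to obtain litigation participant information.
The litigation information module 402 includes a name matching submodule 4021 and a regular-expression acquisition submodule 4022.
Specifically, the name matching submodule 4021 is configured to match name keywords in the litigation participant paragraph to obtain target sentences that match the name keywords; the regular-expression acquisition submodule 4022 is configured to use a regular expression to obtain the litigation participant name and trial status from the target sentence and take them as the litigation participant information.
The coreference resolution module 403 is configured to perform coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph.
The coreference resolution module 403 includes a correspondence establishment submodule 4031 and a term replacement submodule 4032.
Specifically, the correspondence establishment submodule 4031 is configured to establish the correspondence between the litigation participant name and the trial status; the term replacement submodule 4032 is configured to replace the trial-status terms in the fact paragraph based on the correspondence to complete the coreference resolution of the fact paragraph.
The information extraction module 404 is configured to extract information from the resolved fact paragraph using the pre-trained text classification model and entity recognition model to obtain asset information.
The information extraction module 404 includes an initial matching submodule 4041, a sentence filtering submodule 4042, and an information identification submodule 4043.
Specifically, the initial matching submodule 4041 is configured to match asset keywords in the resolved fact paragraph to obtain initial asset key sentences that match the asset keywords; the sentence filtering submodule 4042 is configured to filter the initial asset key sentences using the pre-trained text classification model to obtain target asset key sentences; the information identification submodule 4043 is configured to identify asset information in the target asset key sentences based on the pre-trained entity recognition model to obtain the asset owner, related parties, and asset attribute.
It should be noted that, as those skilled in the art will clearly understand, for convenience and brevity of description, the specific working processes of the asset information identification apparatus and its modules described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.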
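The asset information identification apparatus described above may be implemented in the form of computer-readable instructions, which can run on a computer device as shown in FIG. 8.
Refer to FIG. 8, a schematic structural block diagram of a computer device provided by an embodiment of this application. The computer device may be a server or a terminal.
Referring to FIG. 8, the computer device includes a processor, a memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and computer-readable instructions. The computer-readable instructions include program instructions which, when executed, cause the processor to execute the asset information identification method shown in any of the above embodiments.
The processor provides computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer-readable instructions stored in the non-volatile storage medium; when executed by the processor, the computer-readable instructions cause the processor to execute the asset information identification method shown in any of the above embodiments.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in FIG. 8 is merely a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.
It should be understood that the processor may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run the computer-readable instructions stored in the memory to implement the following steps:
obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph; extracting information from the litigation participant paragraph to obtain litigation participant information; performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph; and extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.
An embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions. The computer-readable storage medium may be non-volatile or volatile. The computer-readable instructions include program instructions, and a processor executes the program instructions to implement the asset information identification method shown in any of the above embodiments of this application. The asset information identification method includes the following steps:
obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph;
extracting information from the litigation participant paragraph to obtain litigation participant information;
performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph;
extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device.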
Claims (20)
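- An asset information identification method, including: obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph; extracting information from the litigation participant paragraph to obtain litigation participant information; performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph; and extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.
- The asset information identification method according to claim 1, wherein extracting information from the resolved fact paragraph using the pre-trained text classification model and entity recognition model to obtain asset information includes: matching asset keywords in the resolved fact paragraph to obtain initial asset key sentences that match the asset keywords; filtering the initial asset key sentences using the pre-trained text classification model to obtain target asset key sentences; and identifying asset information in the target asset key sentences based on the pre-trained entity recognition model to obtain an asset owner, a related party, and an asset attribute.
- The asset information identification method according to claim 1, further including: obtaining first asset key sentences, a first asset key sentence being a sentence that includes an asset keyword; labeling the first asset key sentences with the category identifiers corresponding to the asset classification categories to construct sample data; and, based on a convolutional neural network, performing model training and validation on the sample data to obtain a text classification model, and using the text classification model as the pre-trained text classification model.
- The asset information identification method according to claim 1, further including: obtaining second asset key sentences, a second asset key sentence being a sentence that includes an asset keyword, an asset attribute, and an asset owner; annotating the asset keyword, asset attribute, and asset owner separately to construct sample data; and, based on a long short-term memory network, performing model training and validation on the sample data to obtain an entity recognition model, and using the entity recognition model as the pre-trained entity recognition model.
- The asset information identification method according to claim 1, wherein parsing the legal document to obtain target paragraphs includes: matching the legal document against writing keywords to segment the legal document and obtain the target paragraphs.
- The asset information identification method according to claim 1, wherein extracting information from the litigation participant paragraph to obtain litigation participant information includes: matching name keywords in the litigation participant paragraph to obtain target sentences that match the name keywords; and using a regular expression to obtain a litigation participant name and a trial status from the target sentence, and taking the litigation participant name and trial status as the litigation participant information.
- The asset information identification method according to claim 6, wherein performing coreference resolution on the fact paragraph according to the litigation participant information includes: establishing a correspondence between the litigation participant name and the trial status; and replacing the trial-status terms in the fact paragraph based on the correspondence to complete the coreference resolution of the fact paragraph.
- An asset information identification apparatus, including: a document parsing module, configured to obtain a legal document and parse the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph; a litigation information module, configured to extract information from the litigation participant paragraph to obtain litigation participant information; a coreference resolution module, configured to perform coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph; and an information extraction module, configured to extract information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.
- The asset information identification apparatus according to claim 8, wherein the information extraction module includes: an initial matching submodule, configured to match asset keywords in the resolved fact paragraph to obtain initial asset key sentences that match the asset keywords; a sentence filtering submodule, configured to filter the initial asset key sentences using the pre-trained text classification model to obtain target asset key sentences; and an information identification submodule, configured to identify asset information in the target asset key sentences based on the pre-trained entity recognition model to obtain an asset owner, a related party, and an asset attribute.
- The asset information identification apparatus according to claim 8, wherein the litigation information module includes: a name matching submodule, configured to match name keywords in the litigation participant paragraph to obtain target sentences that match the name keywords; and a regular-expression acquisition submodule, configured to use a regular expression to obtain a litigation participant name and a trial status from the target sentence, and take the litigation participant name and trial status as the litigation participant information.
- A computer device, including a memory and a processor; the memory is configured to store computer-readable instructions; the processor is configured to execute the computer-readable instructions and, when executing them, implement an asset information identification method, where the asset information identification method includes: obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph; extracting information from the litigation participant paragraph to obtain litigation participant information; performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph; and extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.
- The computer device according to claim 11, wherein extracting information from the resolved fact paragraph using the pre-trained text classification model and entity recognition model to obtain asset information includes: matching asset keywords in the resolved fact paragraph to obtain initial asset key sentences that match the asset keywords; filtering the initial asset key sentences using the pre-trained text classification model to obtain target asset key sentences; and identifying asset information in the target asset key sentences based on the pre-trained entity recognition model to obtain an asset owner, a related party, and an asset attribute.
- The computer device according to claim 11, wherein parsing the legal document to obtain target paragraphs includes: matching the legal document against writing keywords to segment the legal document and obtain the target paragraphs.
- The computer device according to claim 11, wherein extracting information from the litigation participant paragraph to obtain litigation participant information includes: matching name keywords in the litigation participant paragraph to obtain target sentences that match the name keywords; and using a regular expression to obtain a litigation participant name and a trial status from the target sentence, and taking the litigation participant name and trial status as the litigation participant information.
- The computer device according to claim 14, wherein performing coreference resolution on the fact paragraph according to the litigation participant information includes: establishing a correspondence between the litigation participant name and the trial status; and replacing the trial-status terms in the fact paragraph based on the correspondence to complete the coreference resolution of the fact paragraph.
- A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to implement an asset information identification method, where the asset information identification method includes the following steps: obtaining a legal document and parsing the legal document to obtain target paragraphs, the target paragraphs including a litigation participant paragraph and a fact paragraph; extracting information from the litigation participant paragraph to obtain litigation participant information; performing coreference resolution on the fact paragraph according to the litigation participant information to obtain a resolved fact paragraph; and extracting information from the resolved fact paragraph using a pre-trained text classification model and a pre-trained entity recognition model to obtain asset information.
- The computer-readable storage medium according to claim 16, wherein extracting information from the resolved fact paragraph using the pre-trained text classification model and entity recognition model to obtain asset information includes: matching asset keywords in the resolved fact paragraph to obtain initial asset key sentences that match the asset keywords; filtering the initial asset key sentences using the pre-trained text classification model to obtain target asset key sentences; and identifying asset information in the target asset key sentences based on the pre-trained entity recognition model to obtain an asset owner, a related party, and an asset attribute.
- The computer-readable storage medium according to claim 16, wherein parsing the legal document to obtain target paragraphs includes: matching the legal document against writing keywords to segment the legal document and obtain the target paragraphs.
- The computer-readable storage medium according to claim 16, wherein extracting information from the litigation participant paragraph to obtain litigation participant information includes: matching name keywords in the litigation participant paragraph to obtain target sentences that match the name keywords; and using a regular expression to obtain a litigation participant name and a trial status from the target sentence, and taking the litigation participant name and trial status as the litigation participant information.
- The computer-readable storage medium according to claim 19, wherein performing coreference resolution on the fact paragraph according to the litigation participant information includes: establishing a correspondence between the litigation participant name and the trial status; and replacing the trial-status terms in the fact paragraph based on the correspondence to complete the coreference resolution of the fact paragraph.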
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910882814.5A CN110781299B (zh) | 2019-09-18 | 2019-09-18 | 资产信息识别方法、装置、计算机设备及存储介质 |
CN201910882814.5 | 2019-09-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021051867A1 true WO2021051867A1 (zh) | 2021-03-25 |
Family
ID=69383550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/093110 WO2021051867A1 (zh) | 2019-09-18 | 2020-05-29 | 资产信息识别方法、装置、计算机设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110781299B (zh) |
WO (1) | WO2021051867A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115733903A (zh) * | 2022-09-23 | 2023-03-03 | 湖南华顺信安科技有限公司 | 一种基于自然处理特征工程的网络资产识别方法和系统 |
CN115906844A (zh) * | 2022-11-02 | 2023-04-04 | 中国兵器工业计算机应用技术研究所 | 一种基于规则模板的信息抽取方法和系统 |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11475209B2 (en) | 2017-10-17 | 2022-10-18 | Handycontract Llc | Device, system, and method for extracting named entities from sectioned documents |
US10726198B2 (en) | 2017-10-17 | 2020-07-28 | Handycontract, LLC | Method, device, and system, for identifying data elements in data structures |
CN110781299B (zh) * | 2019-09-18 | 2024-03-19 | 平安科技(深圳)有限公司 | 资产信息识别方法、装置、计算机设备及存储介质 |
CN111914542A (zh) * | 2020-05-21 | 2020-11-10 | 国家计算机网络与信息安全管理中心 | 疑似非法集资市场主体识别方法、装置、终端及存储介质 |
CN111798344B (zh) * | 2020-07-01 | 2023-09-22 | 北京金堤科技有限公司 | 主体名称确定方法和装置、电子设备和存储介质 |
CN111858942A (zh) * | 2020-07-29 | 2020-10-30 | 阳光保险集团股份有限公司 | 一种文本抽取方法、装置、存储介质和电子设备 |
CN112183076A (zh) * | 2020-08-28 | 2021-01-05 | 北京望石智慧科技有限公司 | 一种物质名称提取方法、装置及存储介质 |
CN112052305A (zh) * | 2020-09-02 | 2020-12-08 | 平安资产管理有限责任公司 | 信息提取方法、装置、计算机设备及可读存储介质 |
CN112163072B (zh) * | 2020-09-30 | 2024-05-24 | 北京金堤征信服务有限公司 | 基于多数据源的数据处理方法以及装置 |
CN112528028A (zh) * | 2020-12-28 | 2021-03-19 | 北京华彬立成科技有限公司 | 投融资信息挖掘方法、装置、电子设备和存储介质 |
CN112732897A (zh) * | 2020-12-28 | 2021-04-30 | 平安科技(深圳)有限公司 | 文档处理方法、装置、电子设备及存储介质 |
CN112580299A (zh) * | 2020-12-30 | 2021-03-30 | 讯飞智元信息科技有限公司 | 智能评标方法、评标设备及计算机存储介质 |
CN113158001B (zh) * | 2021-03-25 | 2024-05-14 | 深圳市联软科技股份有限公司 | 一种网络空间ip资产归属及相关性判别方法及系统 |
CN113515587B (zh) * | 2021-06-02 | 2024-06-21 | 中国神华国际工程有限公司 | 一种标的物信息提取方法、装置、计算机设备及存储介质 |
CN113902568A (zh) * | 2021-10-30 | 2022-01-07 | 平安科技(深圳)有限公司 | 绿色资产的占比的识别方法及相关产品 |
CN113902569A (zh) * | 2021-10-30 | 2022-01-07 | 平安科技(深圳)有限公司 | 数字资产中的绿色资产的占比的识别方法及相关产品 |
CN115238645A (zh) * | 2022-08-03 | 2022-10-25 | 中国电子科技集团公司信息科学研究院 | 资产数据识别方法、装置、电子设备和计算机存储介质 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009097558A9 (en) * | 2008-01-30 | 2010-01-28 | Thomson Reuters Global Resources | Financial event and relationship extraction |
CN109446328A (zh) * | 2018-11-02 | 2019-03-08 | 成都四方伟业软件股份有限公司 | 一种文本识别方法、装置及其存储介质 |
CN109582772A (zh) * | 2018-11-27 | 2019-04-05 | 平安科技(深圳)有限公司 | 合同信息提取方法、装置、计算机设备和存储介质 |
CN109815268A (zh) * | 2018-12-21 | 2019-05-28 | 上海诺悦智能科技有限公司 | 一种交易制裁名单匹配系统 |
CN110781299A (zh) * | 2019-09-18 | 2020-02-11 | 平安科技(深圳)有限公司 | 资产信息识别方法、装置、计算机设备及存储介质 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120173289A1 (en) * | 2010-09-16 | 2012-07-05 | Thomson Reuters (Sientific) Llc | System and method for detecting and identifying patterns in insurance claims |
US20160103823A1 (en) * | 2014-10-10 | 2016-04-14 | The Trustees Of Columbia University In The City Of New York | Machine Learning Extraction of Free-Form Textual Rules and Provisions From Legal Documents |
CN108287818A (zh) * | 2018-01-03 | 2018-07-17 | 小草数语(北京)科技有限公司 | 裁判文书中金额的提取方法、装置和电子设备 |
CN109446511B (zh) * | 2018-09-10 | 2022-07-08 | 平安科技(深圳)有限公司 | 裁判文书处理方法、装置、计算机设备和存储介质 |
CN110134792B (zh) * | 2019-05-22 | 2022-03-08 | 北京金山数字娱乐科技有限公司 | 文本识别方法、装置、电子设备以及存储介质 |
- 2019-09-18: CN, application CN201910882814.5A, patent CN110781299B (zh), status: Active
- 2020-05-29: WO, application PCT/CN2020/093110, patent WO2021051867A1 (zh), status: Application Filing
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115733903A (zh) * | 2022-09-23 | 2023-03-03 | 湖南华顺信安科技有限公司 | 一种基于自然处理特征工程的网络资产识别方法和系统 |
CN115906844A (zh) * | 2022-11-02 | 2023-04-04 | 中国兵器工业计算机应用技术研究所 | 一种基于规则模板的信息抽取方法和系统 |
CN115906844B (zh) * | 2022-11-02 | 2023-08-29 | 中国兵器工业计算机应用技术研究所 | 一种基于规则模板的信息抽取方法和系统 |
Also Published As
Publication number | Publication date |
---|---|
CN110781299A (zh) | 2020-02-11 |
CN110781299B (zh) | 2024-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021051867A1 (zh) | 资产信息识别方法、装置、计算机设备及存储介质 | |
CN110163478B (zh) | 一种合同条款的风险审查方法及装置 | |
WO2022174491A1 (zh) | 基于人工智能的病历质控方法、装置、计算机设备及存储介质 | |
WO2022142011A1 (zh) | 一种地址识别方法、装置、计算机设备及存储介质 | |
CN110276023B (zh) | Poi变迁事件发现方法、装置、计算设备和介质 | |
WO2021139191A1 (zh) | 数据标注的方法以及数据标注的装置 | |
WO2021134524A1 (zh) | 数据处理方法、装置、电子设备和存储介质 | |
CN112287069B (zh) | 基于语音语义的信息检索方法、装置及计算机设备 | |
CN110134780B (zh) | 文档摘要的生成方法、装置、设备、计算机可读存储介质 | |
WO2023040493A1 (zh) | 事件检测 | |
WO2022089227A1 (zh) | 地址参数处理方法及相关设备 | |
CN110083832B (zh) | 文章转载关系的识别方法、装置、设备及可读存储介质 | |
CN112163072B (zh) | 基于多数据源的数据处理方法以及装置 | |
CN112613315B (zh) | 一种文本知识自动抽取方法、装置、设备及存储介质 | |
TWI745777B (zh) | 資料歸檔方法、裝置、電腦裝置及存儲媒體 | |
CN113762303B (zh) | 图像分类方法、装置、电子设备及存储介质 | |
WO2021114634A1 (zh) | 文本标注方法、设备及存储介质 | |
CN112199954A (zh) | 基于语音语义的疾病实体匹配方法、装置及计算机设备 | |
CN116912847A (zh) | 一种医学文本识别方法、装置、计算机设备及存储介质 | |
WO2023092719A1 (zh) | 病历数据的信息抽取方法、终端设备及可读存储介质 | |
CN107729944A (zh) | 一种低俗图片的识别方法、装置、服务器及存储介质 | |
CN116089732B (zh) | 基于广告点击数据的用户偏好识别方法及系统 | |
CN115982363A (zh) | 基于提示学习的小样本关系分类方法、系统、介质及电子设备 | |
CN115358817A (zh) | 基于社交数据的智能产品推荐方法、装置、设备及介质 | |
CN115880702A (zh) | 数据处理方法、装置、设备、程序产品及存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20865427; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 20865427; Country of ref document: EP; Kind code of ref document: A1 |