WO2020211393A1 - Written judgment information retrieval method and device, computer apparatus, and storage medium

Info

Publication number
WO2020211393A1
Authority
WO
WIPO (PCT)
Prior art keywords
judgment
word
semantic
factor
document
Prior art date
Application number
PCT/CN2019/122888
Other languages
French (fr)
Chinese (zh)
Inventor
杨凤鑫
徐国强
邱寒
Original Assignee
深圳壹账通智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020211393A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to a method, device, computer equipment and storage medium for searching judgment document information.
  • Over time, accumulated judgment documents amount to a massive body of data, and retrieving the currently required information from it is a persistent problem for users.
  • Conventional retrieval methods include information index retrieval and semantic information retrieval. Information index retrieval is based on inverted indexing, keyword matching, and similar techniques, and its results are inaccurate; semantic information retrieval is more accurate, but it involves a large amount of data processing and its retrieval speed is slow.
  • a method, device, computer equipment, and storage medium for searching judgment document information are provided.
  • a method for searching judgment document information including:
  • the judgment document database stores data used to characterize the correspondence between the hash value and the judgment document;
  • the information to be retrieved is matched with each judgment document in the set of target judgment documents to be selected to obtain the target judgment document.
  • a judgment document information retrieval device including:
  • the word splitting module is used to obtain the information to be retrieved and perform semantic-based word splitting on the information to be retrieved;
  • the factor extraction module is used to extract the focus words in the semantic split result and perform factor index extraction on the semantic split result to obtain a factor vector, where the factor index is an index that affects the judgment result in a judgment document;
  • the encoding compression module is used to input the focus words and the factor vector as features into a preset semantic hash vector model, read the encoding of the encoding layer in the preset semantic hash vector model, and compress the encoding into a hash value;
  • the search module is configured to search for similar judgment documents in the judgment document database according to the hash value to generate a set of target judgment documents to be selected, where the judgment document database stores data used to characterize the correspondence between the hash value and the judgment document;
  • the similarity matching module is used to perform similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain the target judgment document.
  • A computer device, including a memory and one or more processors, where the memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to execute the steps of the judgment document information retrieval method described above.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to execute the steps of the judgment document information retrieval method described above.
  • Fig. 1 is a schematic flow chart of a method for searching judgment document information according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a method for retrieving judgment document information in another embodiment.
  • Fig. 3 is a block diagram of a judgment document information retrieval device according to one or more embodiments.
  • Fig. 4 is a block diagram of a computer device according to one or more embodiments.
  • a method for searching judgment document information includes:
  • S100 Obtain information to be retrieved, and perform semantic-based word splitting on the information to be retrieved.
  • Semantic-based word splitting refers to splitting the information to be retrieved into independent words based on the meaning of the words.
  • The information to be retrieved can be a part of a certain judgment document, such as a paragraph, a sentence, or the judgment result; it can also be a key part of the judgment document, such as the judgment result and the names of the two parties involved.
  • For example, the semantic-based word splitting results are: hold, iron rod, axe, tool, Yumou, execution, beating, victim, Youmou, help, Yumoujia, when resisting, being hacked.
  • The factor index is an index that affects the judgment result in the judgment document.
  • Focus words are generally keywords that characterize the main content of the entire judgment document.
  • A focus word set can be constructed from historical experience data, and the focus words can be obtained by matching the preset focus word set against the word split result. For example, the focus words can be beating, slashing, serious, minor, knife, lethal, etc.
  • Factor indicators influence (determine) the judgment result of the entire judgment document, such as whether the act was for profit, whether it was an infringement, and whether it was intentional injury.
  • The selection of factor indicators can also be based on analysis of historical experience data.
  • The body text and conclusion of the judgment document are selected as the data for analysis, factor indicators are selected from them, and each factor indicator is qualitatively judged to obtain the factor vector, for example "for profit: yes" and "lethal: no".
  • The factor index system adopts a tree structure: it is divided into multiple large-type factor indicators, and multiple small-type factor indicators are assigned under each large-type factor indicator.
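The tree-structured factor index system and the qualitative yes/no judgments can be illustrated with a small sketch. This is not the patent's implementation: the `FactorNode` class, the `factor_vector` function, and the indicator names are all hypothetical, and the 0/1 encoding in leaf order is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FactorNode:
    """One node in a hypothetical factor index tree."""
    name: str
    children: list["FactorNode"] = field(default_factory=list)

    def leaves(self) -> list[str]:
        """Return the names of all small-type (leaf) indicators, left to right."""
        if not self.children:
            return [self.name]
        names: list[str] = []
        for child in self.children:
            names.extend(child.leaves())
        return names


def factor_vector(tree: FactorNode, judgments: dict[str, bool]) -> list[int]:
    """Encode qualitative yes/no judgments as a 0/1 vector in leaf order.

    Indicators with no recorded judgment default to 0 ("no"); this default
    is an assumption of the sketch.
    """
    return [1 if judgments.get(name, False) else 0 for name in tree.leaves()]


# Two large-type indicators, each with two small-type indicators (names invented).
tree = FactorNode("root", [
    FactorNode("intent", [FactorNode("for_profit"), FactorNode("intentional_injury")]),
    FactorNode("harm", [FactorNode("lethal"), FactorNode("infringement")]),
])
vec = factor_vector(tree, {"for_profit": True, "lethal": False, "infringement": True})
print(vec)  # [1, 0, 0, 1]
```

Fixing the leaf order once means every document's factor vector has the same layout, which is what a downstream model would need.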
  • S300 Input the focus words and factor vector as features into the preset semantic hash vector model, read the code of the coding layer in the preset semantic hash vector model, and compress the code into a hash value.
  • The preset semantic hash vector model is a pre-built model obtained by training a semantic hash model on historical data. Specifically, it can be obtained by training a deep neural network model on historical data.
  • Step S300 can be understood as a data compression process: for the large amount of data input into the preset semantic hash vector model, the code in the coding layer is read and the input is compressed into a hash value. For example, suppose step S200 yields 10,000 focus words and 50 factor vectors, so that 10,050 features are input into the preset semantic hash vector model. Through the coding and compression in step S300, these can be compressed into a 16-dimensional or 32-dimensional hash value. The amount of data is greatly reduced, which benefits subsequent processing.
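The compression of a coding-layer output into a hash value can be sketched as follows. The text does not say how the code is binarized, so thresholding each activation at 0.5 is an assumption, and `code_to_hash`, `hash_to_bits`, and the sample activations are hypothetical.

```python
def code_to_hash(code: list[float], threshold: float = 0.5) -> int:
    """Binarize coding-layer activations and pack the bits into an integer.

    The first activation becomes the most significant bit. The threshold
    value is an assumption of this sketch, not something the source specifies.
    """
    value = 0
    for activation in code:
        value = (value << 1) | (1 if activation >= threshold else 0)
    return value


def hash_to_bits(value: int, width: int) -> str:
    """Render a packed hash as a fixed-width bit string for inspection."""
    return format(value, f"0{width}b")


# Hypothetical 16-dimensional coding-layer output (for example, sigmoid activations).
code = [0.1, 0.2, 0.05, 0.9, 0.8, 0.3, 0.7, 0.6,
        0.4, 0.95, 0.1, 0.2, 0.55, 0.9, 0.85, 0.99]
h = code_to_hash(code)
print(hash_to_bits(h, len(code)))  # 0001101101001111
```

Packing the bits into one integer keeps each document's hash small enough to store next to the document record.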
  • The judgment document database to be searched is a pre-built database that stores a large number of judgment documents together with the hash value corresponding to each. Since the hash value is generated from the input features, and the input features accurately represent the entire information to be retrieved, judgment documents similar to the information to be retrieved can be found in the database based on the hash value. In addition, because the data has been compressed, many similar judgment documents may be found, and these are aggregated into a set of target judgment documents to be selected. That is, using the hash value obtained in step S300, similar judgment documents are searched for in the massive data of the database to obtain the set of target judgment documents to be selected.
  • For example, in step S300, 10,050 vectors are compressed into a 16-dimensional hash value, and a search in the judgment document database finds 1,000 similar judgment documents. It should be pointed out that a similar judgment document can be a complete judgment document or a part of a judgment document.
  • For example, the information to be retrieved is "the first-instance civil judgment of Zhong Fengjian, Chen Dexiang, and Zhang Haiyuan motor vehicle traffic accident liability dispute". In step S400 it is divided into 1,000 vectors and input into the preset semantic hash vector model to obtain a 32-dimensional hash value, specifically [0 0 0 1 1...0 1 1 1].
  • The set of similar candidate target judgment documents found in the judgment document database includes: [0 0 0 1 1... 0 1 1] Jiang Xueqin and Taiping Property Insurance Co., Ltd. Yichang Center Branch, Shi Lei motor vehicle traffic accident liability dispute, first-instance civil judgment; [0 0 1 1... 0 1 0 1] Zhang Han and Xiamen Jinyuan Financial Guarantee Co., Ltd. application for retrial, civil ruling on general loan contract dispute. Because the original data is greatly compressed into the hash value, relatively similar candidate information can be found quickly in the massive data.
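The first-stage lookup by hash value can be sketched as a nearest-hash search. The text does not name a distance measure, so Hamming distance with a fixed radius is an assumption; `hamming_distance`, `find_candidates`, the document identifiers, and the 16-bit hashes are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two packed hashes differ."""
    return bin(a ^ b).count("1")


def find_candidates(query_hash: int, database: dict[str, int],
                    max_distance: int = 2) -> list[str]:
    """Return document ids whose stored hash is within max_distance bits
    of the query hash, closest first (ties broken by id)."""
    hits = [(hamming_distance(query_hash, h), doc_id)
            for doc_id, h in database.items()]
    return [doc_id for dist, doc_id in sorted(hits) if dist <= max_distance]


# Hypothetical mapping from document id to stored 16-bit hash.
database = {
    "traffic_accident_A": 0b0001101101001111,
    "traffic_accident_B": 0b0001101101001011,
    "loan_contract_C":    0b1110010010110000,
}
candidates = find_candidates(0b0001101101001111, database)
print(candidates)  # ['traffic_accident_A', 'traffic_accident_B']
```

A linear scan is enough to show the idea; a real database of this size would more plausibly bucket documents by hash rather than compare against every entry.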
  • S500 Perform similarity matching between the information to be retrieved and the judgment documents in the set of target judgment documents to be selected to obtain the target judgment document.
  • The above judgment document information retrieval method obtains the information to be retrieved, performs semantic-based word splitting on it, extracts the focus words in the semantic split result, and performs factor index extraction on the semantic split result to obtain the factor vector. The focus words and factor vector are input as features into the preset semantic hash vector model, the code of the coding layer in the model is read, and the code is compressed into a hash value. Similar judgment documents are then searched for in the judgment document database according to the hash value to generate a set of target judgment documents to be selected, and the information to be retrieved is matched for similarity against each judgment document in that set to obtain the target judgment document.
  • Throughout the process, both the information to be retrieved and the data in the judgment document database are compressed into hash values. In the first stage, positioning is performed according to the hash value to find the set of candidate target judgment documents; in the second stage, similarity matching is used to find the target judgment document within that set. Hash value compression significantly reduces the amount of data processing, and the combination of hash value compression and similarity matching ensures both the efficiency and the accuracy of retrieval.
  • Performing factor index extraction on the semantic split result to obtain a factor vector includes: extracting the factor indicators associated with the semantic split result; and, according to the semantic split result, qualitatively judging the extracted factor indicators to obtain the factor vector.
  • Factor indicators influence the final judgment result, such as whether the act constitutes a crime, whether joint and several liability is borne, whether there is illegal appropriation, and whether it is for profit.
  • The indicators to extract can be preset based on analysis of historical judgment texts. Since a judgment document has a fixed format, the judgment result section states the factual basis of the judgment result. Factor indicators can be extracted from this conventional factual basis and then qualitatively judged to determine whether the situation corresponding to each factor indicator exists, which yields the factor vector. It can be understood that the factor vector includes two parts: the factor indicator and the qualitative judgment result.
  • For example, the factor indicators include whether the act constitutes a crime, whether joint and several liability is borne, whether there is illegal appropriation, and whether it is for profit. After these factor indicators are qualitatively judged, the factor vector is: does not constitute a crime, bears joint and several liability, illegal appropriation, for profit.
  • Extracting the focus words in the semantic splitting result includes: obtaining a focus word set; and extracting the focus words in the semantic splitting result according to the focus word set.
  • The focus word set can be constructed in advance. For example, historical data analysis shows which words in judgment documents are focus words.
  • A focus word is generally a word that appears multiple times in a judgment document and can be determined based on word frequency, such as beatings, guns, knives, slashes, etc.
  • The focus word set can be generated in the following manner: obtain samples of historical judgment documents; randomly select a single historical judgment document sample and extract the words whose word frequency in that sample is greater than a preset word frequency threshold, obtaining a set of candidate words; obtain the word frequency of each candidate word in the other historical judgment document samples and record it as the inverse word frequency; calculate the product of each candidate word's word frequency and its inverse word frequency, and select the words whose product is greater than a preset threshold to generate the focus word set.
  • The focus word set considers both word frequency and inverse word frequency. Inverse word frequency accounts for the fact that some words, such as certain modal particles, may have a high word frequency in a single judgment document but also appear frequently in other judgment documents; excluding the interference of these words allows the focus word set to be constructed accurately.
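The focus word set construction described above resembles a TF-IDF style score, and can be sketched as follows. The exact form of the "inverse word frequency" is not given, so the logarithmic form used here is an assumption, as are the function name `build_focus_words`, both threshold values, and the toy documents.

```python
import math
from collections import Counter


def build_focus_words(selected: list[str], others: list[list[str]],
                      tf_threshold: int = 1,
                      score_threshold: float = 1.0) -> set[str]:
    """Pick focus words from one selected document.

    Candidates are words whose count in the selected document exceeds
    tf_threshold. Each candidate is scored as count * inverse frequency,
    where the inverse frequency is log((1 + N) / (1 + df)) with N the number
    of other documents and df the number of them containing the word.
    This formula is an assumption of the sketch.
    """
    counts = Counter(selected)
    focus: set[str] = set()
    for word, count in counts.items():
        if count <= tf_threshold:
            continue
        df = sum(1 for doc in others if word in doc)
        inverse = math.log((1 + len(others)) / (1 + df))
        if count * inverse > score_threshold:
            focus.add(word)
    return focus


selected = ["beat", "knife", "the", "beat", "the", "knife", "the", "victim"]
others = [["the", "loan", "contract"],
          ["the", "vehicle", "accident"],
          ["the", "knife", "injury"]]
focus = build_focus_words(selected, others)
print(sorted(focus))  # ['beat', 'knife']
```

In this toy data "the" is frequent in the selected document but appears in every other document, so its score is zero and it is excluded, which is the filtering effect the text attributes to the inverse word frequency.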
  • Before step S200, the method further includes cleaning the semantically split words to remove modal particles and enterprise names.
  • The company name can be identified by named entity recognition based on a database. The database stores common company names and grammar-based regular modal particles. The split words are looked up in the database, and when a split word matches an entry, the word is filtered out.
  • For example, the information to be retrieved is as follows: "The court Zhu was in front of Meishang Furniture Factory in UNK Community, Jiangning District, Nanjing. He had a dispute with Yu Jia because of driving problems, and Zhu gathered others to the workshop of Meishang Furniture Factory."
  • After splitting, the word "Meishang Furniture Factory" is identified as a company name and is filtered out.
  • Cleaning the split words removes unnecessary or worthless words before the next step, which significantly reduces the amount of data processed in the next step and improves the processing efficiency of the entire solution.
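The cleaning step can be sketched as a lookup against stored company names and modal particles. The text describes database-backed named entity recognition; plain set membership is used here as a simplified stand-in, and `clean_words` and the sample word lists are hypothetical.

```python
def clean_words(words: list[str], company_names: set[str],
                modal_particles: set[str]) -> list[str]:
    """Drop split words that match a stored company name or modal particle.

    Set lookup stands in for the database-backed recognition described in
    the text; word order is preserved.
    """
    blocked = company_names | modal_particles
    return [word for word in words if word not in blocked]


words = ["Zhu", "Meishang Furniture Factory", "dispute", "ah", "driving"]
cleaned = clean_words(words, {"Meishang Furniture Factory"}, {"ah", "oh"})
print(cleaned)  # ['Zhu', 'dispute', 'driving']
```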
  • step S500 includes:
  • S540 Obtain the similarity between the information to be retrieved and each subset in the set of target judgment documents to be selected.
  • S560 Select the subset with the highest similarity as the target judgment document.
  • the similarity matching model is a pre-built model that can accurately identify the similarity between input data.
  • the similarity matching model method is adopted to quickly and accurately determine the target judgment document, which brings convenience to the user.
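The second-stage selection of the most similar candidate can be sketched as follows. The preset similarity matching model is not specified, so cosine similarity over bag-of-words counts is swapped in as a simple stand-in; `cosine_similarity`, `best_match`, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: list[str], b: list[str]) -> float:
    """Cosine similarity between the word-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def best_match(query: list[str], candidates: dict[str, list[str]]) -> str:
    """Return the id of the candidate with the highest similarity to the query."""
    return max(candidates,
               key=lambda doc_id: cosine_similarity(query, candidates[doc_id]))


query = ["vehicle", "traffic", "accident", "liability"]
candidates = {
    "doc_traffic": ["vehicle", "traffic", "accident", "insurance"],
    "doc_loan": ["loan", "contract", "retrial"],
}
print(best_match(query, candidates))  # doc_traffic
```

Because this stage only scores the candidate set from the hash lookup, a more expensive similarity model can be afforded here than would be practical over the whole database.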
  • a judgment document information retrieval device the device includes:
  • the word splitting module 100 is used to obtain the information to be retrieved, and perform semantic-based word splitting on the information to be retrieved;
  • the factor extraction module 200 is used to extract the focus words in the semantic splitting result, and extract the factor index of the semantic splitting result to obtain a factor vector, the factor index is an index that affects the judgment result in the judgment document;
  • the encoding compression module 300 is used to input the focus words and factor vectors as features into the preset semantic hash vector model, read the encoding of the encoding layer in the preset semantic hash vector model, and compress the encoding into a hash value;
  • the searching module 400 is configured to search for similar judgment documents in the judgment document database according to the hash value to generate a set of target judgment documents to be selected, and the judgment document database stores data used to characterize the correspondence between the hash value and the judgment document;
  • the similarity matching module 500 is used to perform similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain the target judgment document.
  • In the above judgment document information retrieval device, the word splitting module 100 obtains the information to be retrieved and performs semantic-based word splitting on it; the factor extraction module 200 extracts the focus words in the semantic split result and performs factor index extraction on the semantic split result to obtain the factor vector; the encoding compression module 300 inputs the focus words and factor vector as features into the preset semantic hash vector model, reads the encoding of the encoding layer in the model, and compresses the encoding into a hash value; the search module 400 searches for similar judgment documents in the judgment document database to generate a set of target judgment documents to be selected; and the similarity matching module 500 performs similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain the target judgment document.
  • Throughout the process, both the information to be retrieved and the data in the judgment document database are compressed into hash values. In the first stage, positioning is performed according to the hash value to find the set of candidate target judgment documents; in the second stage, similarity matching is used to find the target judgment document within that set. Hash value compression significantly reduces the amount of data processing, and the combination of hash value compression and similarity matching ensures both the efficiency and the accuracy of retrieval.
  • The factor extraction module 200 is also used to extract the factor indicators associated with the semantic split result and, according to the semantic split result, qualitatively judge the extracted factor indicators to obtain the factor vector.
  • The factor extraction module is also used to obtain the focus word set and extract the focus words in the semantic split result according to the focus word set.
  • The factor extraction module is also used to obtain samples of historical judgment documents; randomly select a single historical judgment document sample and extract the words whose word frequency in that sample is greater than a preset word frequency threshold, obtaining a set of candidate words; obtain the word frequency of each candidate word in the other historical judgment document samples and record it as the inverse word frequency; and calculate the product of each candidate word's word frequency and its inverse word frequency, selecting the words whose product is greater than a preset threshold to generate the focus word set.
  • The above judgment document information retrieval device further includes a cleaning module, which is used to clean the semantically split words to remove modal particles and enterprise names.
  • The similarity matching module 500 is further configured to input the information to be retrieved and the set of target judgment documents to be selected into the preset similarity matching model; obtain the similarity between the information to be retrieved and each subset in the set of target judgment documents to be selected; and select the subset with the highest similarity as the target judgment document.
  • each module in the above judgment document information retrieval device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 4.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the computer equipment database is used to store data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for searching judgment document information is realized.
  • FIG. 4 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions. When the computer-readable instructions are executed by the processors, the one or more processors implement the steps of the judgment document information retrieval method provided in any of the embodiments of the present application.
  • One or more non-volatile computer-readable storage media store computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the judgment document information retrieval method provided in any of the embodiments of the present application.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A written judgment information retrieval method, comprising: obtaining information to undergo retrieval, and performing semantic-based word segmentation on the information; extracting focus terms from a semantic segmentation result, and extracting factor indexes from the semantic segmentation result so as to obtain factor vectors; inputting the focus terms and the factor vectors as features into a preset semantic hash vector model, reading codes in a coding layer of the preset semantic hash vector model, and compressing the codes into hash values; searching, according to the hash values, a written judgment database for a similar written judgment, and generating a set of target written judgments to be selected; and performing similarity matching with respect to the information and each written judgment in the set, so as to obtain a target written judgment.

Description

Judgment document information retrieval method, device, computer equipment and storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 201910303290X, filed with the Chinese Patent Office on April 16, 2019 and entitled "Judgment Document Information Retrieval Method, Device, Computer Equipment and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to a judgment document information retrieval method, device, computer equipment and storage medium.
Background
With the development of science and technology, a large amount of data is flooding into people's lives, and retrieving the required data from this massive data has become a difficult problem.
Take judgment documents as an example. Over time, accumulated judgment documents amount to a massive body of data, and retrieving the currently required information from it is a persistent problem for users. Conventional retrieval methods include information index retrieval and semantic information retrieval. Information index retrieval is based on inverted indexing, keyword matching, and similar techniques, and its results are inaccurate; semantic information retrieval is more accurate, but it involves a large amount of data processing and its retrieval speed is slow.
Summary of the invention
According to various embodiments disclosed in the present application, a judgment document information retrieval method, device, computer equipment, and storage medium are provided.
A judgment document information retrieval method, including:
obtaining the information to be retrieved, and performing semantic-based word splitting on the information to be retrieved;
extracting the focus words in the semantic split result, and performing factor index extraction on the semantic split result to obtain a factor vector, where the factor index is an index that affects the judgment result in a judgment document;
inputting the focus words and the factor vector as features into a preset semantic hash vector model, reading the code of the coding layer in the preset semantic hash vector model, and compressing the code into a hash value;
searching, according to the hash value, for similar judgment documents in a judgment document database to generate a set of target judgment documents to be selected, where the judgment document database stores data used to characterize the correspondence between the hash value and the judgment document; and
performing similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain the target judgment document.
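The five steps of the method compose into a two-stage pipeline, which can be sketched with each stage passed in as a function. This only illustrates the control flow: the `retrieve` function and its parameter names are hypothetical, and the lambdas in the usage example are toy stand-ins for the word splitter, feature extraction, hash model, database lookup, and similarity model.

```python
from typing import Callable, Optional


def retrieve(query: str,
             split: Callable[[str], list[str]],
             extract_features: Callable[[list[str]], list[str]],
             encode: Callable[[list[str]], int],
             search: Callable[[int], list[str]],
             rank: Callable[[str, list[str]], Optional[str]]) -> Optional[str]:
    """Run the two-stage retrieval: hash-based candidate lookup, then ranking."""
    words = split(query)                 # semantic-based word splitting
    features = extract_features(words)   # focus words and factor vector
    hash_value = encode(features)        # coding layer compressed to a hash
    candidates = search(hash_value)      # stage 1: candidate set by hash
    return rank(query, candidates)       # stage 2: similarity matching


# Toy stand-ins so the control flow can be exercised end to end.
result = retrieve(
    "knife injury dispute",
    split=str.split,
    extract_features=lambda words: [w for w in words if w in {"knife", "injury"}],
    encode=lambda feats: len(feats),
    search=lambda h: {2: ["doc_a", "doc_b"]}.get(h, []),
    rank=lambda q, docs: docs[0] if docs else None,
)
print(result)  # doc_a
```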
A judgment document information retrieval device, including:
a word splitting module, used to obtain the information to be retrieved and perform semantic-based word splitting on the information to be retrieved;
a factor extraction module, used to extract the focus words in the semantic split result and perform factor index extraction on the semantic split result to obtain a factor vector, where the factor index is an index that affects the judgment result in a judgment document;
an encoding compression module, used to input the focus words and the factor vector as features into a preset semantic hash vector model, read the encoding of the encoding layer in the preset semantic hash vector model, and compress the encoding into a hash value;
a search module, used to search, according to the hash value, for similar judgment documents in a judgment document database to generate a set of target judgment documents to be selected, where the judgment document database stores data used to characterize the correspondence between the hash value and the judgment document; and
a similarity matching module, used to perform similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain the target judgment document.
A computer device, including a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
obtaining information to be retrieved, and performing semantic-based word splitting on the information to be retrieved;
extracting focus words from the semantic splitting result, and performing factor indicator extraction on the semantic splitting result to obtain a factor vector, where a factor indicator is an indicator that affects the judgment result in a judgment document;
inputting the focus words and the factor vector as features into a preset semantic hash vector model, reading the code of the coding layer in the preset semantic hash vector model, and compressing the code into a hash value;
searching a judgment document database for similar judgment documents according to the hash value to generate a candidate set of target judgment documents, where the judgment document database stores data characterizing the correspondence between hash values and judgment documents; and
performing similarity matching between the information to be retrieved and each judgment document in the candidate set of target judgment documents to obtain a target judgment document.
One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining information to be retrieved, and performing semantic-based word splitting on the information to be retrieved;
extracting focus words from the semantic splitting result, and performing factor indicator extraction on the semantic splitting result to obtain a factor vector, where a factor indicator is an indicator that affects the judgment result in a judgment document;
inputting the focus words and the factor vector as features into a preset semantic hash vector model, reading the code of the coding layer in the preset semantic hash vector model, and compressing the code into a hash value;
searching a judgment document database for similar judgment documents according to the hash value to generate a candidate set of target judgment documents, where the judgment document database stores data characterizing the correspondence between hash values and judgment documents; and
performing similarity matching between the information to be retrieved and each judgment document in the candidate set of target judgment documents to obtain a target judgment document.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a judgment document information retrieval method according to one or more embodiments.
Fig. 2 is a schematic flowchart of a judgment document information retrieval method in another embodiment.
Fig. 3 is a block diagram of a judgment document information retrieval device according to one or more embodiments.
Fig. 4 is a block diagram of a computer device according to one or more embodiments.
Detailed Description
To make the technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
As shown in Fig. 1, a judgment document information retrieval method includes:
S100: Obtain information to be retrieved, and perform semantic-based word splitting on the information to be retrieved.
Semantic-based word splitting means splitting the information to be retrieved into independent words according to the meaning of the words. The information to be retrieved may be part of a judgment document, such as a paragraph, a sentence, or the judgment result; it may also be key content of a judgment document, such as the judgment result or the names of the parties involved. For example, if the information to be retrieved is "持铁棍、斧子等工具对于某甲实施殴打,被害人尤某在帮助于某甲抵挡时被砍" (roughly: "beat Yu Moujia with tools such as an iron rod and an axe; the victim, You, was slashed while helping Yu Moujia fend off the attack"), the semantic-based word splitting result is: 持 (hold), 铁棍 (iron rod), 斧头 (axe), 工具 (tool), 于某甲 (Yu Moujia), 实施 (carry out), 殴打 (beat), 被害人 (victim), 尤某 (You), 帮助 (help), 于某甲 (Yu Moujia), 抵挡时 (while fending off), 被砍 (was slashed).
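The splitting step can be illustrated with a small sketch. The text does not say which segmentation algorithm is used, so the block below swaps in dictionary-based forward maximum matching as a stand-in; the function name `segment`, the vocabulary, and the list of skipped tokens are all assumptions made for illustration. It uses the example query exactly as written in the text, so the axe token appears as 斧子.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    vocabulary entry that starts there; otherwise emit one character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical vocabulary and skip list, chosen only for this example.
VOCABULARY = {"铁棍", "斧子", "工具", "于某甲", "实施", "殴打",
              "被害人", "尤某", "帮助", "抵挡时", "被砍"}
SKIPPED = set(",、。") | {"等", "对", "在"}  # punctuation and function words

query = "持铁棍、斧子等工具对于某甲实施殴打,被害人尤某在帮助于某甲抵挡时被砍"
words = [w for w in segment(query, VOCABULARY) if w not in SKIPPED]
```

A production system would more likely rely on a trained segmenter; the point here is only that the output of S100 is a flat list of words that later steps can match against.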
S200: Extract focus words from the semantic splitting result, and perform factor indicator extraction on the semantic splitting result to obtain a factor vector, where a factor indicator is an indicator that affects the judgment result in a judgment document.
Focus words are generally the key words that characterize the main content of a judgment document. A focus word set can be built for such words from historical experience data, and the focus words are obtained by matching the preset focus word set against the word splitting result. Examples of focus words are beating, slashing, serious injury, minor injury, knife, and causing death. Factor indicators influence (determine) the judgment result of the whole judgment document, for example whether profit was sought, whether an infringement occurred, or whether injury was intentional. Factor indicators can likewise be selected by analyzing historical experience data. Because judgment documents generally share a uniform format and style of description, the body and conclusion sections are usually chosen for data analysis; factor indicators are selected from them and then judged qualitatively to obtain the factor vector, for example "profit sought: yes" and "death caused: no". Furthermore, samples of historical judgment documents can be analyzed to build a factor indicator system. The system uses a tree structure: it is divided into several major categories of factor indicators, each of which contains several minor categories.
S300: Input the focus words and the factor vector as features into a preset semantic hash vector model, read the code of the coding layer in the preset semantic hash vector model, and compress the code into a hash value.
The preset semantic hash vector model is built in advance. It can be obtained by training a semantic hash model on historical data; specifically, a deep neural network model can be trained on historical data to obtain the preset semantic hash vector model. Step S300 can be understood as a data compression process: a large amount of data is input into the preset semantic hash vector model, the code in the coding layer is read, and the input data is compressed into a hash value. For example, suppose step S200 yields 10,000 focus words and 50 factor vectors. These 10,050 features are input into the preset semantic hash vector model, and the encoding and compression of step S300 reduces them to a 16-dimensional or 32-dimensional hash value. The data volume is greatly reduced after compression, which facilitates subsequent processing.
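The last part of S300, turning the coding-layer output into a hash value, might look like the sketch below. The text does not specify how the code is compressed, so thresholding each activation at 0.5 (a common choice when the coding layer uses sigmoid units) is an assumption, as are the name `code_to_hash` and the sample activations. The trained model itself is out of scope here; the block starts from an already computed code.

```python
def code_to_hash(code, threshold=0.5):
    """Binarize coding-layer activations and pack the bits into an integer.

    `code` is a list of floats in [0, 1]; thresholding at 0.5 is an
    assumption, not something stated in the text.
    """
    bits = [1 if activation >= threshold else 0 for activation in code]
    value = 0
    for bit in bits:
        value = (value << 1) | bit  # first activation becomes the top bit
    return bits, value


# Hypothetical 16-dimensional coding-layer output.
code = [0.02, 0.10, 0.31, 0.93, 0.88, 0.07, 0.12, 0.45,
        0.51, 0.97, 0.03, 0.20, 0.08, 0.76, 0.91, 0.64]
bits, hash_value = code_to_hash(code)
```

Packing the bits into a single integer keeps each stored hash small and makes later comparisons a matter of bitwise operations.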
S400: Search the judgment document database for similar judgment documents according to the hash value to generate a candidate set of target judgment documents, where the judgment document database stores data characterizing the correspondence between hash values and judgment documents.
The judgment document database to be searched is built in advance. It stores a large number of judgment documents together with the hash value corresponding to each of them. Because the hash value is generated from the input features, and the input features accurately characterize the information to be retrieved, judgment documents similar to the information to be retrieved can be found in the database based on the hash value. Because the data has been compressed, many similar judgment documents may be found; they are gathered into the candidate set of target judgment documents. Continuing the example above, step S300 compresses the 10,050 features into a 16-dimensional hash value, and a search of the judgment document database with this hash value may find 1,000 similar judgment documents. Note that a similar judgment document may be a complete judgment document or part of one.
In one embodiment, the information to be retrieved is "First-instance civil judgment in the motor vehicle traffic accident liability dispute between Zhong Fengjian and Chen Dexiang and Zhang Haiyuan". It is split into 1,000 vectors that are input into the preset semantic hash vector model, yielding a 32-dimensional hash value, specifically [0 0 0 1 1 … 0 1 1 1]. The candidate set of target judgment documents found in the database with this 32-dimensional hash value includes: [0 0 0 1 1 … 0 1 1 1] First-instance civil judgment in the motor vehicle traffic accident liability dispute between Jiang Xueqin and Taiping Property Insurance Co., Ltd. Yichang Central Sub-branch and Shi Lei; [0 0 0 1 1 … 0 1 0 1] Civil ruling on the application for retrial in the general loan contract dispute between Zhang Han and Xiamen Jinyuan Financing Guarantee Co., Ltd. Using hash values therefore compresses the original data to a great extent and allows relatively similar candidates to be found quickly in a massive amount of data.
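The lookup in S400 can be sketched as a scan over stored hashes that keeps every document within a small Hamming distance of the query hash. The text only says that similar documents are found "according to the hash value", so the Hamming-distance criterion, the `max_distance` parameter, the in-memory dictionary standing in for the database, and the 8-bit sample hashes are all assumptions for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def find_candidates(query_hash, index, max_distance=1):
    """Return ids of stored documents whose hash is close to the query hash."""
    return [doc_id for doc_id, doc_hash in index.items()
            if hamming(query_hash, doc_hash) <= max_distance]


# Hypothetical index: document id -> 8-bit hash (a real hash would have
# 16 or 32 bits, as in the text).
index = {
    "traffic_accident_judgment": 0b00010111,
    "loan_contract_ruling":      0b00010101,
    "unrelated_document":        0b11100000,
}
candidates = find_candidates(0b00010111, index)
```

With `max_distance=1` the sketch returns both the exact match and the document whose hash differs in one bit, mirroring the two candidates listed in the embodiment above. A linear scan is used for clarity; a large database would normally bucket documents by hash instead.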
S500: Perform similarity matching between the information to be retrieved and each judgment document in the candidate set of target judgment documents to obtain the target judgment document.
Similarity matching is performed between the information to be retrieved and each judgment document in the candidate set, and the text with the highest matching degree, or with a matching degree greater than a preset threshold, is selected as the target judgment document. Because the candidate set is far smaller than the original database, matching the information to be retrieved against only the candidate set greatly reduces the amount of data processing while preserving retrieval accuracy, so the target judgment document is retrieved efficiently and accurately.
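The selection rule in S500 (highest matching degree, or matching degree above a preset threshold) can be sketched as follows. The text relies on a preset similarity matching model that it does not describe, so cosine similarity over word counts is used here purely as a stand-in; the names `cosine` and `best_match`, the sample words (English glosses for readability), and the threshold value are assumptions.

```python
import math
from collections import Counter


def cosine(words_a, words_b):
    """Cosine similarity between two bags of words."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_match(query_words, candidates, threshold=None):
    """Pick the most similar candidate, or all candidates above a threshold."""
    scored = [(cosine(query_words, words), doc_id)
              for doc_id, words in candidates.items()]
    if threshold is not None:
        return [doc_id for score, doc_id in scored if score > threshold]
    return max(scored)[1] if scored else None


# Hypothetical candidate set produced by the hash lookup.
candidates = {
    "doc_a": ["beating", "axe", "injury"],
    "doc_b": ["loan", "contract", "dispute"],
}
query_words = ["beating", "axe", "victim"]
target = best_match(query_words, candidates)
above = best_match(query_words, candidates, threshold=0.5)
```

Because this step only runs over the candidate set, a comparatively expensive similarity measure can be afforded here without scanning the whole database.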
In the above judgment document information retrieval method, the information to be retrieved is obtained and split into words based on semantics; focus words are extracted from the semantic splitting result, and factor indicators are extracted from it to obtain a factor vector; the focus words and the factor vector are input as features into the preset semantic hash vector model, the code of the coding layer is read and compressed into a hash value; similar judgment documents are searched for in the judgment document database according to the hash value to generate the candidate set of target judgment documents; and similarity matching between the information to be retrieved and each judgment document in the candidate set yields the target judgment document. Throughout the process, hash values are used to compress both the information to be retrieved and the data in the judgment document database. In the first stage, the hash value is used to locate the candidate set of target judgment documents; in the second stage, similarity matching finds the target judgment document within the candidate set. Hash compression significantly reduces the amount of data processing, and combining it with similarity matching keeps retrieval both efficient and accurate.
In one of the embodiments, performing factor indicator extraction on the semantic splitting result to obtain a factor vector includes: extracting the factor indicators associated with the semantic splitting result; and making a qualitative judgment on the extracted factor indicators according to the semantic splitting result to obtain the factor vector.
Factor indicators influence the final judgment result, for example whether a crime was committed, whether joint liability is borne, whether illegal appropriation occurred, or whether profit was sought. The indicators to extract can be set in advance by analyzing the text of historical judgments. Because judgment documents have a fixed format, the judgment result section states the factual basis of the judgment, and factor indicators can be extracted from these conventional factual bases. The indicators are then judged qualitatively, that is, it is determined whether the situation corresponding to each indicator exists, and the factor vector is obtained. The factor vector therefore contains two parts: the factor indicators and the qualitative judgment results. For example, if the factor indicators are whether a crime was committed, whether joint liability is borne, whether illegal appropriation occurred, and whether profit was sought, qualitative judgment of these indicators may give the factor vector: no crime committed, joint liability borne, illegal appropriation occurred, profit sought.
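A factor vector of this kind, pairing each indicator with a yes/no judgment, can be sketched with a simple rule table. The text does not say how the qualitative judgment is made, so the trigger-word rules below are an assumption, as are the indicator names, the trigger words (shown as English glosses), and the function name `factor_vector`.

```python
# Hypothetical rule table: indicator name -> words whose presence in the
# split result is taken to mean "yes" for that indicator.
FACTOR_RULES = {
    "crime_committed": {"murder", "robbery"},
    "joint_liability": {"jointly liable", "guarantor"},
    "illegal_appropriation": {"embezzled", "misappropriated"},
    "profit_sought": {"profit", "resold"},
}


def factor_vector(words, rules=FACTOR_RULES):
    """Judge each indicator qualitatively against the split words."""
    present = set(words)
    return {name: bool(triggers & present) for name, triggers in rules.items()}


split_words = ["guarantor", "misappropriated", "funds", "resold", "profit"]
vector = factor_vector(split_words)
numeric = [int(flag) for flag in vector.values()]  # form fed to the model
```

With these sample words the result matches the example in the text: no crime committed, joint liability borne, illegal appropriation occurred, profit sought.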
In one of the embodiments, extracting focus words from the semantic splitting result includes: obtaining a focus word set; and extracting the focus words from the semantic splitting result according to the focus word set.
The focus word set can be built in advance, for example by analyzing historical data to learn which words in judgment documents are focus words. Focus words are generally words that appear many times in judgment documents, such as beating, gun, knife, and slashing, and can be determined from word frequency. Further, the focus word set can be generated as follows: obtain samples of historical judgment documents; randomly select a single sample and extract from it the words whose word frequency is greater than a preset word frequency threshold, giving a set of candidate words; obtain the word frequency of each candidate word in the other historical judgment document samples and record it as the inverse word frequency; for each candidate word, calculate the product of its word frequency and its inverse word frequency, and select the words whose product is greater than a preset threshold to generate the focus word set.
In practice, high-frequency words are extracted from the historical judgment document samples. For any single judgment document, the word frequency of each high-frequency word is obtained together with its inverse word frequency in other judgment documents; the product of word frequency and inverse word frequency is calculated, and words whose product is greater than the preset value are added to the focus word set. Here "other" may mean all judgment documents except the currently selected one, or one other randomly selected judgment document used as the statistical sample for the inverse word frequency. For example, judgment document sample 1 and judgment document sample 2 are drawn from the historical judgment documents. The word frequency of each word in sample 1 is counted, giving high-frequency words A, B, and C; the word frequencies of A, B, and C in sample 2 are taken as their inverse word frequencies; the products of word frequency and inverse word frequency are calculated, and the words with larger products are selected as focus words. Repeating this operation finally generates the focus word set. In this embodiment, the focus word set takes both word frequency and inverse word frequency into account. The inverse word frequency addresses words that have a high frequency in a single judgment document but a low frequency in other judgment documents, such as certain modal particles; excluding the interference of such words allows the focus word set to be built accurately.
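The generation procedure can be sketched as below. It follows the definition given in the text, where the "inverse word frequency" is the word's count in the other samples and a word is kept when the product of the two counts exceeds a threshold; note that this differs from the inverse document frequency used in standard TF-IDF. For a deterministic result the sketch visits every sample in turn rather than picking one at random, and the function name, thresholds, and sample words (English glosses) are assumptions.

```python
from collections import Counter


def build_focus_words(samples, freq_threshold=1, product_threshold=1):
    """Collect words that are frequent in one sample and also occur in others."""
    focus = set()
    for i, sample in enumerate(samples):
        word_freq = Counter(sample)
        # "Inverse word frequency" as defined in the text: the count of the
        # word across all other samples.
        other_freq = Counter(word
                             for j, other in enumerate(samples) if j != i
                             for word in other)
        for word, count in word_freq.items():
            if count > freq_threshold and count * other_freq[word] > product_threshold:
                focus.add(word)
    return focus


# Hypothetical, already segmented samples; "ah" plays the modal particle.
samples = [
    ["beat", "beat", "knife", "ah", "ah"],
    ["beat", "knife", "knife", "injury"],
    ["loan", "loan", "contract"],
]
focus_words = build_focus_words(samples)
```

In this toy data "ah" and "loan" are frequent within one sample but absent elsewhere, so their product is zero and they are left out, which is the filtering effect the embodiment describes.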
As shown in Fig. 2, in one of the embodiments, before step S200 the method further includes:
S120: Clean the words obtained by semantic splitting by removing modal particles and enterprise names.
Enterprise names can be identified through database-based named entity recognition. The database stores common enterprise names and conventional grammar-based modal particles. During data cleaning, each split word is looked up in the database, and any word found there is filtered out. For example, suppose the information to be retrieved is: "The defendant Zhu had a dispute with Yu Moujia over a driving matter in front of the Meishang Furniture Factory in the UNK community, Hengxi Subdistrict, Jiangning District, Nanjing; Zhu later gathered others and went into the workshop of the Meishang Furniture Factory". Entity recognition identifies "Meishang Furniture Factory" among the split words as an enterprise name, and the word is filtered out. In this embodiment, cleaning the split words keeps unnecessary or worthless words out of the next step, which significantly reduces the amount of data to process and improves the efficiency of the whole solution.
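The cleaning step amounts to a lookup-and-filter pass, sketched below. Two in-memory sets stand in for the database of enterprise names and modal particles; their contents, the function name `clean`, and the sample word list are assumptions for illustration.

```python
def clean(words, modal_particles, enterprise_names):
    """Drop every split word found among the stored particles or names."""
    blocked = modal_particles | enterprise_names
    return [word for word in words if word not in blocked]


# Hypothetical stand-ins for the database contents.
MODAL_PARTICLES = {"啊", "吧", "呢"}
ENTERPRISE_NAMES = {"美尚家具厂"}

split_words = ["朱某", "美尚家具厂", "门前", "争执", "吧", "美尚家具厂", "车间"]
cleaned = clean(split_words, MODAL_PARTICLES, ENTERPRISE_NAMES)
```

Set membership keeps each lookup constant-time, so the cost of cleaning grows only with the number of split words.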
As shown in Fig. 2, in one of the embodiments, step S500 includes:
S520: Input the information to be retrieved and the candidate set of target judgment documents into a preset similarity matching model.
S540: Obtain the similarity between the information to be retrieved and each judgment document in the candidate set of target judgment documents.
S560: Select the judgment document with the highest similarity as the target judgment document.
The similarity matching model is built in advance and can accurately identify the similarity between input data. In this embodiment, using a similarity matching model determines the target judgment document quickly and accurately, which is convenient for the user.
It should be understood that although the steps in the flowcharts of Figs. 1 and 2 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be performed in other orders. Moreover, at least some of the steps in Figs. 1 and 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be performed at different times; nor are they necessarily performed one after another, and they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
As shown in Fig. 3, a judgment document information retrieval device includes:
a word splitting module 100, configured to obtain information to be retrieved and perform semantic-based word splitting on the information to be retrieved;
a factor extraction module 200, configured to extract focus words from the semantic splitting result, and to perform factor indicator extraction on the semantic splitting result to obtain a factor vector, where a factor indicator is an indicator that affects the judgment result in a judgment document;
an encoding compression module 300, configured to input the focus words and the factor vector as features into a preset semantic hash vector model, read the code of the coding layer in the preset semantic hash vector model, and compress the code into a hash value;
a search module 400, configured to search a judgment document database for similar judgment documents according to the hash value to generate a candidate set of target judgment documents, where the judgment document database stores data characterizing the correspondence between hash values and judgment documents;
a similarity matching module 500, configured to perform similarity matching between the information to be retrieved and each judgment document in the candidate set of target judgment documents to obtain a target judgment document.
In the above judgment document information retrieval device, the word splitting module 100 obtains the information to be retrieved and splits it into words based on semantics; the factor extraction module 200 extracts focus words from the semantic splitting result and extracts factor indicators from it to obtain a factor vector; the encoding compression module 300 inputs the focus words and the factor vector as features into the preset semantic hash vector model, reads the code of the coding layer, and compresses the code into a hash value; the search module 400 searches the judgment document database for similar judgment documents according to the hash value to generate the candidate set of target judgment documents; and the similarity matching module 500 performs similarity matching between the information to be retrieved and each judgment document in the candidate set to obtain the target judgment document. Throughout the process, hash values are used to compress both the information to be retrieved and the data in the judgment document database. In the first stage, the hash value is used to locate the candidate set of target judgment documents; in the second stage, similarity matching finds the target judgment document within the candidate set. Hash compression significantly reduces the amount of data processing, and combining it with similarity matching keeps retrieval both efficient and accurate.
In one of the embodiments, the factor extraction module 200 is further configured to extract the factor indicators associated with the semantic splitting result, and to make a qualitative judgment on the extracted factor indicators according to the semantic splitting result to obtain the factor vector.
In one of the embodiments, the factor extraction module is further configured to obtain a focus word set, and to extract the focus words from the semantic splitting result according to the focus word set.
In one of the embodiments, the factor extraction module is further configured to: obtain samples of historical judgment documents; randomly select a single sample and extract from it the words whose word frequency is greater than a preset word frequency threshold, giving a set of candidate words; obtain the word frequency of each candidate word in the other historical judgment document samples and record it as the inverse word frequency; and, for each candidate word, calculate the product of its word frequency and its inverse word frequency and select the words whose product is greater than a preset threshold to generate the focus word set.
In one of the embodiments, the above judgment document information retrieval device further includes a cleaning module, configured to clean the words obtained by semantic splitting by removing modal particles and enterprise names.
In one of the embodiments, the similarity matching module 500 is further configured to: input the information to be retrieved and the candidate set of target judgment documents into a preset similarity matching model; obtain the similarity between the information to be retrieved and each judgment document in the candidate set; and select the judgment document with the highest similarity as the target judgment document.
For the specific limitations of the judgment document information retrieval device, reference may be made to the limitations of the judgment document information retrieval method above, which are not repeated here. Each module in the judgment document information retrieval device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data. The network interface of the computer device is used to communicate with an external terminal over a network connection. When executed by the processor, the computer program implements a judgment document information retrieval method.
Those skilled in the art will understand that the structure shown in FIG. 4 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the judgment document information retrieval method provided in any embodiment of the present application.
One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the judgment document information retrieval method provided in any embodiment of the present application.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be completed by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A judgment document information retrieval method, comprising:
    obtaining information to be retrieved, and performing semantic-based word splitting on the information to be retrieved;
    extracting focus words from the semantic splitting result, and performing factor index extraction on the semantic splitting result to obtain a factor vector, wherein a factor index is an index that affects the judgment result in a judgment document;
    inputting the focus words and the factor vector as features into a preset semantic hash vector model, reading the code of the coding layer in the preset semantic hash vector model, and compressing the code into a hash value;
    searching a judgment document database for similar judgment documents according to the hash value to generate a set of target judgment documents to be selected, wherein the judgment document database stores data characterizing the correspondence between hash values and judgment documents; and
    performing similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain a target judgment document.
  2. The method according to claim 1, wherein performing factor index extraction on the semantic splitting result to obtain a factor vector comprises:
    extracting the factor indexes associated with the semantic splitting result; and
    performing, according to the semantic splitting result, a qualitative judgment on the extracted factor indexes to obtain the factor vector.
  3. The method according to claim 2, wherein extracting the factor indexes associated with the semantic splitting result comprises:
    obtaining main body data and conclusion data from historical judgment documents;
    selecting factor indexes according to the main body data and the conclusion data to construct a factor index set; and
    extracting the associated factor indexes from the factor index set according to the semantic splitting result.
  4. The method according to claim 1, wherein extracting focus words from the semantic splitting result comprises:
    obtaining a focus word set; and
    extracting the focus words from the semantic splitting result according to the focus word set.
  5. The method according to claim 4, wherein obtaining the focus word set comprises:
    obtaining historical judgment document samples;
    randomly selecting a single historical judgment document sample, and extracting from the selected sample the words whose word frequency is greater than a preset word frequency threshold to obtain a candidate word set;
    obtaining the word frequency of each word of the candidate word set in the other historical judgment document samples, and recording it as the inverse word frequency; and
    calculating, for each word in the candidate word set, the product of its word frequency and the corresponding inverse word frequency, and selecting the words whose product is greater than a preset threshold to generate the focus word set.
  6. The method according to claim 1, further comprising, before extracting focus words from the semantic splitting result and performing factor index extraction on the semantic splitting result to obtain a factor vector:
    cleaning the words obtained by semantic splitting by removing modal particles and enterprise names.
  7. The method according to claim 6, wherein cleaning the words obtained by semantic splitting by removing modal particles and enterprise names comprises:
    obtaining a preset database, wherein the preset database stores enterprise names and grammar-based modal particles; and
    searching and filtering the words obtained by semantic splitting according to the preset database, and removing the modal particles and enterprise names from those words.
  8. The method according to claim 1, wherein performing similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain a target judgment document comprises:
    inputting the information to be retrieved and the set of target judgment documents to be selected into a preset similarity matching model;
    obtaining the similarity between the information to be retrieved and each subset in the set of target judgment documents to be selected; and
    selecting the subset with the highest similarity as the target judgment document.
  9. The method according to claim 1, wherein performing similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain a target judgment document comprises:
    performing similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected; and
    selecting, as the target judgment document, the judgment document with the highest matching degree or a judgment document whose matching degree is greater than a preset threshold.
  10. A judgment document information retrieval device, comprising:
    a word splitting module, configured to obtain information to be retrieved and perform semantic-based word splitting on the information to be retrieved;
    a factor extraction module, configured to extract focus words from the semantic splitting result and perform factor index extraction on the semantic splitting result to obtain a factor vector, wherein a factor index is an index that affects the judgment result in a judgment document;
    an encoding compression module, configured to input the focus words and the factor vector as features into a preset semantic hash vector model, read the code of the coding layer in the preset semantic hash vector model, and compress the code into a hash value;
    a search module, configured to search a judgment document database for similar judgment documents according to the hash value to generate a set of target judgment documents to be selected, wherein the judgment document database stores data characterizing the correspondence between hash values and judgment documents; and
    a similarity matching module, configured to perform similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain a target judgment document.
  11. The device according to claim 10, wherein the factor extraction module is further configured to extract the factor indexes associated with the semantic splitting result; and perform, according to the semantic splitting result, a qualitative judgment on the extracted factor indexes to obtain the factor vector.
  12. The device according to claim 10, wherein the factor extraction module is further configured to: obtain historical judgment document samples; randomly select a single historical judgment document sample, and extract from the selected sample the words whose word frequency is greater than a preset word frequency threshold to obtain a candidate word set; obtain the word frequency of each word of the candidate word set in the other historical judgment document samples, and record it as the inverse word frequency; calculate, for each word in the candidate word set, the product of its word frequency and the corresponding inverse word frequency, and select the words whose product is greater than a preset threshold to generate a focus word set; and extract the focus words from the semantic splitting result according to the focus word set.
  13. The device according to claim 10, wherein the device further comprises a cleaning module, configured to clean the words obtained by semantic splitting by removing modal particles and enterprise names.
  14. The device according to claim 10, wherein the similarity matching module is further configured to: input the information to be retrieved and the set of target judgment documents to be selected into a preset similarity matching model; obtain the similarity between the information to be retrieved and each subset in the set of target judgment documents to be selected; and select the subset with the highest similarity as the target judgment document.
  15. A computer device, comprising a memory and one or more processors, wherein the memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    obtaining information to be retrieved, and performing semantic-based word splitting on the information to be retrieved;
    extracting focus words from the semantic splitting result, and performing factor index extraction on the semantic splitting result to obtain a factor vector, wherein a factor index is an index that affects the judgment result in a judgment document;
    inputting the focus words and the factor vector as features into a preset semantic hash vector model, reading the code of the coding layer in the preset semantic hash vector model, and compressing the code into a hash value;
    searching a judgment document database for similar judgment documents according to the hash value to generate a set of target judgment documents to be selected, wherein the judgment document database stores data characterizing the correspondence between hash values and judgment documents; and
    performing similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain a target judgment document.
  16. The computer device according to claim 15, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    extracting the factor indexes associated with the semantic splitting result; and
    performing, according to the semantic splitting result, a qualitative judgment on the extracted factor indexes to obtain the factor vector.
  17. The computer device according to claim 15, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    obtaining historical judgment document samples;
    randomly selecting a single historical judgment document sample, and extracting from the selected sample the words whose word frequency is greater than a preset word frequency threshold to obtain a candidate word set;
    obtaining the word frequency of each word of the candidate word set in the other historical judgment document samples, and recording it as the inverse word frequency;
    calculating, for each word in the candidate word set, the product of its word frequency and the corresponding inverse word frequency, and selecting the words whose product is greater than a preset threshold to generate a focus word set; and
    extracting the focus words from the semantic splitting result according to the focus word set.
  18. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining information to be retrieved, and performing semantic-based word splitting on the information to be retrieved;
    extracting focus words from the semantic splitting result, and performing factor index extraction on the semantic splitting result to obtain a factor vector, wherein a factor index is an index that affects the judgment result in a judgment document;
    inputting the focus words and the factor vector as features into a preset semantic hash vector model, reading the code of the coding layer in the preset semantic hash vector model, and compressing the code into a hash value;
    searching a judgment document database for similar judgment documents according to the hash value to generate a set of target judgment documents to be selected, wherein the judgment document database stores data characterizing the correspondence between hash values and judgment documents; and
    performing similarity matching between the information to be retrieved and each judgment document in the set of target judgment documents to be selected to obtain a target judgment document.
  19. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    extracting the factor indexes associated with the semantic splitting result; and
    performing, according to the semantic splitting result, a qualitative judgment on the extracted factor indexes to obtain the factor vector.
  20. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    obtaining historical judgment document samples;
    randomly selecting a single historical judgment document sample, and extracting from the selected sample the words whose word frequency is greater than a preset word frequency threshold to obtain a candidate word set;
    obtaining the word frequency of each word of the candidate word set in the other historical judgment document samples, and recording it as the inverse word frequency;
    calculating, for each word in the candidate word set, the product of its word frequency and the corresponding inverse word frequency, and selecting the words whose product is greater than a preset threshold to generate a focus word set; and
    extracting the focus words from the semantic splitting result according to the focus word set.
PCT/CN2019/122888 2019-04-16 2019-12-04 Written judgment information retrieval method and device, computer apparatus, and storage medium WO2020211393A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910303290.XA CN110134761A (en) 2019-04-16 2019-04-16 Adjudicate document information retrieval method, device, computer equipment and storage medium
CN201910303290.X 2019-04-16

Publications (1)

Publication Number Publication Date
WO2020211393A1 true WO2020211393A1 (en) 2020-10-22

Family

ID=67570221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122888 WO2020211393A1 (en) 2019-04-16 2019-12-04 Written judgment information retrieval method and device, computer apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN110134761A (en)
WO (1) WO2020211393A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium
CN111539022B (en) * 2020-04-27 2022-04-22 支付宝(杭州)信息技术有限公司 Feature matching method, target object identification method and related hardware
CN111581332A (en) * 2020-04-29 2020-08-25 山东大学 Similar judicial case matching method and system based on triple deep hash learning
CN111709252B (en) 2020-06-17 2023-03-28 北京百度网讯科技有限公司 Model improvement method and device based on pre-trained semantic model
CN113838457A (en) * 2020-06-24 2021-12-24 中兴通讯股份有限公司 Voice interaction method, electronic equipment and storage medium
CN111737420A (en) * 2020-08-07 2020-10-02 四川大学 Class case retrieval method, system, device and medium based on dispute focus
CN115134660A (en) * 2022-06-27 2022-09-30 中国平安人寿保险股份有限公司 Video editing method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609419A (en) * 2011-01-21 2012-07-25 北京世纪读秀技术有限公司 Similar data de-duplication method
CN103714118A (en) * 2013-11-22 2014-04-09 浙江大学 Book cross-reading method
CN105574063A (en) * 2015-08-24 2016-05-11 西安电子科技大学 Image retrieval method based on visual saliency
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123618B (en) * 2011-11-21 2016-09-14 北京新媒传信科技有限公司 Text similarity acquisition methods and device
CN103778163A (en) * 2012-10-26 2014-05-07 广州市邦富软件有限公司 Rapid webpage de-weight algorithm based on fingerprints
CN104239373B (en) * 2013-06-24 2019-02-01 腾讯科技(深圳)有限公司 Add tagged method and device for document
CN103425639A (en) * 2013-09-06 2013-12-04 广州一呼百应网络技术有限公司 Similar information identifying method based on information fingerprints
CN104199972B (en) * 2013-09-22 2018-08-03 中科嘉速(北京)信息技术有限公司 A kind of name entity relation extraction and construction method based on deep learning
JP2016042263A (en) * 2014-08-15 2016-03-31 富士通株式会社 Document management apparatus, document management program, and document management method
CN105786799A (en) * 2016-03-21 2016-07-20 成都寻道科技有限公司 Web article originality judgment method
CN106649661A (en) * 2016-12-13 2017-05-10 税云网络科技服务有限公司 Method and device for establishing knowledge base
CN106933787A (en) * 2017-03-20 2017-07-07 上海智臻智能网络科技股份有限公司 Adjudicate the computational methods of document similarity, search device and computer equipment
CN107784110B (en) * 2017-11-03 2020-07-03 北京锐安科技有限公司 Index establishing method and device
CN108255957A (en) * 2017-12-21 2018-07-06 杭州传送门网络科技有限公司 One kind recommends matching process based on Venture Capital field precision dataization
CN108573045B (en) * 2018-04-18 2021-12-24 同方知网数字出版技术股份有限公司 Comparison matrix similarity retrieval method based on multi-order fingerprints

Also Published As

Publication number Publication date
CN110134761A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
WO2020211393A1 (en) Written judgment information retrieval method and device, computer apparatus, and storage medium
CN110765275B (en) Search method, search device, computer equipment and storage medium
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN108829858A (en) Data query method, apparatus and computer readable storage medium
CN107463548B (en) Phrase mining method and device
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
US9390176B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
JP2020135853A (en) Method, apparatus, electronic device, computer readable medium, and computer program for determining descriptive information
CN109408578B (en) Monitoring data fusion method for heterogeneous environment
CN110309251B (en) Text data processing method, device and computer readable storage medium
CN114911917B (en) Asset meta-information searching method and device, computer equipment and readable storage medium
KR102334236B1 (en) Method and application of meaningful keyword extraction from speech-converted text data
CN111291177A (en) Information processing method and device and computer storage medium
CN112685475A (en) Report query method and device, computer equipment and storage medium
CN110543603A (en) Collaborative filtering recommendation method, device, equipment and medium based on user behaviors
KR102345401B1 (en) methods and apparatuses for content retrieval, devices and storage media
US11449676B2 (en) Systems and methods for automated document graphing
CN112597292B (en) Question reply recommendation method, device, computer equipment and storage medium
CN111984625B (en) Database load characteristic processing method and device, medium and electronic equipment
KR20150122855A (en) Distributed processing system and method for real time question and answer
CN113761161A (en) Text keyword extraction method and device, computer equipment and storage medium
CN110399464B (en) Similar news judgment method and system and electronic equipment
CN110888977B (en) Text classification method, apparatus, computer device and storage medium
CN112416754B (en) Model evaluation method, terminal, system and storage medium
CN112115362B (en) Programming information recommendation method and device based on similar code recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19925547

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/02/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19925547

Country of ref document: EP

Kind code of ref document: A1