WO2021042554A1 - Method and apparatus for archiving legal text, readable storage medium, and terminal device
- Publication number
- WO2021042554A1 (application PCT/CN2019/118148)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- server
- subset
- legal text
- legal
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/113—Details of archiving
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
Definitions
- This application belongs to the field of computer technology, and in particular relates to a method and device for filing legal texts, a computer non-volatile readable storage medium, and terminal equipment.
- the embodiments of the present application provide a legal text filing method, device, computer non-volatile readable storage medium, and terminal device to solve the problem that existing legal text filing methods consume substantial labor costs and are extremely inefficient.
- the first aspect of the embodiments of the present application provides a method for filing legal documents, which may include:
- the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
- Target server is a server for archiving the legal text
- a subset of auxiliary words is selected from the set of words; the subset of auxiliary words includes each word whose ratio of the first word frequency to the second word frequency is greater than a preset third threshold, where the first word frequency is the frequency with which the word appears in the legal text and the second word frequency is the frequency with which it appears in the legal text database corresponding to the target server;
- the second aspect of the embodiments of the present application provides a legal text filing device, which may include modules for implementing the steps of the above-mentioned legal text filing method.
- a third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above-mentioned legal text filing method.
- the fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the steps of the above-mentioned legal text filing method when executing the computer-readable instructions.
- the legal text is archived into a disk partition of the appropriate server according to its actual core content.
- when the user needs to query related materials, the user only needs to search in the disk partition of the corresponding server.
- this greatly reduces labor cost and improves work efficiency.
- FIG. 1 is a flowchart of an embodiment of a method for filing legal text in an embodiment of the application
- FIG. 2 is a schematic flowchart of selecting a core word subset from a word set
- FIG. 3 is a schematic flowchart of determining a target server according to a core word subset
- FIG. 4 is a schematic flowchart of the setting process of the first word list
- FIG. 5 is a schematic flowchart of determining the category of the legal text in the target server according to the auxiliary word subset
- FIG. 6 is a structural diagram of an embodiment of a legal text filing device in an embodiment of the application.
- FIG. 7 is a schematic block diagram of a terminal device in an embodiment of the application.
- an embodiment of a method for filing legal texts in an embodiment of the present application may include:
- Step S101 Receive a legal text filing instruction, extract the target address in the legal text filing instruction, and obtain the legal text in the target address.
- the legal texts include, but are not limited to, texts in legal provisions, legal essays, legal reports, legal analysis articles, indictments, rulings, and other legal-related materials.
- the legal text filing instruction carries the address where the legal text is currently located, that is, the target address.
- the target address may be a storage address in the terminal device, in the network, or in a designated database.
- the terminal device is the implementation subject of this embodiment. After receiving the legal text filing instruction, the terminal device extracts the target address from it and, according to the target address, obtains the legal text from local storage, the network, or the designated database.
- Step S102 Perform word segmentation processing on the legal text to obtain a set of words constituting the legal text.
- after obtaining the legal text, the terminal device first performs word segmentation processing on it to obtain the set of words that constitute the legal text.
- Word segmentation refers to dividing the legal text into individual words.
- a general dictionary and a legal dictionary can be combined to segment the legal text: the legal dictionary is used for a first round of segmentation, and the general dictionary is then used to segment the text remaining after the first round.
- in other words, legal-specific terms are segmented first and general terms second, and finally the individual words are separated out.
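The two-round segmentation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the matching strategy (greedy forward maximum matching), the function names `max_match` and `segment`, and the choice to split any unmatched remainder into single characters are all assumptions made for the example.

```python
def max_match(text, dictionary, max_len):
    """Greedy forward maximum matching (an assumed strategy).

    Returns (piece, matched) pairs, where unmatched runs are kept together.
    """
    out, buf, i = [], "", 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                if buf:  # flush the unmatched run collected so far
                    out.append((buf, False))
                    buf = ""
                out.append((text[i:i + n], True))
                i += n
                break
        else:  # no dictionary entry starts at position i
            buf += text[i]
            i += 1
    if buf:
        out.append((buf, False))
    return out


def segment(text, legal_dict, general_dict):
    """Round 1: legal dictionary. Round 2: general dictionary on the rest."""
    legal_len = max(map(len, legal_dict), default=1)
    general_len = max(map(len, general_dict), default=1)
    words = []
    for piece, matched in max_match(text, legal_dict, legal_len):
        if matched:
            words.append(piece)
            continue
        for sub, ok in max_match(piece, general_dict, general_len):
            # Assumption: anything still unmatched becomes single characters.
            words.extend([sub] if ok else list(sub))
    return words


legal_terms = {"不当得利"}      # hypothetical legal dictionary
general_terms = {"引起", "纠纷"}  # hypothetical general dictionary
print(segment("因不当得利引起的纠纷", legal_terms, general_terms))
```

Running the legal dictionary first keeps a multi-character legal term intact even when the general dictionary contains shorter words that overlap it.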
- Step S103 Select a core word subset from the word set.
- the core word subset includes each word whose term density is greater than the preset first threshold and the uniformity is greater than the preset second threshold.
- step S103 may specifically include the following steps:
- Step S1031 Calculate the term density of each word in the word set.
- the term density of each word in the word set can be calculated according to the following formula:
- $WdNum_w$ is the number of times the $w$-th word in the word set appears in the legal text
- $LineNum$ is the total number of lines of the legal text
- $WdDensity_w$ is the term density of the $w$-th word in the word set.
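The formula itself is not reproduced in this text, so the sketch below assumes the most direct reading of the definitions: the term density of a word is its number of occurrences divided by the total number of lines. The function name `term_density` is hypothetical.

```python
from collections import Counter


def term_density(words, line_num):
    """Assumed formula: WdDensity_w = WdNum_w / LineNum."""
    if line_num <= 0:
        raise ValueError("line_num must be positive")
    counts = Counter(words)  # WdNum_w for every distinct word
    return {word: count / line_num for word, count in counts.items()}


# Four segmented words taken from a two-line text.
print(term_density(["合同", "纠纷", "合同", "合同"], line_num=2))
```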
- Step S1032 Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph.
- every $KN$ lines of the legal text can be regarded as one text paragraph: lines 1 to $KN$ form the first text paragraph, lines $KN+1$ to $2 \times KN$ form the second, lines $2 \times KN+1$ to $3 \times KN$ form the third, and so on.
- the number of text paragraphs is then $FN = \mathrm{Ceil}(LineNum / KN)$, where Ceil is the round-up function.
- the value of $KN$ can be set according to specific conditions, for example 3, 5, 10, or another value.
- Step S1033 Calculate the uniformity of each word in the word set respectively.
- the uniformity of each word in the word set can be calculated according to the following formula:
- $f$ is the serial number of each text paragraph of the legal text, $1 \le f \le FN$; $Flag_{w,f}$ is the flag bit indicating whether the $w$-th word in the word set occurs in the $f$-th text paragraph; and $WdEqu_w$ is the uniformity of the $w$-th word in the word set.
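The paragraph split and the uniformity measure can be illustrated together. The uniformity formula is not reproduced in this text; the sketch assumes that the flag is 1 when the word occurs in a paragraph and 0 otherwise, and that uniformity is the share of paragraphs in which the word occurs. The function names are hypothetical, and each line is assumed to be a list of already segmented words.

```python
import math


def split_paragraphs(lines, kn):
    """Group every kn lines into one text paragraph; FN = Ceil(LineNum / KN)."""
    fn = math.ceil(len(lines) / kn)
    return [lines[f * kn:(f + 1) * kn] for f in range(fn)]


def uniformity(word, paragraphs):
    """Assumed formula: WdEqu_w = (sum over f of Flag_{w,f}) / FN.

    Expects at least one paragraph.
    """
    flags = [int(any(word in line for line in para)) for para in paragraphs]
    return sum(flags) / len(paragraphs)


lines = [["a"], ["b"], ["a", "c"], ["d"], ["a"]]  # five segmented lines
paragraphs = split_paragraphs(lines, kn=2)        # three paragraphs
print(len(paragraphs), uniformity("a", paragraphs), uniformity("b", paragraphs))
```

Under this reading, a word that appears in every paragraph scores 1, while a word concentrated in a single paragraph scores 1/FN, so the measure rewards words spread across the whole text.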
- Step S1034 Select each word with a term density greater than the first threshold and a uniformity greater than the second threshold from the word set to form the core word subset.
- the specific values of the first threshold and the second threshold may be set according to actual conditions.
- the following term density sequence can first be constructed, with values ordered from largest to smallest:
- $DensitySet = \{WdDensity_1, WdDensity_2, \ldots, WdDensity_w, \ldots, WdDensity_{WN}\}$
- $DensitySet$ is the term density sequence.
- $MaxDensitySet = \{MaxWdDensity_1, MaxWdDensity_2, \ldots, MaxWdDensity_{nmax}, \ldots, MaxWdDensity_{MaxNum}\}$
- $MaxDensitySet$ is the maximum term density sequence
- $MaxNum$ is the number of values in the maximum term density sequence
- $MaxNum = WN \times \eta_1$
- $\eta_1$ is the first selection ratio, which can be set according to actual conditions to 0.2, 0.3, 0.4, or another value
- $nmax$ is the serial number of a value in the maximum term density sequence
- $MaxWdDensity_{nmax}$ is the $nmax$-th value of the maximum term density sequence.
- $MinDensitySet = \{MinWdDensity_1, MinWdDensity_2, \ldots, MinWdDensity_{nmin}, \ldots, MinWdDensity_{MinNum}\}$
- $MinDensitySet$ is the minimum term density sequence
- $MinNum$ is the number of values in the minimum term density sequence
- $MinNum = WN \times \eta_2$
- $\eta_2$ is the second selection ratio, which can be set according to actual conditions to 0.2, 0.3, 0.4, or another value
- $nmin$ is the serial number of a value in the minimum term density sequence
- $MinWdDensity_{nmin}$ is the $nmin$-th value of the minimum term density sequence.
- $MidDensitySet = \{MidWdDensity_1, MidWdDensity_2, \ldots, MidWdDensity_{nmid}, \ldots, MidWdDensity_{MidNum}\}$
- $MidDensitySet$ is the median term density sequence
- $MidDensitySet = DensitySet - MaxDensitySet - MinDensitySet$
- $MidNum$ is the number of values in the median term density sequence
- $MidNum = WN \times (1 - \eta_1 - \eta_2)$
- $nmid$ is the serial number of a value in the median term density sequence, $1 \le nmid \le MidNum$
- $MidWdDensity_{nmid}$ is the $nmid$-th value in the median term density sequence.
- the setting process of the second threshold is similar to that of the first threshold: it is only necessary to replace the term density with the uniformity. For details, refer to the above content, which is not repeated here.
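The split of the sorted sequence into maximum, median, and minimum parts can be sketched as below. The text here does not state how the threshold is finally derived from the three sequences, so the sketch stops at the split. The function name `partition_by_ratio` is hypothetical, and truncating the products toward zero is an assumption, since the rounding rule is not given.

```python
def partition_by_ratio(values, ratio1=0.2, ratio2=0.2):
    """Split values, sorted from largest to smallest, into three sequences.

    ratio1 and ratio2 play the role of the first and second selection ratios.
    """
    ordered = sorted(values, reverse=True)  # the full sequence, largest first
    wn = len(ordered)
    max_num = int(wn * ratio1)              # assumed rounding: truncate
    min_num = int(wn * ratio2)
    max_set = ordered[:max_num]
    min_set = ordered[wn - min_num:]
    mid_set = ordered[max_num:wn - min_num]  # everything in between
    return max_set, mid_set, min_set


max_set, mid_set, min_set = partition_by_ratio(range(1, 11), 0.2, 0.2)
print(max_set, mid_set, min_set)
```

The same helper applies unchanged to the uniformity values when the second threshold is set.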
- Step S104 Select a target server from a preset server group according to the core word subset.
- the target server is a server used to archive the legal text.
- the server group may include three servers, which are respectively used to archive legal texts in the three legal fields of civil, criminal, and administrative.
- step S104 may specifically include:
- Step S1041 Query, in the preset first word list, the first feature vector of each word in the core word subset.
- the first feature vector of each word is composed of components in $T$ dimensions, and each dimension corresponds to the feature value of one server.
- $T$ is an integer greater than 1.
- for the three-server group described above, $T = 3$.
- Step S10411 Perform word segmentation processing on each legal text in the preset general legal text database to obtain each word that composes the general legal text database.
- the general legal text database includes the legal text databases corresponding to the various legal fields.
- the general legal text database contains as many legal texts as possible from a certain statistical time period.
- the statistical time period can be set according to the actual situation, for example to the week, month, quarter, or year preceding the current moment.
- all legal texts in the general legal text database are divided into several legal text databases according to the legal field to which they belong, and each legal text database corresponds to one legal field.
- for example, the general legal text database can be divided into a civil law text database, a criminal law text database, an administrative law text database, and so on.
- each legal text database also corresponds to the server that archives texts of that legal field.
- the word segmentation process is similar to that in step S102; for details, refer to the description of step S102, which is not repeated here.
- Step S10412 Count the number of occurrences, in each legal text database, of each word composing the general legal text database.
- these counts can be recorded as a sequence in the following format:
- $WNSeq_{sw} = (WordNum_{sw,1}, WordNum_{sw,2}, \ldots, WordNum_{sw,t}, \ldots, WordNum_{sw,T})$
- $t$ is the serial number of each server in the server group (that is, the serial number of the legal text database)
- $sw$ is the serial number of each word composing the general legal text database, $1 \le sw \le SWN$
- $SWN$ is the total number of words that make up the general legal text database
- $WordNum_{sw,t}$ is the number of times the $sw$-th word composing the general legal text database appears in the legal text database corresponding to the $t$-th server
- $WNSeq_{sw}$ is the sequence of the numbers of occurrences of the $sw$-th word in each legal text database.
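The counting in step S10412 can be sketched as follows. This is only an illustration under assumed data shapes: each legal text database is taken to be a list of texts, each text a list of already segmented words, and the function name `occurrence_sequences` is hypothetical.

```python
from collections import Counter


def occurrence_sequences(libraries):
    """Map each word to its tuple of counts, one entry per legal text database."""
    per_library = [
        Counter(word for text in library for word in text)
        for library in libraries
    ]
    vocabulary = set().union(*per_library)  # every word seen in any database
    # Counter returns 0 for words that never occur in a given database.
    return {
        word: tuple(counts[word] for counts in per_library)
        for word in sorted(vocabulary)
    }


civil = [["合同", "纠纷"], ["合同"]]
criminal = [["盗窃"]]
administrative = [["行政", "纠纷"]]
print(occurrence_sequences([civil, criminal, administrative]))
```

Each tuple has one position per server, in server order, which matches the shape of the sequence defined above.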
- Step S10413 Calculate the feature value corresponding to each server and each word composing the general legal text database.
- these feature values can be calculated according to the following formula:
- ln is the natural logarithm function
- $EigVal_{sw,t}$ is the feature value corresponding to the $t$-th server and the $sw$-th word composing the general legal text database.
- Step S10414 Construct the first feature vector of each word composing the general legal text database.
- the first feature vector of each word composing the general legal text database can be constructed according to the following formula:
- $EigVec_{sw} = (EigVal_{sw,1}, EigVal_{sw,2}, \ldots, EigVal_{sw,t}, \ldots, EigVal_{sw,T})$
- $EigVec_{sw}$ is the first feature vector of the $sw$-th word composing the general legal text database.
- Step S10415 Assemble the first feature vectors of all words composing the general legal text database into the first word list.
- Step S1042 Calculate, according to the first feature vector of each word in the core word subset, the probability value of the legal text being filed into each server in the server group.
- the probability value of the legal text being filed into each server in the server group can be calculated according to the following formula:
- $c$ is the serial number of each word in the core word subset, $1 \le c \le CoreNum$; $CoreNum$ is the number of words in the core word subset; $EigVal_{c,t}$ is the feature value of the $c$-th word in the core word subset corresponding to the $t$-th server; and $LawDom_t$ is the probability value of the legal text being filed into the $t$-th server.
- Step S1043 Determine the server with the largest probability value as the target server.
- the target server can be selected according to the following formula:
- $TgtLawDom = \mathrm{Argmax}(LawDomSq)$
- Argmax is the function that returns the argument (here, the serial number) at which the maximum value occurs
- $LawDomSq$ is the first probability value sequence of the legal text
- $LawDomSq = (LawDom_1, LawDom_2, \ldots, LawDom_t, \ldots, LawDom_T)$
- $TgtLawDom$ is the serial number of the target server.
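The selection in step S1043 amounts to an argmax over the probability value sequence. A minimal sketch follows; the function name `target_serial` is hypothetical, serial numbers are taken to start at 1 as in the text, and resolving ties in favor of the lowest serial number is an assumption.

```python
def target_serial(probabilities):
    """Return the 1-based serial number of the largest probability value."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    # max() keeps the first maximal index, so ties go to the lowest serial.
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best_index + 1


law_dom_sq = (0.2, 0.5, 0.3)  # illustrative values for three servers
print(target_serial(law_dom_sq))
```

The same helper would serve for choosing a disk partition from a sequence of per-partition probability values.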
- Step S105 Select an auxiliary word subset from the word set.
- the auxiliary word subset includes each word whose ratio of the first word frequency to the second word frequency is greater than a preset third threshold, where the first word frequency is the frequency with which the word appears in the legal text and the second word frequency is the frequency with which it appears in the legal text database corresponding to the target server.
- the first word frequency of each word in the word set can be calculated separately according to the following formula:
- FstFrq w is the first word frequency of the wth word in the word set.
- LibWdNum w is the number of times the w-th word in the word set appears in the legal text database corresponding to the target server
- SndFrq w is the second word frequency of the w-th word in the word set.
- each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold is selected from the word set to form the auxiliary word subset.
- the process of setting the third threshold is similar to that of the first threshold: it is only necessary to replace the term density with the ratio of the first word frequency to the second word frequency. For details, refer to the above content, which is not repeated here.
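The auxiliary word selection can be sketched as below. The two frequency formulas are not reproduced in this text, so the sketch assumes each frequency is the word's count divided by the total number of words in the legal text or in the target server's legal text database, respectively. The function name `auxiliary_words` is hypothetical, and skipping words that never occur in the database (where the ratio is undefined) is also an assumption.

```python
from collections import Counter


def auxiliary_words(text_words, library_words, third_threshold):
    """Select words whose first-to-second word frequency ratio exceeds the threshold."""
    text_counts = Counter(text_words)
    library_counts = Counter(library_words)
    selected = set()
    for word, count in text_counts.items():
        if library_counts[word] == 0:
            continue  # assumption: ratio undefined, so the word is skipped
        first_freq = count / len(text_words)                      # assumed FstFrq_w
        second_freq = library_counts[word] / len(library_words)   # assumed SndFrq_w
        if first_freq / second_freq > third_threshold:
            selected.add(word)
    return selected


text = ["租赁", "合同", "租赁", "的"]
library = ["的"] * 6 + ["合同"] * 3 + ["租赁"]
print(auxiliary_words(text, library, third_threshold=2.0))
```

A word that is common in this text but rare in the server's database gets a high ratio, which is what makes it useful for telling categories apart within one legal field.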
- Step S106 Determine the target partition of the legal text in the target server according to the auxiliary word subset.
- the target partition is a disk partition used to archive the legal text.
- each legal field can be subdivided into multiple categories. Taking the civil legal field as an example, it can be divided into eight categories, including: (1) Disputes between citizens, or between citizens and legal persons, arising from property rights, mostly disputes over the possession, use, profit, and disposal of property. (2) Disputes between citizens arising from contractual acts such as sale, lease, loan, gift, and pawn, and disputes arising from inheritance. (3) Debt disputes arising from unjust enrichment, management of another's affairs without mandate, and the like, and compensation disputes caused by damage to property. (4) Disputes arising from personal rights, mainly infringement of citizens' rights to health, name, reputation, honor, and portrait.
- each server can be divided into a number of disk partitions, and each disk partition is used to archive a certain type of legal text.
- step S106 may specifically include:
- Step S1061 Query, in the preset second word list, the second feature vector of each word in the auxiliary word subset.
- the second feature vector of each word is composed of components of ST dimensions, each dimension corresponds to a feature value of a disk partition, and ST is the total number of disk partitions in the target server.
- the setting process of the second word list is similar to the setting process of the first word list shown in FIG. 4, and the legal text library corresponding to the target server includes legal text sub-libraries corresponding to each disk partition. Firstly, count the number of occurrences of each word in the legal text library in each legal text sub-database, and then calculate the feature value of each word and each disk partition in the target server according to the following formula:
- $st$ is the serial number of a disk partition in the target server, $1 \le st \le ST$
- $WordNum_{sw,st}$ is the number of times the $sw$-th word composing the legal text database appears in the legal text sub-database corresponding to the $st$-th disk partition in the target server
- $EigVal_{sw,st}$ is the feature value corresponding to the $sw$-th word composing the legal text database and the $st$-th disk partition in the target server.
- $SdEigVec_{sw}$ is the second feature vector of the $sw$-th word composing the legal text database.
- Step S1062 Calculate, according to the second feature vector of each word in the auxiliary word subset, the probability value of the legal text belonging to each disk partition in the target server.
- the probability value of the legal text belonging to each disk partition in the target server can be calculated according to the following formula:
- $sub$ is the serial number of each word in the auxiliary word subset, $1 \le sub \le SubNum$; $SubNum$ is the number of words in the auxiliary word subset; $EigVal_{sub,st}$ is the feature value of the $sub$-th word in the auxiliary word subset corresponding to the $st$-th disk partition in the target server; and $LawType_{st}$ is the probability value of the legal text belonging to the $st$-th disk partition in the target server.
- Step S1063 Determine the disk partition with the largest probability value as the target partition of the legal text in the target server.
- the target partition of the legal text in the target server can be selected according to the following formula:
- Step S107 File the legal text into the target partition in the target server.
- in summary, after the relevant instruction is received, the legal text can be obtained automatically. A core word subset that effectively represents the legal text is selected from it by automatic text analysis and used as the basis for determining the server that archives the legal text (i.e., the target server). An auxiliary word subset is then selected from the word set and used to determine the disk partition for filing the legal text (i.e., the target partition), and the legal text is filed into the target partition in the target server.
- FIG. 6 shows a structural diagram of an embodiment of a legal text filing device provided in an embodiment of the present application.
- a legal document filing device may include:
- the legal text obtaining module 601 is configured to receive a legal text filing instruction, extract the target address in the legal text filing instruction, and obtain the legal text in the target address;
- the word segmentation processing module 602 is configured to perform word segmentation processing on the legal text to obtain a set of words that make up the legal text;
- the core word subset selection module 603 is configured to select a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
- the target server determining module 604 is configured to select a target server from a preset server group according to the core word subset, where the target server is a server for archiving the legal text;
- the auxiliary word subset selection module 605 is configured to select an auxiliary word subset from the word set, the auxiliary word subset including each word whose ratio of the first word frequency to the second word frequency is greater than the preset third threshold, where the first word frequency is the frequency of appearance in the legal text and the second word frequency is the frequency of appearance in the legal text database corresponding to the target server;
- a partition determining module 606, configured to determine a target partition of the legal text in the target server according to the auxiliary word subset, where the target partition is a disk partition for archiving the legal text;
- the archiving module 607 is configured to archive the legal text into the target partition in the target server.
- the core word subset selection module may include:
- the term density calculation unit is used to calculate the term density of each word in the word set
- the text paragraph dividing unit is used to divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, and FN is an integer greater than one;
- a uniformity calculation unit for calculating the uniformity of each word in the word set
- the core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.
- the target server determining module may include:
- the first feature vector query unit is configured to query the first feature vector of each word in the core word subset in the preset first word list;
- a probability value calculation unit configured to calculate the probability value of the legal text filed into each server in the server group according to the first feature vector of each word in the core word subset;
- the target server determining unit is configured to determine the server with the largest probability value as the target server.
- auxiliary word subset selection module may include:
- the first word frequency calculation unit is configured to calculate the first word frequency of each word in the word set
- the second word frequency calculation unit is configured to calculate the second word frequency of each word in the word set
- the auxiliary word subset selection unit is configured to select, from the word set, each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold value to form the auxiliary word subset.
- FIG. 7 shows a schematic block diagram of a terminal device provided by an embodiment of the present application. For ease of description, only parts related to the embodiment of the present application are shown.
- the terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
- the terminal device 7 may include: a processor 70, a memory 71, and computer-readable instructions 72 stored in the memory 71 and executable on the processor 70, for example, computer-readable instructions for executing the above-mentioned legal text filing method.
- when the processor 70 executes the computer-readable instructions 72, the steps in the foregoing legal text filing method embodiments are implemented.
- Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Abstract
A method and apparatus for archiving a legal text, a non-volatile computer readable storage medium, and a terminal device. The method comprises: receiving a legal text archiving instruction, extracting a target address in the legal text archiving instruction, and obtaining the legal text in the target address (S101); performing word segmentation on the legal text to obtain a word set constituting the legal text (S102); selecting a core word subset from the word set (S103); selecting a target server from a preset server group according to the core word subset (S104); selecting an auxiliary word subset from the word set (S105), the auxiliary word subset comprising words of which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold; determining a target partition of the legal text in the target server according to the auxiliary word subset (S106); and archiving the legal text into the target partition in the target server (S107).
Description
This application claims priority to Chinese patent application No. 201910826813.9, filed with the Chinese Patent Office on September 3, 2019 and entitled "A legal text filing method, device, readable storage medium and terminal device", the entire content of which is incorporated into this application by reference.
This application belongs to the field of computer technology, and in particular relates to a legal text filing method and device, a computer non-volatile readable storage medium, and a terminal device.
In courts, law firms, and similar institutions, large numbers of legal texts often need to be filed promptly to facilitate subsequent queries. The prior art provides a variety of methods for filing these legal texts; for example, they can be filed by handler, handling unit, or handling date. Although such filing methods make the legal texts look orderly, they do not take into account the inherent relevance among the texts and are inconvenient for querying: when users need to find related materials, they often have to check the texts one by one, which consumes substantial labor costs and is extremely inefficient.
In view of this, the embodiments of the present application provide a legal text filing method, device, computer non-volatile readable storage medium, and terminal device to solve the problem that existing legal text filing methods consume substantial labor costs and are extremely inefficient.
A first aspect of the embodiments of the present application provides a legal text archiving method, which may include:
receiving a legal text archiving instruction, extracting the target address from the legal text archiving instruction, and obtaining the legal text at the target address;
performing word segmentation on the legal text to obtain the word set that makes up the legal text;
selecting a core word subset from the word set, the core word subset including the words whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
selecting a target server from a preset server group according to the core word subset, the target server being the server used to archive the legal text;
selecting an auxiliary word subset from the word set, the auxiliary word subset including the words for which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold, the first word frequency being the frequency of occurrence in the legal text, and the second word frequency being the frequency of occurrence in the legal text library corresponding to the target server;
determining a target partition for the legal text in the target server according to the auxiliary word subset, the target partition being the disk partition used to archive the legal text;
archiving the legal text into the target partition in the target server.
A second aspect of the embodiments of the present application provides a legal text archiving apparatus, which may include modules for implementing the steps of the above legal text archiving method.
A third aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above legal text archiving method.
A fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above legal text archiving method when executing the computer-readable instructions.
In the embodiments of the present application, legal texts are archived into the disk partitions of the individual servers according to their actual core content. When a user needs to look up related materials, it is only necessary to search the disk partition of the corresponding server, which saves labor and greatly improves work efficiency.
FIG. 1 is a flowchart of an embodiment of a legal text archiving method in the embodiments of the present application;
FIG. 2 is a schematic flowchart of selecting a core word subset from the word set;
FIG. 3 is a schematic flowchart of determining the target server according to the core word subset;
FIG. 4 is a schematic flowchart of the process of setting the first word list;
FIG. 5 is a schematic flowchart of determining the category of the legal text in the target server according to the auxiliary word subset;
FIG. 6 is a structural diagram of an embodiment of a legal text archiving apparatus in the embodiments of the present application;
FIG. 7 is a schematic block diagram of a terminal device in the embodiments of the present application.
Referring to FIG. 1, an embodiment of a legal text archiving method in the embodiments of the present application may include:
Step S101: Receive a legal text archiving instruction, extract the target address from the legal text archiving instruction, and obtain the legal text at the target address.
The legal text includes, but is not limited to, text in law-related materials such as legal provisions, legal papers, legal news reports, legal analysis articles, and court indictments and rulings.
When a user needs to store a legal text, the user can issue a legal text storage instruction to a preset terminal device through a human-computer interaction interface. The legal text storage instruction carries the address where the legal text is currently located, that is, the target address. The target address may be a storage address in the terminal device, or a storage address on the network or in a designated database. The terminal device is the executing entity of this embodiment. After receiving the legal text storage instruction, the terminal device can extract the target address from it and, according to the target address, obtain the legal text locally, from the network, or from the designated database.
Step S102: Perform word segmentation on the legal text to obtain the word set that makes up the legal text.
When archiving the legal text, the terminal device first performs word segmentation on it to obtain the word set that makes up the legal text. Word segmentation means splitting the legal text into individual words. In this embodiment, the legal text can be segmented by combining a general dictionary with a legal dictionary: the legal dictionary is used for a first round of segmentation, and the general dictionary is then used to segment the text left over after the first round. In this way, legal terms are segmented out first, followed by general words; any text that matches neither a legal term nor a general word is split into single characters.
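The two-round, dictionary-first segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the embodiment's actual implementation: forward maximum matching is assumed as the matching strategy (the text does not name one), and the names `longest_match` and `segment` as well as the tiny vocabularies in the usage note are hypothetical.

```python
def longest_match(text, start, vocab, max_len):
    """Return the longest vocabulary entry that starts at text[start], or None."""
    for size in range(min(max_len, len(text) - start), 0, -1):
        candidate = text[start:start + size]
        if candidate in vocab:
            return candidate
    return None


def segment(text, legal_vocab, general_vocab):
    """Two-round segmentation: legal dictionary first, then general dictionary."""
    max_legal = max(map(len, legal_vocab), default=0)
    max_general = max(map(len, general_vocab), default=0)

    # Round 1: cut out legal terms; collect everything else as leftover runs.
    pieces = []  # list of (chunk, already_a_word)
    leftover = ""
    i = 0
    while i < len(text):
        word = longest_match(text, i, legal_vocab, max_legal)
        if word:
            if leftover:
                pieces.append((leftover, False))
                leftover = ""
            pieces.append((word, True))
            i += len(word)
        else:
            leftover += text[i]
            i += 1
    if leftover:
        pieces.append((leftover, False))

    # Round 2: segment the leftover runs with the general dictionary,
    # falling back to single characters when nothing matches.
    words = []
    for chunk, done in pieces:
        if done:
            words.append(chunk)
            continue
        j = 0
        while j < len(chunk):
            word = longest_match(chunk, j, general_vocab, max_general) or chunk[j]
            words.append(word)
            j += len(word)
    return words
```

For example, with the hypothetical vocabularies `{"不当得利", "赔偿"}` (legal) and `{"纠纷", "引起"}` (general), `segment("因不当得利引起的赔偿纠纷", ...)` yields the legal terms first, then the general words, with "因" and "的" left as single characters.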
Step S103: Select a core word subset from the word set.
The core word subset includes the words whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold.
As shown in FIG. 2, step S103 may specifically include the following steps:
Step S1031: Calculate the term density of each word in the word set.
Specifically, the term density of each word in the word set can be calculated according to the following formula:
where w is the index of a word in the word set, 1 ≤ w ≤ WN, WN is the number of words in the word set, WdNum_w is the number of times the w-th word in the word set appears in the legal text, LineNum is the total number of lines in the legal text, and WdDensity_w is the term density of the w-th word in the word set.
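As a rough illustration of step S1031, the sketch below computes a term density per word. The formula itself is not reproduced in this text, so the code assumes the simplest reading of the definitions above, namely occurrences divided by total lines (WdNum_w / LineNum); the function name `term_density` is hypothetical.

```python
from collections import Counter


def term_density(words, line_count):
    """Map each distinct word to (occurrences in the text) / (total lines).

    The ratio is an assumed reading of the definitions of WdNum_w and LineNum.
    """
    if line_count <= 0:
        raise ValueError("line_count must be positive")
    counts = Counter(words)  # WdNum_w for every distinct word
    return {word: n / line_count for word, n in counts.items()}
```

Under this assumption a word that appears three times in a two-line text has a density of 1.5, so the density is not bounded by 1.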
Step S1032: Divide the legal text into FN text paragraphs, and record whether each word in the word set appears in each text paragraph.
FN is an integer greater than 1. The text paragraphs can be divided as the situation requires. In a specific implementation of this embodiment, every KN lines of the legal text can be treated as one text paragraph: lines 1 to KN of the legal text form the first text paragraph, lines KN+1 to 2×KN form the second text paragraph, lines 2×KN+1 to 3×KN form the third text paragraph, and so on. Then:
FN = Ceil(LineNum / KN)
where Ceil is the round-up (ceiling) function. The value of KN can be set as the situation requires; for example, it can be set to 3, 5, 10, or another value.
Step S1033: Calculate the uniformity of each word in the word set.
Specifically, the uniformity of each word in the word set can be calculated according to the following formula:
where f is the index of a text paragraph of the legal text, 1 ≤ f ≤ FN, Flag_{w,f} is a flag indicating whether the w-th word in the word set appears in the f-th text paragraph, and WdEqu_w is the uniformity of the w-th word in the word set.
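Steps S1032 and S1033 can be sketched as below. The paragraph split follows the KN-lines-per-paragraph scheme described above. The uniformity formula is not reproduced in this text, so the code assumes that the flag is 1 when the word appears in a paragraph and 0 otherwise, and that uniformity is the share of paragraphs with flag 1; it also assumes a plain substring test on raw lines. The names `split_into_paragraphs` and `uniformity` are hypothetical.

```python
import math


def split_into_paragraphs(lines, kn):
    """Group every kn consecutive lines into one text paragraph."""
    fn = math.ceil(len(lines) / kn)  # number of paragraphs, rounded up
    return [lines[i * kn:(i + 1) * kn] for i in range(fn)]


def uniformity(word, paragraphs):
    """Assumed formula: fraction of paragraphs in which the word appears."""
    flags = [
        1 if any(word in line for line in paragraph) else 0
        for paragraph in paragraphs
    ]
    return sum(flags) / len(paragraphs)
```

With five lines and KN = 2 the split produces three paragraphs, the last one holding a single line; a word found in two of them gets a uniformity of 2/3 under this assumption.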
Step S1034: Select from the word set the words whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
The specific values of the first threshold and the second threshold can be set according to the actual situation.
In a specific implementation of this embodiment, the term density sequence shown below can first be constructed in descending order of value:
DensitySet = {WdDensity_1, WdDensity_2, ..., WdDensity_w, ..., WdDensity_WN}
where DensitySet is the term density sequence.
Then, according to a preset first selection ratio, the top-ranked values are selected from the term density sequence and arranged into the maximum term density sequence shown below:
MaxDensitySet = {MaxWdDensity_1, MaxWdDensity_2, ..., MaxWdDensity_nmax, ..., MaxWdDensity_MaxNum}
where MaxDensitySet is the maximum term density sequence, MaxNum is the number of values in the maximum term density sequence, MaxNum = WN × η_1, η_1 is the first selection ratio, which can be set to 0.2, 0.3, 0.4, or another value according to the actual situation, nmax is the index of a value in the maximum term density sequence, 1 ≤ nmax ≤ MaxNum, and MaxWdDensity_nmax is the nmax-th value of the maximum term density sequence.
Next, according to a preset second selection ratio, the bottom-ranked values are selected from the term density sequence and arranged into the minimum term density sequence shown below:
MinDensitySet = {MinWdDensity_1, MinWdDensity_2, ..., MinWdDensity_nmin, ..., MinWdDensity_MinNum}
where MinDensitySet is the minimum term density sequence, MinNum is the number of values in the minimum term density sequence, MinNum = WN × η_2, η_2 is the second selection ratio, which can be set to 0.2, 0.3, 0.4, or another value according to the actual situation, nmin is the index of a value in the minimum term density sequence, 1 ≤ nmin ≤ MinNum, and MinWdDensity_nmin is the nmin-th value of the minimum term density sequence.
The median term density sequence shown below is then constructed:
MidDensitySet = {MidWdDensity_1, MidWdDensity_2, ..., MidWdDensity_nmid, ..., MidWdDensity_MidNum}
where MidDensitySet is the median term density sequence, MidDensitySet = DensitySet - MaxDensitySet - MinDensitySet, MidNum is the number of values in the median term density sequence, MidNum = WN × (1 - η_1 - η_2), nmid is the index of a value in the median term density sequence, 1 ≤ nmid ≤ MidNum, and MidWdDensity_nmid is the nmid-th value of the median term density sequence.
Finally, the first threshold is calculated according to the following formula:
where λ is a preset coefficient, λ > 0, and FstThresh is the first threshold.
The process of setting the second threshold is similar to that of setting the first threshold; it is only necessary to replace term density with uniformity throughout. For details, refer to the description above, which is not repeated here.
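The splitting of the sorted sequence and the final selection of step S1034 can be sketched as follows. This is an illustration only: the formula that combines the three sequences into FstThresh with the coefficient λ is not reproduced in this text, so the sketch stops at the split and takes the two thresholds as inputs. Rounding WN × η down to an integer is an assumption, and the names `split_density_sequence` and `select_core_words` are hypothetical.

```python
def split_density_sequence(values, eta1, eta2):
    """Split values (sorted descending) into maximum, median and minimum parts."""
    ordered = sorted(values, reverse=True)   # DensitySet, largest first
    wn = len(ordered)
    max_num = int(wn * eta1)                 # MaxNum, rounded down (assumption)
    min_num = int(wn * eta2)                 # MinNum, rounded down (assumption)
    max_set = ordered[:max_num]              # top-ranked values
    min_set = ordered[wn - min_num:]         # bottom-ranked values
    mid_set = ordered[max_num:wn - min_num]  # everything in between
    return max_set, mid_set, min_set


def select_core_words(density, evenness, first_threshold, second_threshold):
    """Keep the words that exceed both thresholds."""
    return {
        word
        for word in density
        if density[word] > first_threshold
        and evenness.get(word, 0.0) > second_threshold
    }
```

The same split can be applied to the uniformity values, or to the word frequency ratios used later for the third threshold, since the text describes those threshold-setting processes as analogous.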
Step S104: Select a target server from a preset server group according to the core word subset.
The target server is the server used to archive the legal text. In this embodiment, the server group may include three servers, used respectively to archive legal texts in the three legal fields of civil, criminal, and administrative law.
As shown in FIG. 3, step S104 may specifically include:
Step S1041: Look up the first feature vector of each word in the core word subset in a preset first word list.
The first feature vector of each word consists of components in T dimensions, each dimension corresponding to the feature value for one server, and T is an integer greater than 1. When all legal texts are divided into the three legal fields of civil, criminal, and administrative law, T = 3.
The wording of legal texts tends to differ considerably between legal fields: some words appear frequently in one legal field but very rarely in the others. This embodiment exploits that property and builds the first word list in advance through the big data analysis process shown in FIG. 4:
Step S10411: Perform word segmentation on each legal text in a preset general legal text library to obtain the words that make up the general legal text library.
The general legal text library includes the legal text libraries corresponding to the individual legal fields. It should contain as many as possible of the legal texts obtained within a certain statistical time period. The statistical time period can be set according to the actual situation; for example, it can be set to the week, month, quarter, or year preceding the current moment.
All legal texts in the general legal text library are divided into several legal text libraries according to the legal field to which they belong, and each legal text library corresponds to one legal field. For example, the general legal text library can be divided into a civil legal text library, a criminal legal text library, an administrative legal text library, and so on. Correspondingly, each legal text library also corresponds to the server that archives that legal field.
The word segmentation process is similar to that in step S102; for details, refer to the description of step S102, which is not repeated here.
Step S10412: Count the number of times each word that makes up the general legal text library appears in each legal text library.
In this embodiment, the number of times each word that makes up the general legal text library appears in each legal text library can be recorded as the sequence shown below:
WNSeq_sw = (WordNum_{sw,1}, WordNum_{sw,2}, ..., WordNum_{sw,t}, ..., WordNum_{sw,T})
where t is the index of a server in the server group (that is, the index of a legal text library), 1 ≤ t ≤ T, sw is the index of a word that makes up the general legal text library, 1 ≤ sw ≤ SWN, SWN is the total number of words that make up the general legal text library, WordNum_{sw,t} is the number of times the sw-th word appears in the legal text library corresponding to the t-th server, and WNSeq_sw is the sequence of occurrence counts of the sw-th word across the legal text libraries.
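The counting in step S10412 can be sketched as below. It assumes each legal text library has already been segmented into a flat list of words; the function name `count_by_library` is hypothetical.

```python
from collections import Counter


def count_by_library(libraries):
    """Count every word's occurrences in each library.

    libraries: list of T word lists, one per server's legal text library.
    Returns {word: [count in library 1, ..., count in library T]}.
    """
    per_library = [Counter(words) for words in libraries]
    vocabulary = set().union(*per_library)  # all words seen in any library
    return {
        word: [counts[word] for counts in per_library]  # missing words count 0
        for word in vocabulary
    }
```

Each value in the returned mapping plays the role of one WNSeq_sw sequence, with position t holding the count for the t-th library.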
Step S10413: Calculate the feature value of each word that makes up the general legal text library with respect to each server.
Specifically, the feature value of each word with respect to each server can be calculated according to the following formula:
where ln is the natural logarithm function, and EigVal_{sw,t} is the feature value of the sw-th word that makes up the general legal text library with respect to the t-th server.
As the formula shows, EigVal_{sw,t} is positively correlated with WordNum_{sw,t}: the more often a word appears in the legal text library corresponding to a server, the higher the feature value of that word with respect to that server.
Step S10414: Construct the first feature vector of each word that makes up the general legal text library.
Specifically, the first feature vector of each word can be constructed as follows:
EigVec_sw = (EigVal_{sw,1}, EigVal_{sw,2}, ..., EigVal_{sw,t}, ..., EigVal_{sw,T})
where EigVec_sw is the first feature vector of the sw-th word that makes up the general legal text library.
Step S10415: Assemble the first feature vectors of the words that make up the general legal text library into the first word list.
The process shown in FIG. 4 completes the setup of the first word list, which provides the basis for the subsequent archiving of legal texts.
Step S1042: According to the first feature vectors of the words in the core word subset, calculate the probability value of the legal text being archived into each server in the server group.
Specifically, the probability value of the legal text being archived into each server in the server group can be calculated according to the following formula:
where c is the index of a word in the core word subset, 1 ≤ c ≤ CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset with respect to the t-th server, and LawDom_t is the probability value of the legal text being archived into the t-th server.
Step S1043: Determine the server with the largest probability value as the target server.
Specifically, the target server can be selected as follows:
TgtLawDom = Argmax(LawDomSq) = Argmax(LawDom_1, LawDom_2, ..., LawDom_t, ..., LawDom_T)
where Argmax is the function that returns the argument of the maximum, LawDomSq is the first probability value sequence of the legal text, LawDomSq = (LawDom_1, LawDom_2, ..., LawDom_t, ..., LawDom_T), and TgtLawDom is the index of the target server.
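Steps S1041 to S1043 can be sketched as follows. The probability formula is not reproduced in this text, so the scoring here is a stand-in and not the embodiment's formula: it assumes non-negative feature values, sums them per server over the core words, and normalizes the sums so that they add up to 1. Skipping words that are missing from the word list is also an assumption. Only the final argmax step is taken directly from the description. The name `choose_target` is hypothetical.

```python
def choose_target(words, word_list, target_count):
    """Pick the 1-based index of the most likely target and return all scores.

    words: the selected subset of words from the legal text.
    word_list: {word: feature vector with target_count components}.
    """
    totals = [0.0] * target_count
    for word in words:
        vector = word_list.get(word)
        if vector is None:  # word absent from the list: skipped (assumption)
            continue
        for t, value in enumerate(vector):
            totals[t] += value

    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("no selected word has a usable feature vector")

    # Stand-in scoring: normalized per-target sums (not the source formula).
    probabilities = [value / grand_total for value in totals]
    best = max(range(target_count), key=probabilities.__getitem__)
    return best + 1, probabilities
```

Because the description handles disk partitions in the same way, the same routine could be called with the auxiliary word subset, the second word list, and the number of partitions to pick a target partition, under the same assumptions.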
Step S105: Select an auxiliary word subset from the word set.
The auxiliary word subset includes the words for which the ratio of the first word frequency to the second word frequency is greater than a preset third threshold, where the first word frequency is the frequency of occurrence in the legal text and the second word frequency is the frequency of occurrence in the legal text library corresponding to the target server.
Specifically, the first word frequency of each word in the word set can first be calculated according to the following formula:
where FstFrq_w is the first word frequency of the w-th word in the word set.
Then, the second word frequency of each word in the word set is calculated according to the following formula:
where LibWdNum_w is the number of times the w-th word in the word set appears in the legal text library corresponding to the target server, and SndFrq_w is the second word frequency of the w-th word in the word set.
Finally, the words for which the ratio of the first word frequency to the second word frequency is greater than the third threshold are selected from the word set to form the auxiliary word subset.
The process of setting the third threshold is similar to that of setting the first threshold; it is only necessary to replace term density with the ratio of the first word frequency to the second word frequency throughout. For details, refer to the description above, which is not repeated here.
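The selection in step S105 can be sketched as below. The two frequency formulas are not reproduced in this text, so the code assumes relative frequencies: occurrences divided by the total number of words in the legal text and in the target server's library, respectively. Words that never occur in the library are skipped because the ratio is undefined for them; that handling is an assumption, as is the function name `select_auxiliary_words`.

```python
from collections import Counter


def select_auxiliary_words(text_words, library_words, third_threshold):
    """Select words that are over-represented in the text relative to the library."""
    text_counts = Counter(text_words)
    library_counts = Counter(library_words)
    text_total = len(text_words)
    library_total = len(library_words)

    selected = set()
    for word, count in text_counts.items():
        library_count = library_counts[word]
        if library_count == 0:
            continue  # ratio undefined; skipped (assumption)
        first_freq = count / text_total             # assumed first word frequency
        second_freq = library_count / library_total  # assumed second word frequency
        if first_freq / second_freq > third_threshold:
            selected.add(word)
    return selected
```

The effect is that very common words in the library, which say little about the specific category of a text, fall below the threshold, while words that are unusually frequent in the text are kept.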
Step S106: Determine the target partition for the legal text in the target server according to the auxiliary word subset.
The target partition is the disk partition used to archive the legal text. In this embodiment, each legal field can be further subdivided into multiple categories. Taking the civil law field as an example, it can be divided into the following eight categories:
(1) Disputes over property rights between citizens or between citizens and legal persons, mostly disputes over the possession, use, proceeds, and disposal of property.
(2) Disputes between citizens arising from contractual acts such as sale, lease, loan, gift, and pawn, as well as disputes arising from the inheritance of estates.
(3) Debt disputes arising from unjust enrichment, management of another's affairs without mandate, and the like, as well as compensation disputes arising from damage to property.
(4) Disputes arising from personal rights, mainly infringements of citizens' rights to health, name, reputation, honor, and portrait.
(5) Disputes arising from infringement of citizens' invention rights (patent rights) and authorship rights (copyright).
(6) Marriage and family disputes, mainly divorce and the resulting disputes over the division of property and child support, as well as disputes among family members over the support of elders, the upbringing of children, and mutual maintenance.
(7) Disputes arising from economic contracts, enterprise labor and employment, enterprise contracting, land contracting, neighboring rights, and the like.
(8) Other civil litigation cases that, as prescribed by law or by the judicial interpretation documents of the Supreme People's Court, shall be accepted by the people's courts.
In this embodiment, each server can be divided into several disk partitions, and each disk partition is used to archive legal texts of one category.
As shown in FIG. 5, step S106 may specifically include:
Step S1061: Look up the second feature vector of each word in the auxiliary word subset in a preset second word list.
The second feature vector of each word consists of components in ST dimensions, each dimension corresponding to the feature value for one disk partition, and ST is the total number of disk partitions in the target server.
The process of setting the second word list is similar to the process of setting the first word list shown in FIG. 4. The legal text library corresponding to the target server includes the legal text sub-libraries corresponding to the individual disk partitions. First, the number of times each word of the general legal text library appears in each legal text sub-library is counted; then the feature value of each word with respect to each disk partition in the target server can be calculated according to the following formula:
where st is the index of a disk partition in the target server, 1 ≤ st ≤ ST, WordNum_{sw,st} is the number of times the sw-th word that makes up the general legal text library appears in the legal text sub-library corresponding to the st-th disk partition in the target server, and EigVal_{sw,st} is the feature value of the sw-th word with respect to the st-th disk partition in the target server.
Finally, the second feature vector of each word that makes up the general legal text library is constructed as follows, and the second feature vectors of these words are assembled into the second word list:
SdEigVec_sw = (EigVal_{sw,1}, EigVal_{sw,2}, ..., EigVal_{sw,st}, ..., EigVal_{sw,ST})
where SdEigVec_sw is the second feature vector of the sw-th word that makes up the general legal text library.
Step S1062: According to the second feature vectors of the words in the auxiliary word subset, calculate the probability value of the legal text belonging to each disk partition in the target server.
Specifically, the probability value of the legal text belonging to each disk partition in the target server can be calculated according to the following formula:
where sub is the index of a word in the auxiliary word subset, 1 ≤ sub ≤ SubNum, SubNum is the number of words in the auxiliary word subset, EigVal_{sub,st} is the feature value of the sub-th word in the auxiliary word subset with respect to the st-th disk partition in the target server, and LawType_st is the probability value of the legal text belonging to the st-th disk partition in the target server.
Step S1063: Determine the disk partition with the largest probability value as the target partition for the legal text in the target server.
Specifically, the target partition for the legal text in the target server can be selected as follows:
TgtLawType = Argmax(LawTypeSq) = Argmax(LawType_1, LawType_2, ..., LawType_st, ..., LawType_ST)
where LawTypeSq is the second probability value sequence of the legal text, LawTypeSq = (LawType_1, LawType_2, ..., LawType_st, ..., LawType_ST), and TgtLawType is the index of the target partition for the legal text in the target server.
Step S107: Archive the legal text into the target partition in the target server.
In summary, in the embodiments of the present application, after the relevant instruction is received, the legal text can be obtained automatically. Through automated text analysis, a core word subset that effectively characterizes the core content of the legal text is selected from it, and the server for archiving the legal text (the target server) is determined on that basis. An auxiliary word subset is then selected from the word set, the disk partition for archiving the legal text (the target partition) is determined on that basis, and the legal text is archived into the target partition in the target server. In this way, legal texts are archived into the disk partitions of the individual servers according to their actual core content. When a user needs to look up related materials, it is only necessary to search the disk partition of the corresponding server, which saves labor and greatly improves work efficiency.
Corresponding to the legal text archiving method described in the above embodiment, FIG. 6 shows a structural diagram of an embodiment of a legal text archiving apparatus provided in the embodiments of the present application.
In this embodiment, a legal text archiving apparatus may include:
a legal text obtaining module 601, configured to receive a legal text archiving instruction, extract the target address from the legal text archiving instruction, and obtain the legal text at the target address;
a word segmentation module 602, configured to perform word segmentation on the legal text to obtain the word set that makes up the legal text;
a core word subset selection module 603, configured to select a core word subset from the word set, the core word subset including the words whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
a target server determining module 604, configured to select a target server from a preset server group according to the core word subset, the target server being the server used to archive the legal text;
an auxiliary word subset selection module 605, configured to select an auxiliary word subset from the word set, the auxiliary word subset including the words for which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold, the first word frequency being the frequency of occurrence in the legal text, and the second word frequency being the frequency of occurrence in the legal text library corresponding to the target server;
a partition determining module 606, configured to determine a target partition for the legal text in the target server according to the auxiliary word subset, the target partition being the disk partition used to archive the legal text;
an archiving module 607, configured to archive the legal text into the target partition in the target server.
Further, the core word subset selection module may include:
a term density calculation unit, configured to calculate the term density of each word in the word set;
a text paragraph dividing unit, configured to divide the legal text into FN text paragraphs and to record the occurrence of each word of the word set in each text paragraph, FN being an integer greater than 1;
a uniformity calculation unit, configured to calculate the uniformity of each word in the word set;
a core word subset selection unit, configured to select from the word set the words whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
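The four units above can be sketched in Python as follows. The formulas for term density and uniformity are not reproduced in this text, so the sketch assumes that term density is a word's occurrence count divided by the total number of lines, and that uniformity is the fraction of the FN paragraphs in which the word occurs. The function names and the sample data are hypothetical.

```python
from collections import Counter


def term_density(words, line_count):
    """Assumed form: occurrences of each word divided by the total line count."""
    counts = Counter(words)
    return {w: n / line_count for w, n in counts.items()}


def uniformity(paragraphs):
    """Assumed form: fraction of the FN paragraphs in which each word occurs."""
    fn = len(paragraphs)
    seen = Counter()
    for paragraph in paragraphs:
        # set() reduces the per-paragraph count to a 0/1 presence flag
        seen.update(set(paragraph))
    return {w: n / fn for w, n in seen.items()}


def select_core_words(paragraphs, line_count, density_threshold, uniformity_threshold):
    """Keep the words that exceed both thresholds.

    paragraphs is a list of already segmented paragraphs (lists of words);
    line_count must be a positive integer.
    """
    words = [w for paragraph in paragraphs for w in paragraph]
    density = term_density(words, line_count)
    spread = uniformity(paragraphs)
    return {
        w
        for w in density
        if density[w] > density_threshold and spread[w] > uniformity_threshold
    }


# Hypothetical segmented text: three paragraphs spread over four lines.
paragraphs = [
    ["contract", "party", "contract"],
    ["contract", "breach"],
    ["party", "contract"],
]
core_words = select_core_words(paragraphs, 4, 0.4, 0.5)
```

Requiring both thresholds filters out words that are frequent but confined to a single paragraph as well as words that appear everywhere but only rarely.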
Further, the target server determining module may include:
a first feature vector query unit, configured to query a preset first word list for the first feature vector of each word in the core word subset;
a probability value calculation unit, configured to calculate, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group;
a target server determining unit, configured to determine the server with the largest probability value as the target server.
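A minimal Python sketch of these three units follows. The probability formula is not reproduced in this text, so the sketch makes an assumption: each server's probability is the sum of that server's feature values over all core words, normalized by the grand total over all servers. The names `server_probabilities`, `pick_target_server`, and `feature_table` are hypothetical.

```python
def server_probabilities(core_words, feature_table):
    """Return one probability per server.

    feature_table maps a word to its first feature vector, a list of T
    values with one entry per server. The normalized column sum used
    here is an assumption, not the formula from the patent.
    """
    vectors = [feature_table[w] for w in core_words if w in feature_table]
    if not vectors:
        raise ValueError("no core word has a feature vector")
    # zip(*vectors) regroups the values by server (column-wise)
    totals = [sum(column) for column in zip(*vectors)]
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("all feature values are zero")
    return [total / grand_total for total in totals]


def pick_target_server(core_words, feature_table):
    """Return the index of the server with the largest probability value."""
    probabilities = server_probabilities(core_words, feature_table)
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical word list for a group of three servers.
feature_table = {
    "contract": [3.0, 1.0, 0.0],
    "party": [1.0, 2.0, 1.0],
}
probabilities = server_probabilities(["contract", "party"], feature_table)
target = pick_target_server(["contract", "party"], feature_table)
```

Whatever the exact formula, the final step is an argmax over the per-server values, so any monotone rescaling of the probabilities selects the same target server.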
Further, the auxiliary word subset selection module may include:
a first word frequency calculation unit, configured to calculate the first word frequency of each word in the word set;
a second word frequency calculation unit, configured to calculate the second word frequency of each word in the word set;
an auxiliary word subset selection unit, configured to select from the word set the words for which the ratio of the first word frequency to the second word frequency is greater than the third threshold to form the auxiliary word subset.
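The ratio test performed by these units can be sketched in Python as below. Two points are assumptions rather than statements from the text: "frequency" is taken to be a relative frequency (count divided by total words), and a word absent from the server's legal text library is given a small floor frequency so that the ratio stays defined. The function names and the `floor` parameter are hypothetical.

```python
from collections import Counter


def relative_frequencies(words):
    """Assumed definition of word frequency: count divided by total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def select_auxiliary_words(text_words, library_words, ratio_threshold, floor=1e-9):
    """Keep the words that are over-represented in the text relative to the library.

    text_words: segmented words of the legal text (first word frequency).
    library_words: segmented words of the target server's library (second word frequency).
    floor: hypothetical stand-in frequency for words missing from the library.
    """
    first = relative_frequencies(text_words)
    second = relative_frequencies(library_words)
    return {
        w
        for w, f1 in first.items()
        if f1 / max(second.get(w, 0.0), floor) > ratio_threshold
    }


# Hypothetical data: "lease" and "tenant" stand out against the library.
text_words = ["lease", "lease", "contract", "tenant"]
library_words = ["contract"] * 6 + ["lease"] * 2 + ["party"] * 2
auxiliary = select_auxiliary_words(text_words, library_words, 2.0)
```

A high ratio marks words that distinguish this text from the other texts already on the target server, which is what makes them useful for choosing a partition within that server.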
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus, modules, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
FIG. 7 shows a schematic block diagram of a terminal device provided by an embodiment of this application. For ease of description, only the parts related to the embodiment are shown.
In this embodiment, the terminal device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device 7 may include a processor 70, a memory 71, and computer-readable instructions 72 that are stored in the memory 71 and executable on the processor 70, for example computer-readable instructions for carrying out the legal text archiving method described above. When the processor 70 executes the computer-readable instructions 72, the steps of the foregoing legal text archiving method embodiments are implemented.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or other media in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (20)
- A legal text archiving method, characterized by comprising:
  receiving a legal text archiving instruction, extracting a target address from the legal text archiving instruction, and obtaining a legal text from the target address;
  performing word segmentation on the legal text to obtain a word set composing the legal text;
  selecting a core word subset from the word set, the core word subset comprising each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
  selecting a target server from a preset server group according to the core word subset, the target server being the server used to archive the legal text;
  selecting an auxiliary word subset from the word set, the auxiliary word subset comprising each word for which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold, the first word frequency being the frequency of occurrence in the legal text and the second word frequency being the frequency of occurrence in a legal text library corresponding to the target server;
  determining, according to the auxiliary word subset, a target partition for the legal text on the target server, the target partition being the disk partition used to archive the legal text;
  archiving the legal text into the target partition on the target server.
- The legal text archiving method according to claim 1, characterized in that selecting the core word subset from the word set comprises:
  calculating the term density of each word in the word set according to the following formula:
  where w is the index of each word in the word set, 1 ≤ w ≤ WN, WN is the number of words in the word set, WdNum_w is the number of times the w-th word in the word set occurs in the legal text, LineNum is the total number of lines of the legal text, and WdDensity_w is the term density of the w-th word in the word set;
  dividing the legal text into FN text paragraphs and recording the occurrence of each word of the word set in each text paragraph, FN being an integer greater than 1;
  calculating the uniformity of each word in the word set according to the following formula:
  where f is the index of each text paragraph of the legal text, 1 ≤ f ≤ FN, Flag_{w,f} is the flag indicating whether the w-th word in the word set occurs in the f-th text paragraph, and WdEqu_w is the uniformity of the w-th word in the word set;
  selecting from the word set the words whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
- The legal text archiving method according to claim 1, characterized in that selecting the target server from the preset server group according to the core word subset comprises:
  querying a preset first word list for the first feature vector of each word in the core word subset, wherein the first feature vector of each word consists of components in T dimensions, each dimension corresponding to the feature value of one server, T being an integer greater than 1;
  calculating, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group;
  determining the server with the largest probability value as the target server.
- The legal text archiving method according to claim 3, characterized in that calculating, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group comprises:
  calculating the probability value of the legal text being archived into each server in the server group according to the following formula:
  where t is the index of each server in the server group, 1 ≤ t ≤ T, c is the index of each word in the core word subset, 1 ≤ c ≤ CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being archived into the t-th server.
- The legal text archiving method according to any one of claims 1 to 4, characterized in that selecting the auxiliary word subset from the word set comprises:
  calculating the first word frequency of each word in the word set;
  calculating the second word frequency of each word in the word set;
  selecting from the word set the words for which the ratio of the first word frequency to the second word frequency is greater than the third threshold to form the auxiliary word subset.
- A legal text archiving apparatus, characterized by comprising:
  a legal text obtaining module, configured to receive a legal text archiving instruction, extract a target address from the legal text archiving instruction, and obtain a legal text from the target address;
  a word segmentation module, configured to perform word segmentation on the legal text to obtain a word set composing the legal text;
  a core word subset selection module, configured to select a core word subset from the word set, the core word subset comprising each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
  a target server determining module, configured to select a target server from a preset server group according to the core word subset, the target server being the server used to archive the legal text;
  an auxiliary word subset selection module, configured to select an auxiliary word subset from the word set, the auxiliary word subset comprising each word for which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold, the first word frequency being the frequency of occurrence in the legal text and the second word frequency being the frequency of occurrence in a legal text library corresponding to the target server;
  a partition determining module, configured to determine, according to the auxiliary word subset, a target partition for the legal text on the target server, the target partition being the disk partition used to archive the legal text;
  an archiving module, configured to archive the legal text into the target partition on the target server.
- The legal text archiving apparatus according to claim 6, characterized in that the core word subset selection module comprises:
  a term density calculation unit, configured to calculate the term density of each word in the word set;
  a text paragraph dividing unit, configured to divide the legal text into FN text paragraphs and to record the occurrence of each word of the word set in each text paragraph, FN being an integer greater than 1;
  a uniformity calculation unit, configured to calculate the uniformity of each word in the word set;
  a core word subset selection unit, configured to select from the word set the words whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
- The legal text archiving apparatus according to claim 6, characterized in that the target server determining module comprises:
  a first feature vector query unit, configured to query a preset first word list for the first feature vector of each word in the core word subset, wherein the first feature vector of each word consists of components in T dimensions, each dimension corresponding to the feature value of one server, T being an integer greater than 1;
  a probability value calculation unit, configured to calculate, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group;
  a target server determining unit, configured to determine the server with the largest probability value as the target server.
- The legal text archiving apparatus according to claim 8, characterized in that the probability value calculation unit is specifically configured to calculate the probability value of the legal text being archived into each server in the server group according to the following formula:
  where t is the index of each server in the server group, 1 ≤ t ≤ T, c is the index of each word in the core word subset, 1 ≤ c ≤ CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being archived into the t-th server.
- The legal text archiving apparatus according to any one of claims 6 to 9, characterized in that the auxiliary word subset selection module comprises:
  a first word frequency calculation unit, configured to calculate the first word frequency of each word in the word set;
  a second word frequency calculation unit, configured to calculate the second word frequency of each word in the word set;
  an auxiliary word subset selection unit, configured to select from the word set the words for which the ratio of the first word frequency to the second word frequency is greater than the third threshold to form the auxiliary word subset.
- A non-volatile computer-readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following steps:
  receiving a legal text archiving instruction, extracting a target address from the legal text archiving instruction, and obtaining a legal text from the target address;
  performing word segmentation on the legal text to obtain a word set composing the legal text;
  selecting a core word subset from the word set, the core word subset comprising each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
  selecting a target server from a preset server group according to the core word subset, the target server being the server used to archive the legal text;
  selecting an auxiliary word subset from the word set, the auxiliary word subset comprising each word for which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold, the first word frequency being the frequency of occurrence in the legal text and the second word frequency being the frequency of occurrence in a legal text library corresponding to the target server;
  determining, according to the auxiliary word subset, a target partition for the legal text on the target server, the target partition being the disk partition used to archive the legal text;
  archiving the legal text into the target partition on the target server.
- The non-volatile computer-readable storage medium according to claim 11, characterized in that selecting the core word subset from the word set comprises:
  calculating the term density of each word in the word set;
  dividing the legal text into FN text paragraphs and recording the occurrence of each word of the word set in each text paragraph, FN being an integer greater than 1;
  calculating the uniformity of each word in the word set;
  selecting from the word set the words whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
- The non-volatile computer-readable storage medium according to claim 11, characterized in that selecting the target server from the preset server group according to the core word subset comprises:
  querying a preset first word list for the first feature vector of each word in the core word subset, wherein the first feature vector of each word consists of components in T dimensions, each dimension corresponding to the feature value of one server, T being an integer greater than 1;
  calculating, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group;
  determining the server with the largest probability value as the target server.
- The non-volatile computer-readable storage medium according to claim 13, characterized in that calculating, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group comprises:
  calculating the probability value of the legal text being archived into each server in the server group according to the following formula:
  where t is the index of each server in the server group, 1 ≤ t ≤ T, c is the index of each word in the core word subset, 1 ≤ c ≤ CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being archived into the t-th server.
- The non-volatile computer-readable storage medium according to any one of claims 11 to 14, characterized in that selecting the auxiliary word subset from the word set comprises:
  calculating the first word frequency of each word in the word set;
  calculating the second word frequency of each word in the word set;
  selecting from the word set the words for which the ratio of the first word frequency to the second word frequency is greater than the third threshold to form the auxiliary word subset.
- A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer-readable instructions, implements the following steps:
  receiving a legal text archiving instruction, extracting a target address from the legal text archiving instruction, and obtaining a legal text from the target address;
  performing word segmentation on the legal text to obtain a word set composing the legal text;
  selecting a core word subset from the word set, the core word subset comprising each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;
  selecting a target server from a preset server group according to the core word subset, the target server being the server used to archive the legal text;
  selecting an auxiliary word subset from the word set, the auxiliary word subset comprising each word for which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold, the first word frequency being the frequency of occurrence in the legal text and the second word frequency being the frequency of occurrence in a legal text library corresponding to the target server;
  determining, according to the auxiliary word subset, a target partition for the legal text on the target server, the target partition being the disk partition used to archive the legal text;
  archiving the legal text into the target partition on the target server.
- The terminal device according to claim 16, characterized in that selecting the core word subset from the word set comprises:
  calculating the term density of each word in the word set;
  dividing the legal text into FN text paragraphs and recording the occurrence of each word of the word set in each text paragraph, FN being an integer greater than 1;
  calculating the uniformity of each word in the word set;
  selecting from the word set the words whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
- The terminal device according to claim 16, characterized in that selecting the target server from the preset server group according to the core word subset comprises:
  querying a preset first word list for the first feature vector of each word in the core word subset, wherein the first feature vector of each word consists of components in T dimensions, each dimension corresponding to the feature value of one server, T being an integer greater than 1;
  calculating, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group;
  determining the server with the largest probability value as the target server.
- The terminal device according to claim 18, characterized in that calculating, according to the first feature vectors of the words in the core word subset, the probability value of the legal text being archived into each server in the server group comprises:
  calculating the probability value of the legal text being archived into each server in the server group according to the following formula:
  where t is the index of each server in the server group, 1 ≤ t ≤ T, c is the index of each word in the core word subset, 1 ≤ c ≤ CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being archived into the t-th server.
- The terminal device according to any one of claims 16 to 19, characterized in that selecting the auxiliary word subset from the word set comprises:
  calculating the first word frequency of each word in the word set;
  calculating the second word frequency of each word in the word set;
  selecting from the word set the words for which the ratio of the first word frequency to the second word frequency is greater than the third threshold to form the auxiliary word subset.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910826813.9A CN110750493B (en) | 2019-09-03 | 2019-09-03 | Legal text filing method and device, readable storage medium and terminal equipment |
CN201910826813.9 | 2019-09-03 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021042554A1 true WO2021042554A1 (en) | 2021-03-11 |
Family
ID=69275998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/118148 WO2021042554A1 (en) | 2019-09-03 | 2019-11-13 | Method and apparatus for archiving legal text, readable storage medium, and terminal device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110750493B (en) |
WO (1) | WO2021042554A1 (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150169745A1 (en) * | 2012-03-30 | 2015-06-18 | Ubic, Inc. | Document Sorting System, Document Sorting Method, and Document Sorting Program |
CN107783989A (en) * | 2016-08-25 | 2018-03-09 | 北京国双科技有限公司 | Document belongs to the determination method and apparatus in field |
CN108009284A (en) * | 2017-12-22 | 2018-05-08 | 重庆邮电大学 | Using the Law Text sorting technique of semi-supervised convolutional neural networks |
CN108984518A (en) * | 2018-06-11 | 2018-12-11 | 人民法院信息技术服务中心 | A kind of file classification method towards judgement document |
CN109344400A (en) * | 2018-09-18 | 2019-02-15 | 江苏润桐数据服务有限公司 | A kind of judgment method and device of document storage |
CN109460468A (en) * | 2018-10-23 | 2019-03-12 | 出门问问信息科技有限公司 | Classifying method, categorization arrangement and the corresponding electronic equipment of law related text |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6985908B2 (en) * | 2001-11-01 | 2006-01-10 | Matsushita Electric Industrial Co., Ltd. | Text classification apparatus |
EP2384476A4 (en) * | 2009-01-30 | 2013-01-23 | Cbs Interactive Inc | Personalization engine for building a user profile |
US9002838B2 (en) * | 2009-12-17 | 2015-04-07 | Wausau Financial Systems, Inc. | Distributed capture system for use with a legacy enterprise content management system |
US9483557B2 (en) * | 2011-03-04 | 2016-11-01 | Microsoft Technology Licensing Llc | Keyword generation for media content |
US8442951B1 (en) * | 2011-12-07 | 2013-05-14 | International Business Machines Corporation | Processing archive content based on hierarchical classification levels |
CN109062972A (en) * | 2018-06-29 | 2018-12-21 | 平安科技(深圳)有限公司 | Web page classification method, device and computer readable storage medium |
CN109033212B (en) * | 2018-07-01 | 2021-09-07 | 上海新诤信知识产权服务股份有限公司 | Text classification method based on similarity matching |
CN109413192A (en) * | 2018-11-08 | 2019-03-01 | 内蒙古伊泰煤炭股份有限公司 | Data processing method, device, server and readable storage medium storing program for executing |
2019
- 2019-09-03 CN CN201910826813.9A patent/CN110750493B/en active Active
- 2019-11-13 WO PCT/CN2019/118148 patent/WO2021042554A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110750493A (en) | 2020-02-04 |
CN110750493B (en) | 2022-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021114810A1 (en) | Graph structure-based official document recommendation method, apparatus, computer device, and medium | |
CN105760474B (en) | Method and system for extracting feature words of document set based on position information | |
CN109783787A (en) | A kind of generation method of structured document, device and storage medium | |
CN109635082B (en) | Policy influence analysis method, device, computer equipment and storage medium | |
WO2017166912A1 (en) | Method and device for extracting core words from commodity short text | |
CN107122382B (en) | Patent classification method based on specification | |
US8725781B2 (en) | Sentiment cube | |
CN112507711B (en) | Text abstract extraction method and system | |
WO2020248379A1 (en) | Method for searching for similar network pages, and apparatus | |
CN110019792A (en) | File classification method and device and sorter model training method | |
CN108170666A (en) | A kind of improved method based on TF-IDF keyword extractions | |
US20140006369A1 (en) | Processing structured and unstructured data | |
WO2021027162A1 (en) | Non-full-cell table content extraction method and apparatus, and terminal device | |
CN106776695A (en) | The method for realizing the automatic identification of secretarial document value | |
CN109462635B (en) | Information pushing method, computer readable storage medium and server | |
WO2022105178A1 (en) | Keyword extraction method and related device | |
CN116610853A (en) | Search recommendation method, search recommendation system, computer device, and storage medium | |
Łągiewka et al. | Distributed image retrieval with colour and keypoint features | |
WO2021042554A1 (en) | Method and apparatus for archiving legal text, readable storage medium, and terminal device | |
Savyanavar et al. | Multi-document summarization using TF-IDF Algorithm | |
Song et al. | A novel automatic ontology construction method based on web data | |
CN109783175B (en) | Application icon management method and device, readable storage medium and terminal equipment | |
WO2021042511A1 (en) | Legal text storage method and device, readable storage medium and terminal device | |
WO2022257455A1 (en) | Determination metod and apparatus for similar text, and terminal device and storage medium | |
CN111401047A (en) | Method and device for generating dispute focus of legal document and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19944435; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 19944435; Country of ref document: EP; Kind code of ref document: A1 |