WO2021042554A1 - Method and apparatus for archiving legal text, readable storage medium, and terminal device

Info

Publication number: WO2021042554A1
Application number: PCT/CN2019/118148
Authority: WO
Inventors: 周剀; 文莉
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-09-03
Filing date: 2019-11-13
Publication date: 2021-03-11
Also published as: CN110750493A; CN110750493B

Abstract

A method and apparatus for archiving a legal text, a non-volatile computer readable storage medium, and a terminal device. The method comprises: receiving a legal text archiving instruction, extracting a target address in the legal text archiving instruction, and obtaining the legal text in the target address (S101); performing word segmentation on the legal text to obtain a word set constituting the legal text (S102); selecting a core word subset from the word set (S103); selecting a target server from a preset server group according to the core word subset (S104); selecting an auxiliary word subset from the word set (S105), the auxiliary word subset comprising words of which the ratio of a first word frequency to a second word frequency is greater than a preset third threshold; determining a target partition of the legal text in the target server according to the auxiliary word subset (S106); and archiving the legal text into the target partition in the target server (S107).

Description

Method, device, readable storage medium and terminal equipment for filing legal text

This application claims priority to Chinese patent application No. 201910826813.9, filed with the Chinese Patent Office on September 3, 2019 and entitled "a legal text filing method, device, readable storage medium and terminal equipment", the entire content of which is incorporated into this application by reference.

Technical field

This application belongs to the field of computer technology, and in particular relates to a method and device for filing legal texts, a non-volatile computer-readable storage medium, and terminal equipment.

Background

In courts, law firms, and other institutions, large numbers of legal texts often need to be filed promptly so that they can be queried later. The prior art provides a variety of methods for filing these legal texts; for example, they can be filed according to the person handling the matter, the handling unit, and the handling date. Although such filing makes the legal texts look orderly, it does not take into account the inherent relevance among them, which makes querying inconvenient. When users need to look up related materials, they often have to check the texts one by one, which consumes a lot of manpower and is extremely inefficient.

Technical problem

In view of this, the embodiments of the present application provide a legal text filing method and device, a non-volatile computer-readable storage medium, and terminal equipment, to solve the problem that existing legal text filing methods consume a lot of labor and are extremely inefficient.

Technical solutions

The first aspect of the embodiments of the present application provides a method for filing legal documents, which may include:

Receiving a legal text filing instruction, extracting the target address in the legal text filing instruction, and obtaining the legal text at the target address;

Performing word segmentation processing on the legal text to obtain a set of words that make up the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;

Selecting a target server from a preset server group according to the core word subset, where the target server is a server for archiving the legal text;

Selecting an auxiliary word subset from the word set, the auxiliary word subset including each word whose ratio of first word frequency to second word frequency is greater than a preset third threshold, where the first word frequency is the word's frequency of appearance in the legal text and the second word frequency is its frequency of appearance in the legal text database corresponding to the target server;

Determining a target partition of the legal text in the target server according to the auxiliary word subset, where the target partition is a disk partition for archiving the legal text;

Filing the legal text into the target partition in the target server.

The second aspect of the embodiments of the present application provides a legal text filing device, which may include modules for implementing the steps of the above-mentioned legal text filing method.

A third aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the above-mentioned legal text filing method.

The fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above-mentioned legal text filing method when executing the computer-readable instructions.

Beneficial effect

In the embodiments of this application, each legal text is archived into a disk partition of a server according to its actual core content. When a user needs to query related materials, the user only needs to search the disk partition of the corresponding server, which saves labor costs and greatly improves work efficiency.

Description of the drawings

FIG. 1 is a flowchart of an embodiment of a method for filing legal text in an embodiment of the application;

FIG. 2 is a schematic flowchart of selecting a core word subset from a word set;

FIG. 3 is a schematic flowchart of determining a target server according to a core word subset;

FIG. 4 is a schematic flowchart of the setting process of the first word list;

FIG. 5 is a schematic flowchart of determining the category of the legal text in the target server according to the auxiliary word subset;

FIG. 6 is a structural diagram of an embodiment of a legal document filing device in an embodiment of the application;

FIG. 7 is a schematic block diagram of a terminal device in an embodiment of the application.

Embodiments of the present invention

Referring to FIG. 1, an embodiment of a method for filing legal texts in an embodiment of the present application may include:

Step S101: Receive a legal text filing instruction, extract the target address in the legal text filing instruction, and obtain the legal text in the target address.

The legal texts include, but are not limited to, texts in legal provisions, legal essays, legal reports, legal analysis articles, indictments, rulings, and other legal-related materials.

When a user needs to file a certain legal text, the user can issue a legal text filing instruction to a preset terminal device through a human-computer interaction interface. The legal text filing instruction carries the address where the legal text is currently located, that is, the target address. The target address may be a storage address in the terminal device, or a storage address on the network or in a designated database. The terminal device is the execution subject of this embodiment. After receiving the legal text filing instruction, the terminal device can extract the target address from it and, according to the target address, obtain the legal text locally, from the network, or from the designated database.
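Step S101 can be illustrated with a minimal Python sketch. The instruction layout (a dict with a `target_address` key) and the function names are hypothetical; the text only says that the instruction carries the target address. The designated-database case is omitted because the text does not describe how such a database is accessed.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def extract_target_address(instruction: dict) -> str:
    # Hypothetical layout: the source does not specify how the
    # address is encoded inside the filing instruction.
    return instruction["target_address"]


def obtain_legal_text(address: str, encoding: str = "utf-8") -> str:
    scheme = urlparse(address).scheme
    if scheme in ("http", "https"):
        # Network address: download the text.
        with urlopen(address) as response:
            return response.read().decode(encoding)
    # Anything else is treated as a local storage path in this sketch.
    return Path(address).read_text(encoding=encoding)
```

A local call would look like `obtain_legal_text(extract_target_address({"target_address": "ruling.txt"}))`.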

Step S102: Perform word segmentation processing on the legal text to obtain a set of words constituting the legal text.

In the process of filing the legal text, the terminal device first performs word segmentation on it to obtain the set of words that constitute the legal text. Word segmentation refers to dividing the legal text into individual words. In this embodiment, a general dictionary and a legal dictionary can be combined to segment the legal text: the legal dictionary is used for a first round of segmentation, and the general dictionary is then used to segment the text remaining after the first round. In this way, legal-specific terms are segmented first and general terms second; any remaining text that matches neither legal terms nor general terms is split into single characters.
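The two-round segmentation can be sketched as follows. The text does not name a matching algorithm, so greedy forward longest-match is an assumption here, and the helper names `_greedy_split` and `segment` are hypothetical.

```python
def _greedy_split(text, vocab):
    """Scan left to right with longest-match; return (piece, matched) pairs."""
    max_len = max((len(w) for w in vocab), default=0)
    out, residue, i = [], "", 0
    while i < len(text):
        hit = None
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in vocab:
                hit = text[i:i + size]
                break
        if hit is None:
            residue += text[i]          # no dictionary word starts here
            i += 1
        else:
            if residue:
                out.append((residue, False))
                residue = ""
            out.append((hit, True))
            i += len(hit)
    if residue:
        out.append((residue, False))
    return out


def segment(text, legal_vocab, general_vocab):
    words = []
    # Round 1: legal dictionary.
    for piece, matched in _greedy_split(text, legal_vocab):
        if matched:
            words.append(piece)
            continue
        # Round 2: general dictionary on what round 1 left over.
        for sub, sub_matched in _greedy_split(piece, general_vocab):
            if sub_matched:
                words.append(sub)
            else:
                words.extend(sub)       # fall back to single characters
    return words
```

For example, with legal terms `{"合同", "违约金"}` and general terms `{"双方", "签订", "约定"}`, the string `"双方签订合同并约定违约金"` is split into six tokens, with `"并"` left as a single character.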

Step S103: Select a core word subset from the word set.

The core word subset includes each word whose term density is greater than the preset first threshold and the uniformity is greater than the preset second threshold.

As shown in FIG. 2, step S103 may specifically include the following steps:

Step S1031, respectively calculate the entry density of each word in the word set.

Specifically, the entry density of each word in the word set can be calculated according to the following formula:

Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, WdNum_w is the number of occurrences of the w-th word in the word set in the legal text, LineNum is the total number of lines of the legal text, and WdDensity_w is the entry density of the w-th word in the word set.
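The formula itself is not reproduced in this text. Judging only from the symbol definitions, one plausible reading is WdDensity_w = WdNum_w / LineNum (occurrences per line); this is an assumption, not a confirmed reconstruction. A minimal sketch under that assumption, with the hypothetical function name `entry_density`:

```python
from collections import Counter


def entry_density(words, line_num):
    """Assumed reading: WdDensity_w = WdNum_w / LineNum."""
    if line_num <= 0:
        raise ValueError("line_num must be positive")
    counts = Counter(words)             # WdNum_w for every distinct word
    return {word: n / line_num for word, n in counts.items()}
```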

Step S1032: Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph.

FN is an integer greater than 1. The text paragraphs can be divided according to specific conditions. In a specific implementation of this embodiment, every KN lines of the legal text can be regarded as one text paragraph: lines 1 to KN of the legal text form the first text paragraph, lines KN+1 to 2×KN form the second text paragraph, lines 2×KN+1 to 3×KN form the third text paragraph, and so on. Then:

$$FN = \mathrm{Ceil}\left(\frac{LineNum}{KN}\right)$$

Where Ceil is the round-up function. The value of KN can be set according to specific conditions; for example, it can be set to 3, 5, 10, or another value.
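The fixed-size paragraph division can be sketched in a few lines. With KN lines per paragraph, the number of paragraphs is the line count divided by KN, rounded up, so the last paragraph may be shorter. The function name `split_into_paragraphs` is hypothetical.

```python
import math


def split_into_paragraphs(lines, kn):
    """Group every kn consecutive lines into one text paragraph."""
    if kn <= 0:
        raise ValueError("kn must be positive")
    fn = math.ceil(len(lines) / kn)     # FN = Ceil(LineNum / KN)
    return [lines[i * kn:(i + 1) * kn] for i in range(fn)]
```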

Step S1033: Calculate the uniformity of each word in the word set respectively.

Specifically, the uniformity of each word in the word set can be calculated according to the following formula:

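Where f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag_{w,f} is the flag bit indicating whether the w-th word in the word set occurs in the f-th text paragraph, with:

$$\mathrm{Flag}_{w,f} = \begin{cases} 1, & \text{the } w\text{-th word occurs in the } f\text{-th text paragraph} \\ 0, & \text{otherwise} \end{cases}$$

WdEqu_w is the uniformity of the w-th word in the word set.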
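The uniformity formula is not reproduced in this text. A reading consistent with the symbol definitions is the share of text paragraphs in which the word occurs, that is, the sum of the occurrence flags over all FN paragraphs divided by FN; this is an assumption. A sketch under that assumption, where each paragraph is given as its list of segmented words and the function name `uniformity` is hypothetical:

```python
def uniformity(word_set, paragraph_words):
    """Assumed reading: WdEqu_w = (number of paragraphs containing w) / FN."""
    fn = len(paragraph_words)
    if fn == 0:
        raise ValueError("at least one text paragraph is required")
    seen = [set(p) for p in paragraph_words]
    # The membership test plays the role of the flag bit Flag_{w,f}.
    return {w: sum(1 for s in seen if w in s) / fn for w in word_set}
```

A word that appears many times but only in one paragraph gets a low uniformity, which is why the method combines this measure with the entry density.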

Step S1034: Select each word with an entry density greater than the first threshold and a uniformity greater than the second threshold from the word set to form the core word subset.

The specific values of the first threshold and the second threshold may be set according to actual conditions.

In a specific implementation of this embodiment, the following entry density sequence can be constructed first according to the order of value from largest to smallest:

$$\mathrm{DensitySet} = \{\mathrm{WdDensity}_1, \mathrm{WdDensity}_2, \ldots, \mathrm{WdDensity}_w, \ldots, \mathrm{WdDensity}_{WN}\}$$

Wherein, DensitySet is the term density sequence.

Then, according to the preset first selection ratio, select several values ranked first from the term density sequence, and construct the selected values into the maximum term density sequence as shown below:

$$\mathrm{MaxDensitySet} = \{\mathrm{MaxWdDensity}_1, \mathrm{MaxWdDensity}_2, \ldots, \mathrm{MaxWdDensity}_{nmax}, \ldots, \mathrm{MaxWdDensity}_{MaxNum}\}$$

Wherein, MaxDensitySet is the maximum entry density sequence; MaxNum is the number of values in the maximum entry density sequence, with MaxNum=WN×η_1; η_1 is the first selection ratio, which can be set to 0.2, 0.3, 0.4, or another value according to actual conditions; nmax is the serial number of each value in the maximum entry density sequence, 1≤nmax≤MaxNum; and MaxWdDensity_nmax is the nmax-th value in the maximum entry density sequence.

Then, according to a preset second selection ratio, select several values ranked last in the term density sequence, and construct the selected values into the minimum term density sequence as shown below:

$$\mathrm{MinDensitySet} = \{\mathrm{MinWdDensity}_1, \mathrm{MinWdDensity}_2, \ldots, \mathrm{MinWdDensity}_{nmin}, \ldots, \mathrm{MinWdDensity}_{MinNum}\}$$

Wherein, MinDensitySet is the minimum entry density sequence; MinNum is the number of values in the minimum entry density sequence, with MinNum=WN×η_2; η_2 is the second selection ratio, which can be set to 0.2, 0.3, 0.4, or another value according to actual conditions; nmin is the serial number of each value in the minimum entry density sequence, 1≤nmin≤MinNum; and MinWdDensity_nmin is the nmin-th value in the minimum entry density sequence.

Then construct the median term density sequence as shown below:

$$\mathrm{MidDensitySet} = \{\mathrm{MidWdDensity}_1, \mathrm{MidWdDensity}_2, \ldots, \mathrm{MidWdDensity}_{nmid}, \ldots, \mathrm{MidWdDensity}_{MidNum}\}$$

Wherein, MidDensitySet is the median entry density sequence, with MidDensitySet=DensitySet-MaxDensitySet-MinDensitySet; MidNum is the number of values in the median entry density sequence, with MidNum=WN×(1-η_1-η_2); nmid is the serial number of each value in the median entry density sequence, 1≤nmid≤MidNum; and MidWdDensity_nmid is the nmid-th value in the median entry density sequence.
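The three-way split of the sorted density sequence can be sketched as below. The text does not say how WN×η is rounded when it is not an integer, so truncation toward zero is an assumption, and the function name `split_density_sequence` is hypothetical. The sketch stops at the split; it does not attempt the threshold formula, which is not reproduced in this text.

```python
def split_density_sequence(densities, eta1, eta2):
    """Return (MaxDensitySet, MidDensitySet, MinDensitySet) as lists."""
    if eta1 < 0 or eta2 < 0 or eta1 + eta2 > 1:
        raise ValueError("selection ratios must be non-negative and sum to at most 1")
    ordered = sorted(densities, reverse=True)   # largest to smallest
    wn = len(ordered)
    max_num = int(wn * eta1)    # assumed rounding: truncate
    min_num = int(wn * eta2)    # assumed rounding: truncate
    max_set = ordered[:max_num]
    mid_set = ordered[max_num:wn - min_num]
    min_set = ordered[wn - min_num:]
    return max_set, mid_set, min_set
```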

Finally, calculate the first threshold according to the following formula:

Where λ is a preset coefficient, and λ>0, FstThresh is the first threshold.

The setting process of the second threshold is similar to the setting process of the first threshold. It is only necessary to replace the density of entries appearing therein with uniformity. For details, please refer to the above content, which will not be repeated here.

Step S104: Select a target server from a preset server group according to the core word subset.

The target server is a server used to archive the legal text. In this embodiment, the server group may include three servers, which are respectively used to archive legal texts in the three legal fields of civil, criminal, and administrative.

As shown in FIG. 3, step S104 may specifically include:

Step S1041, respectively query the first feature vector of each word in the core word subset in the preset first word list.

The first feature vector of each word is composed of T components, each dimension corresponding to the feature value of one server. T is an integer greater than 1. When all legal texts are divided into the three legal fields of civil, criminal, and administrative law, T=3.

The terms used in legal texts of different legal fields tend to differ considerably: some words appear frequently in one legal field but rarely in others. This embodiment uses this characteristic to establish the first word list in advance through the big data analysis process shown in FIG. 4:

Step S10411: Perform word segmentation processing on each legal text in the preset general legal text database to obtain each word that composes the general legal text database.

The general legal text database includes the legal text databases corresponding to the various legal fields, and contains as many legal texts as possible from a certain statistical time period. The statistical time period can be set according to the actual situation; for example, it can be set to the week, month, quarter, or year preceding the current moment.

All legal texts in the general legal text database are divided into several legal text databases according to the legal field to which they belong, and each legal text database corresponds to one legal field. For example, the general legal text database can be divided into a civil law text database, a criminal law text database, an administrative law text database, and so on. Correspondingly, each legal text database also corresponds to the server that archives that legal field.

The process of word segmentation is similar to the process in step S102. For details, please refer to the description in step S102, which will not be repeated here.

Step S10412: Count the number of occurrences, in each legal text database, of each word constituting the general legal text database.

In this embodiment, the number of occurrences of each word constituting the legal text database in each legal text database can be recorded as a sequence format as shown below:

$$\mathrm{WNSeq}_{sw} = (\mathrm{WordNum}_{sw,1}, \mathrm{WordNum}_{sw,2}, \ldots, \mathrm{WordNum}_{sw,t}, \ldots, \mathrm{WordNum}_{sw,T})$$

Where t is the serial number of each server in the server group (that is, the serial number of the legal text database), 1≤t≤T; sw is the serial number of each word composing the general legal text database, 1≤sw≤SWN; SWN is the total number of words that make up the general legal text database; WordNum_{sw,t} is the number of times the sw-th word appears in the legal text database corresponding to the t-th server; and WNSeq_{sw} is the sequence of the numbers of occurrences of the sw-th word in each legal text database.
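The per-database counting of step S10412 can be sketched as follows. Each database is represented as a list of already segmented texts; that representation and the function name `occurrence_sequences` are assumptions made for illustration.

```python
from collections import Counter


def occurrence_sequences(databases):
    """Map each word to its tuple of counts, one entry per legal text database.

    databases: list of T databases; each database is a list of texts,
    and each text is a list of words.
    """
    per_db = [Counter(w for text in db for w in text) for db in databases]
    vocabulary = set().union(*per_db)   # every word seen in any database
    # Counter returns 0 for words a database never contains.
    return {w: tuple(c[w] for c in per_db) for w in vocabulary}
```

With three databases (civil, criminal, administrative), each word maps to a 3-tuple that corresponds to its occurrence sequence.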

Step S10413: Calculate the feature values corresponding to each server and each word that composes the legal text database.

Specifically, the feature values corresponding to each server and each word constituting the legal text database can be calculated separately according to the following formula:

Among them, ln is a natural logarithmic function, and EigVal _sw,t is the feature value corresponding to the t-th server and the sw-th word that constitutes the total legal text database.

As can be seen from this formula, EigVal_{sw,t} is positively correlated with WordNum_{sw,t}; that is, the more times a word appears in the legal text database corresponding to a server, the higher the feature value of that word for that server.

Step S10414: Construct a first feature vector of each word composing the legal text database.

Specifically, the first feature vector of each word composing the legal text database can be constructed according to the following formula:

$$\mathrm{EigVec}_{sw} = (\mathrm{EigVal}_{sw,1}, \mathrm{EigVal}_{sw,2}, \ldots, \mathrm{EigVal}_{sw,t}, \ldots, \mathrm{EigVal}_{sw,T})$$

Among them, EigVec _sw is the first feature vector of the sw-th word constituting the general database of legal texts.

Step S10415: Assemble the first feature vectors of all words constituting the general legal text database into the first word list.

Through the process shown in FIG. 4, the process of setting the first word list can be completed, which provides a basis for the subsequent filing of legal texts.

Step S1042, according to the first feature vector of each word in the core word subset, respectively calculate the probability value of the legal text filed into each server in the server group.

Specifically, the probability value of the legal text filed into each server in the server group can be calculated according to the following formula:

Where c is the serial number of each word in the core word subset, 1≤c≤CoreNum; CoreNum is the number of words in the core word subset; EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server; and LawDom_t is the probability value of the legal text being filed into the t-th server.

Step S1043: Determine the server with the largest probability value as the target server.

Specifically, the target server can be selected according to the following formula:

$$\mathrm{TgtLawDom} = \mathrm{Argmax}(\mathrm{LawDomSq}) = \mathrm{Argmax}(\mathrm{LawDom}_1, \mathrm{LawDom}_2, \ldots, \mathrm{LawDom}_t, \ldots, \mathrm{LawDom}_T)$$

Where Argmax is the function that returns the argument of the maximum, LawDomSq is the first probability value sequence of the legal text, with LawDomSq=(LawDom_1, LawDom_2, ..., LawDom_t, ..., LawDom_T), and TgtLawDom is the serial number of the target server.
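The selection step reduces to an argmax over the probability value sequence. The sketch below returns a 1-based serial number to match the numbering 1≤t≤T used in the text; breaking ties in favor of the lowest serial number is an assumption, since the text does not discuss ties. The function name `select_target` is hypothetical.

```python
def select_target(probabilities):
    """Return the 1-based serial number of the largest probability value."""
    if not probabilities:
        raise ValueError("the probability sequence is empty")
    # max() keeps the first maximal index, so ties go to the lowest number.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best + 1
```

The same function applies unchanged to the later step that picks the disk partition with the largest probability value.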

Step S105: Select an auxiliary word subset from the word set.

The auxiliary word subset includes each word whose ratio of first word frequency to second word frequency is greater than a preset third threshold, where the first word frequency is the word's frequency of appearance in the legal text and the second word frequency is its frequency of appearance in the legal text database corresponding to the target server.

Specifically, first, the first word frequency of each word in the word set can be calculated separately according to the following formula:

Wherein, FstFrq _w is the first word frequency of the wth word in the word set.

Then, calculate the second word frequency of each word in the word set according to the following formula:

Wherein, LibWdNum _w is the number of times the w-th word in the word set appears in the legal text database corresponding to the target server, and SndFrq _w is the second word frequency of the w-th word in the word set.

Finally, each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold is selected from the word set to form the auxiliary word subset.
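Neither word frequency formula is reproduced in this text. The sketch below assumes the usual relative frequencies: a word's count in the legal text divided by the total word count of the text, and its count in the target server's legal text database divided by the total word count of that database. How to treat a word that never occurs in the database is also not stated; skipping such words is an assumption. The function name `auxiliary_words` is hypothetical.

```python
from collections import Counter


def auxiliary_words(text_words, database_counts, third_threshold):
    """Select words over-represented in the text relative to the database."""
    text_counts = Counter(text_words)
    text_total = sum(text_counts.values())
    db_total = sum(database_counts.values())
    selected = set()
    for word, n in text_counts.items():
        db_n = database_counts.get(word, 0)
        if db_n == 0:
            continue                    # assumed: ratio undefined, skip
        first_freq = n / text_total     # assumed FstFrq_w
        second_freq = db_n / db_total   # assumed SndFrq_w
        if first_freq / second_freq > third_threshold:
            selected.add(word)
    return selected
```

Words that are common across the whole field score a ratio near or below 1, while words that are unusually frequent in this particular text score well above it, which is what makes them useful for picking a category within the field.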

The process of setting the third threshold is similar to the process of setting the first threshold; it is only necessary to replace the term density with the ratio of the first word frequency to the second word frequency. For details, please refer to the above content, which will not be repeated here.

Step S106: Determine the target partition of the legal text in the target server according to the auxiliary word subset.

The target partition is a disk partition used to archive the legal text. In this embodiment, each legal field can be subdivided into multiple categories. Taking the civil legal field as an example, it can be divided into the following eight categories:

(1) Disputes between citizens, and between citizens and legal persons, over property rights, mostly disputes over the possession, use, profit, and disposal of property.

(2) Disputes between citizens arising from contractual acts such as sale, lease, loan, gift, and pawn, and disputes arising from inheritance.

(3) Debt disputes arising from unjust enrichment, management without cause, and the like, and compensation disputes caused by damage to property.

(4) Disputes over personal rights, mainly infringement of citizens' rights to health, name, reputation, honor, and portrait.

(5) Disputes caused by infringement of citizens' invention rights (patent rights) and copyrights.

(6) Disputes arising from marriage and family, mainly including divorce and the property division caused by divorce, disputes over child support, and disputes over support and upbringing among family members.

(7) Disputes arising from economic contracts, enterprise labor and employment, enterprise contracting, land contracting, neighboring rights, etc.

(8) Other civil litigation cases that shall be accepted by the people's court as prescribed by law or by the judicial interpretation documents of the Supreme People's Court.

In this embodiment, each server can be divided into a number of disk partitions, and each disk partition is used to archive one category of legal text.

As shown in FIG. 5, step S106 may specifically include:

Step S1061, respectively query the second feature vector of each word in the auxiliary word subset in the preset second word list.

Wherein, the second feature vector of each word is composed of components of ST dimensions, each dimension corresponds to a feature value of a disk partition, and ST is the total number of disk partitions in the target server.

The setting process of the second word list is similar to the setting process of the first word list shown in FIG. 4, and the legal text library corresponding to the target server includes legal text sub-libraries corresponding to each disk partition. Firstly, count the number of occurrences of each word in the legal text library in each legal text sub-database, and then calculate the feature value of each word and each disk partition in the target server according to the following formula:

Wherein, st is the serial number of each disk partition in the target server, 1≤st≤ST; WordNum_{sw,st} is the number of times the sw-th word appears in the legal text sub-library corresponding to the st-th disk partition in the target server; and EigVal_{sw,st} is the feature value of the sw-th word corresponding to the st-th disk partition in the target server.

Finally, construct the second feature vector of each word composing the legal text database according to the following formula, and construct the second feature vector of each word composing the legal text database as the second word list:

$$\mathrm{SdEigVec}_{sw} = (\mathrm{EigVal}_{sw,1}, \mathrm{EigVal}_{sw,2}, \ldots, \mathrm{EigVal}_{sw,st}, \ldots, \mathrm{EigVal}_{sw,ST})$$

Wherein, SdEigVec _sw is the second feature vector of the sw-th word constituting the general database of legal texts.

Step S1062, according to the second feature vector of each word in the auxiliary word subset, respectively calculate the probability value of the legal text belonging to each disk partition in the target server.

Specifically, the probability value of the legal text belonging to each disk partition in the target server can be calculated according to the following formula:

Wherein, sub is the serial number of each word in the auxiliary word subset, 1≤sub≤SubNum; SubNum is the number of words in the auxiliary word subset; EigVal_{sub,st} is the feature value of the sub-th word in the auxiliary word subset corresponding to the st-th disk partition in the target server; and LawType_st is the probability value of the legal text belonging to the st-th disk partition in the target server.

Step S1063: Determine the disk partition with the largest probability value as the target partition of the legal text in the target server.

Specifically, the target partition of the legal text in the target server can be selected according to the following formula:

$$\mathrm{TgtLawType} = \mathrm{Argmax}(\mathrm{LawTypeSq}) = \mathrm{Argmax}(\mathrm{LawType}_1, \mathrm{LawType}_2, \ldots, \mathrm{LawType}_{st}, \ldots, \mathrm{LawType}_{ST})$$

Where LawTypeSq is the second probability value sequence of the legal text, with LawTypeSq=(LawType_1, LawType_2, ..., LawType_st, ..., LawType_ST), and TgtLawType is the serial number of the target partition of the legal text in the target server.

Step S107: File the legal text into the target partition in the target server.

In summary, in the embodiments of this application, after the relevant instruction is received, the legal text is obtained automatically; a core word subset that effectively represents the legal text is selected from it by means of automatic text analysis, and the server for archiving the legal text (that is, the target server) is determined on this basis; an auxiliary word subset is then selected from the word set, and the disk partition for archiving the legal text (that is, the target partition) is determined accordingly; finally, the legal text is filed into the target partition in the target server. In this way, each legal text is archived into a disk partition of a server according to its actual core content. When a user needs to query related materials, the user only needs to search the disk partition of the corresponding server, which saves labor costs and greatly improves work efficiency.

Corresponding to the legal text filing method described in the above embodiment, FIG. 6 shows a structural diagram of an embodiment of a legal text filing device provided in an embodiment of the present application.

In this embodiment, a legal document filing device may include:

The legal text obtaining module 601 is configured to receive a legal text filing instruction, extract the target address in the legal text filing instruction, and obtain the legal text in the target address;

The word segmentation processing module 602 is configured to perform word segmentation processing on the legal text to obtain a set of words that make up the legal text;

The core word subset selection module 603 is configured to select a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;

The target server determining module 604 is configured to select a target server from a preset server group according to the core word subset, where the target server is a server for archiving the legal text;

The auxiliary word subset selection module 605 is configured to select an auxiliary word subset from the word set, the auxiliary word subset including each word whose ratio of first word frequency to second word frequency is greater than a preset third threshold, where the first word frequency is the word's frequency of appearance in the legal text and the second word frequency is its frequency of appearance in the legal text database corresponding to the target server;

A partition determining module 606, configured to determine a target partition of the legal text in the target server according to the auxiliary word subset, where the target partition is a disk partition for archiving the legal text;

The archiving module 607 is configured to archive the legal text into the target partition in the target server.
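The module chain 601 to 607 can be pictured as seven pluggable callables run in order. The sketch below is only a structural illustration: the class name `LegalTextFilingDevice`, the field names, and the callable signatures are all hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class LegalTextFilingDevice:
    obtain_text: Callable[[dict], str]                  # module 601
    segment: Callable[[str], List[str]]                 # module 602
    select_core: Callable[[str, List[str]], Set[str]]   # module 603
    pick_server: Callable[[Set[str]], int]              # module 604
    select_aux: Callable[[List[str], int], Set[str]]    # module 605
    pick_partition: Callable[[Set[str], int], int]      # module 606
    archive: Callable[[str, int, int], None]            # module 607

    def file(self, instruction: dict) -> Tuple[int, int]:
        text = self.obtain_text(instruction)
        words = self.segment(text)
        core = self.select_core(text, words)
        server = self.pick_server(core)
        # The auxiliary subset depends on the chosen server's database.
        aux = self.select_aux(words, server)
        partition = self.pick_partition(aux, server)
        self.archive(text, server, partition)
        return server, partition
```

Passing the server number into the auxiliary word selection reflects the dependency stated for module 605: the second word frequency is measured against the database of the server chosen by module 604.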

Further, the core word subset selection module may include:

The term density calculation unit is used to calculate the term density of each word in the word set;

The text paragraph dividing unit is used to divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, and FN is an integer greater than one;

A uniformity calculation unit for calculating the uniformity of each word in the word set;

The core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.

Further, the target server determining module may include:

The first feature vector query unit is configured to query the first feature vector of each word in the core word subset in the preset first word list;

A probability value calculation unit, configured to calculate the probability value of the legal text filed into each server in the server group according to the first feature vector of each word in the core word subset;

The target server determining unit is configured to determine the server with the largest probability value as the target server.

Further, the auxiliary word subset selection module may include:

The first word frequency calculation unit is configured to calculate the first word frequency of each word in the word set;

The second word frequency calculation unit is configured to calculate the second word frequency of each word in the word set;

The auxiliary word subset selection unit is configured to select, from the word set, each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold value to form the auxiliary word subset.

Those skilled in the art can clearly understand that, for the convenience and conciseness of the description, the specific working processes of the above described devices, modules and units can refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

FIG. 7 shows a schematic block diagram of a terminal device provided by an embodiment of the present application. For ease of description, only parts related to the embodiment of the present application are shown.

In this embodiment, the terminal device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The terminal device 7 may include: a processor 70, a memory 71, and computer-readable instructions 72 stored in the memory 71 and running on the processor 70, for example, a computer-readable instruction for executing the above-mentioned legal text filing method instruction. When the processor 70 executes the computer-readable instructions 72, the steps in the foregoing legal document filing method embodiments are implemented.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed, they may include the processes of the above-mentioned method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for filing legal texts, characterized in that it comprises:

Receiving a legal text filing instruction, extracting the target address in the legal text filing instruction, and obtaining the legal text at the target address;

Performing word segmentation on the legal text to obtain a word set constituting the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;

Selecting a target server from a preset server group according to the core word subset, where the target server is a server for archiving the legal text;

Selecting an auxiliary word subset from the word set, the auxiliary word subset including each word whose ratio of a first word frequency to a second word frequency is greater than a preset third threshold, where the first word frequency is the frequency of occurrence in the legal text, and the second word frequency is the frequency of occurrence in the legal text database corresponding to the target server;

Determining a target partition of the legal text in the target server according to the auxiliary word subset, where the target partition is a disk partition for archiving the legal text;

Archiving the legal text into the target partition in the target server.
The legal text filing method according to claim 1, wherein said selecting a core word subset from the word set comprises:

Calculating the term density of each word in the word set according to the following formula:

Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, WdNum_w is the number of occurrences of the w-th word in the word set in the legal text, LineNum is the total number of lines in the legal text, and WdDensity_w is the term density of the w-th word in the word set;

Dividing the legal text into FN text paragraphs, and respectively counting the occurrences of each word in the word set in each text paragraph, where FN is an integer greater than 1;

Calculating the uniformity of each word in the word set according to the following formula:

Where f is the serial number of each text paragraph of the legal text, 1≤f≤FN, Flag_{w,f} is the flag bit indicating whether the w-th word in the word set occurs in the f-th text paragraph, and WdEqu_w is the uniformity of the w-th word in the word set;

Selecting, from the word set, each word whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
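As an illustration of the core word selection described above, the following Python sketch computes both measures and applies the two thresholds. The formulas themselves are not reproduced in this text, so the sketch assumes that the term density is the number of occurrences divided by the total number of lines, and that the uniformity is the fraction of the FN text paragraphs in which the word occurs. Splitting the text into contiguous chunks of near-equal line count is likewise an assumption, and the function names are hypothetical.

```python
from collections import Counter


def split_into_paragraphs(lines, fn):
    """Split lines into fn contiguous chunks of near-equal size (assumed scheme)."""
    size, extra = divmod(len(lines), fn)
    chunks, start = [], 0
    for f in range(fn):
        end = start + size + (1 if f < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def select_core_words(lines, fn, first_threshold, second_threshold):
    """lines is a list of already-segmented lines, each a list of words."""
    if fn <= 1 or not lines:
        raise ValueError("FN must be greater than 1 and the text must not be empty")
    line_num = len(lines)
    counts = Counter(word for line in lines for word in line)
    paragraph_vocab = [
        {word for line in chunk for word in line}
        for chunk in split_into_paragraphs(lines, fn)
    ]
    core = set()
    for word, wd_num in counts.items():
        density = wd_num / line_num  # assumed form of the term density
        # Assumed form of the uniformity: share of paragraphs containing the word.
        uniformity = sum(word in vocab for vocab in paragraph_vocab) / fn
        if density > first_threshold and uniformity > second_threshold:
            core.add(word)
    return core


lines = [
    ["lease", "tenant"],
    ["lease", "rent"],
    ["lease", "court"],
    ["lease", "tenant"],
]
core_words = select_core_words(lines, fn=2, first_threshold=0.4, second_threshold=0.6)
print(sorted(core_words))
```

In this toy input, "rent" and "court" each occur in only one of the two paragraphs and on a single line, so they fail both thresholds, while "lease" and "tenant" pass.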
The legal text filing method according to claim 1, wherein said selecting a target server from a preset server group according to the core word subset comprises:

Querying the first feature vector of each word in the core word subset in a preset first word list, where the first feature vector of each word is composed of components in T dimensions, each dimension corresponds to the feature value of one server, and T is an integer greater than 1;

Respectively calculating the probability values of the legal text filed into each server in the server group according to the first feature vector of each word in the core word subset;

The server with the largest probability value is determined as the target server.
The legal text filing method according to claim 3, wherein said respectively calculating, according to the first feature vector of each word in the core word subset, the probability values of the legal text being filed into each server in the server group comprises:

Calculating the probability value of the legal text being filed into each server in the server group according to the following formula:

Where t is the serial number of each server in the server group, 1≤t≤T, c is the serial number of each word in the core word subset, 1≤c≤CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being filed into the t-th server.
The legal text filing method according to any one of claims 1 to 4, wherein said selecting an auxiliary word subset from the word set comprises:

Respectively calculating the first word frequency of each word in the word set;

Respectively calculating the second word frequency of each word in the word set;

Each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold is selected from the word set to form the auxiliary word subset.
A device for filing legal texts, characterized in that it comprises:

The legal text acquisition module is configured to receive a legal text filing instruction, extract the target address in the legal text filing instruction, and obtain the legal text in the target address;

The word segmentation processing module is used to perform word segmentation processing on the legal text to obtain a set of words constituting the legal text;

The core word subset selection module is configured to select a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;

A target server determining module, configured to select a target server from a preset server group according to the core word subset, where the target server is a server for archiving the legal text;

The auxiliary word subset selection module is configured to select an auxiliary word subset from the word set, the auxiliary word subset including each word whose ratio of a first word frequency to a second word frequency is greater than a preset third threshold, where the first word frequency is the frequency of occurrence in the legal text, and the second word frequency is the frequency of occurrence in the legal text database corresponding to the target server;

A partition determining module, configured to determine a target partition of the legal text in the target server according to the auxiliary word subset, where the target partition is a disk partition for archiving the legal text;

The archiving module is used for archiving the legal text into the target partition in the target server.
The legal text filing device according to claim 6, wherein the core word subset selection module comprises:

The term density calculation unit is used to calculate the term density of each word in the word set;

The text paragraph dividing unit is used to divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, and FN is an integer greater than one;

A uniformity calculation unit for calculating the uniformity of each word in the word set;

The core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.
The legal text filing device according to claim 6, wherein the target server determining module comprises:

The first feature vector query unit is configured to query the first feature vector of each word in the core word subset in a preset first word list, where the first feature vector of each word is composed of components in T dimensions, each dimension corresponds to the feature value of one server, and T is an integer greater than 1;

A probability value calculation unit, configured to calculate the probability value of the legal text filed into each server in the server group according to the first feature vector of each word in the core word subset;

The target server determining unit is configured to determine the server with the largest probability value as the target server.
The legal text filing device according to claim 8, wherein the probability value calculation unit is specifically configured to calculate the probability value of the legal text filed into each server in the server group according to the following formula:

Where t is the serial number of each server in the server group, 1≤t≤T, c is the serial number of each word in the core word subset, 1≤c≤CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being filed into the t-th server.
The legal text filing device according to any one of claims 6 to 9, wherein the auxiliary word subset selection module comprises:

The first word frequency calculation unit is configured to calculate the first word frequency of each word in the word set;

The second word frequency calculation unit is configured to calculate the second word frequency of each word in the word set;

The auxiliary word subset selection unit is configured to select, from the word set, each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold value to form the auxiliary word subset.
A computer non-volatile readable storage medium, the computer non-volatile readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following steps:

Receiving a legal text filing instruction, extracting the target address in the legal text filing instruction, and obtaining the legal text at the target address;

Performing word segmentation on the legal text to obtain a word set constituting the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;

Selecting a target server from a preset server group according to the core word subset, where the target server is a server for archiving the legal text;

Selecting an auxiliary word subset from the word set, the auxiliary word subset including each word whose ratio of a first word frequency to a second word frequency is greater than a preset third threshold, where the first word frequency is the frequency of occurrence in the legal text, and the second word frequency is the frequency of occurrence in the legal text database corresponding to the target server;

Determining a target partition of the legal text in the target server according to the auxiliary word subset, where the target partition is a disk partition for archiving the legal text;

Archiving the legal text into the target partition in the target server.
The computer non-volatile readable storage medium according to claim 11, wherein said selecting a core word subset from the word set comprises:

Respectively calculating the term density of each word in the word set;

Dividing the legal text into FN text paragraphs, and respectively counting the occurrences of each word in the word set in each text paragraph, where FN is an integer greater than 1;

Respectively calculating the uniformity of each word in the word set;

Selecting, from the word set, each word whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
The computer non-volatile readable storage medium according to claim 11, wherein said selecting a target server from a preset server group according to the core word subset comprises:

Querying the first feature vector of each word in the core word subset in a preset first word list, where the first feature vector of each word is composed of components in T dimensions, each dimension corresponds to the feature value of one server, and T is an integer greater than 1;

Respectively calculating the probability values of the legal text filed into each server in the server group according to the first feature vector of each word in the core word subset;

The server with the largest probability value is determined as the target server.
The computer non-volatile readable storage medium according to claim 13, wherein said respectively calculating, according to the first feature vector of each word in the core word subset, the probability values of the legal text being filed into each server in the server group comprises:

Calculating the probability value of the legal text being filed into each server in the server group according to the following formula:

Where t is the serial number of each server in the server group, 1≤t≤T, c is the serial number of each word in the core word subset, 1≤c≤CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being filed into the t-th server.
The computer non-volatile readable storage medium according to any one of claims 11 to 14, wherein said selecting an auxiliary word subset from the word set comprises:

Respectively calculating the first word frequency of each word in the word set;

Respectively calculating the second word frequency of each word in the word set;

Each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold is selected from the word set to form the auxiliary word subset.
A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:

Receiving a legal text filing instruction, extracting the target address in the legal text filing instruction, and obtaining the legal text at the target address;

Performing word segmentation on the legal text to obtain a word set constituting the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and whose uniformity is greater than a preset second threshold;

Selecting a target server from a preset server group according to the core word subset, where the target server is a server for archiving the legal text;

Selecting an auxiliary word subset from the word set, the auxiliary word subset including each word whose ratio of a first word frequency to a second word frequency is greater than a preset third threshold, where the first word frequency is the frequency of occurrence in the legal text, and the second word frequency is the frequency of occurrence in the legal text database corresponding to the target server;

Determining a target partition of the legal text in the target server according to the auxiliary word subset, where the target partition is a disk partition for archiving the legal text;

Archiving the legal text into the target partition in the target server.
The terminal device according to claim 16, wherein said selecting a core word subset from the word set comprises:

Respectively calculating the term density of each word in the word set;

Dividing the legal text into FN text paragraphs, and respectively counting the occurrences of each word in the word set in each text paragraph, where FN is an integer greater than 1;

Respectively calculating the uniformity of each word in the word set;

Selecting, from the word set, each word whose term density is greater than the first threshold and whose uniformity is greater than the second threshold to form the core word subset.
The terminal device according to claim 16, wherein said selecting a target server from a preset server group according to the core word subset comprises:

Querying the first feature vector of each word in the core word subset in a preset first word list, where the first feature vector of each word is composed of components in T dimensions, each dimension corresponds to the feature value of one server, and T is an integer greater than 1;

Respectively calculating the probability values of the legal text filed into each server in the server group according to the first feature vector of each word in the core word subset;

The server with the largest probability value is determined as the target server.
The terminal device according to claim 18, wherein said respectively calculating, according to the first feature vector of each word in the core word subset, the probability values of the legal text being filed into each server in the server group comprises:

Calculating the probability value of the legal text being filed into each server in the server group according to the following formula:

Where t is the serial number of each server in the server group, 1≤t≤T, c is the serial number of each word in the core word subset, 1≤c≤CoreNum, CoreNum is the number of words in the core word subset, EigVal_{c,t} is the feature value of the c-th word in the core word subset corresponding to the t-th server, and LawDom_t is the probability value of the legal text being filed into the t-th server.
The terminal device according to any one of claims 16 to 19, wherein said selecting an auxiliary word subset from the word set comprises:

Respectively calculating the first word frequency of each word in the word set;

Respectively calculating the second word frequency of each word in the word set;

Each word whose ratio of the first word frequency to the second word frequency is greater than the third threshold is selected from the word set to form the auxiliary word subset.