WO2021042511A1

WO2021042511A1 - Legal text storage method and device, readable storage medium and terminal device

Info

Publication number: WO2021042511A1
Application number: PCT/CN2019/116635
Authority: WO
Inventors: 周剀; 周萌
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-09-03
Filing date: 2019-11-08
Publication date: 2021-03-11
Also published as: CN110765230A; CN110765230B

Abstract

Provided are a legal text storage method and device, a computer non-volatile readable storage medium and a terminal device. According to the method, after a related instruction is received, a legal text is automatically obtained, a core word subset that can effectively represent the core content of the legal text is automatically selected from the legal text through automatic text analysis, a vector distance between the core word subset and each feature word set is calculated by means of word vectors, the vector distance is taken as the basis for determining a storage partition in which the legal text should be stored, the storage partition corresponding to the feature word set with the minimum vector distance from the core word subset is selected as a preferred storage partition, and the legal text is stored in the preferred storage partition. When a user needs to query related material, the user only needs to search in the corresponding storage partition, which saves labor costs, and greatly improves the working efficiency.

Description

Method, device, readable storage medium and terminal equipment for storing legal text

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on September 3, 2019, the application number is 201910826805.4, and the invention title is "a legal text storage method, device, readable storage medium, and terminal equipment". The entire content is incorporated into this application by reference.

Technical field

This application belongs to the field of computer technology, and in particular relates to a legal text storage method, device, computer non-volatile readable storage medium, and terminal equipment.

Background technique

Legal practitioners tend to accumulate a large number of legal texts in their daily legal work. The prior art provides a variety of methods to store these legal texts in an orderly manner. For example, they can be sorted in ascending or descending order according to time, size, name, etc. Storage. Although this storage method can make these legal texts look orderly, it does not take into account the inherent relevance of these legal texts, which is not convenient for users to query. When users need to query related materials, they often need to do it one by one. Checking, consumes a lot of manpower costs, and is extremely inefficient.

technical problem

In view of this, the embodiments of the present application provide a legal text storage method, device, computer non-volatile readable storage medium, and terminal equipment to solve the problem that the existing legal text storage is inconvenient for users to query.

Technical solutions

The first aspect of the embodiments of the present application provides a legal text storage method, which may include:

Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;

Perform word segmentation processing on the legal text to obtain a collection of words that make up the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;

Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector

Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;

The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.

The second aspect of the embodiments of the present application provides a legal text storage device, which may include a module for implementing the steps of the foregoing legal text storage method.

A third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium, the computer non-volatile readable storage medium stores computer readable instructions, and the computer readable instructions are executed by a processor When realizing the steps of the above legal text storage method.

The fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and running on the processor, and the processor executes the computer The steps of the above legal text storage method are realized when the instructions are readable.

Beneficial effect

In the embodiment of this application, the actual core content of the legal text is stored, and legal texts with similar content will be stored in the same storage partition. When the user needs to query related materials, he only needs to search in the corresponding storage partition. However, the labor cost is saved, and the work efficiency is greatly improved.

Description of the drawings

FIG. 1 is a flowchart of an embodiment of a method for storing legal text in an embodiment of this application;

Figure 2 is a schematic flow chart of selecting a core word subset from a word set;

FIG. 3 is a schematic flowchart of the setting process of the first word vector database;

4 is a structural diagram of an embodiment of a legal text storage device in an embodiment of the application;

Fig. 5 is a schematic block diagram of a terminal device in an embodiment of the application.

Embodiments of the present invention

Referring to FIG. 1, an embodiment of a method for storing legal text in an embodiment of the present application may include:

Step S101: Receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address.

The legal texts include, but are not limited to, texts in legal provisions, legal essays, legal reports, legal analysis articles, indictments, rulings, and other legal-related materials.

When a user needs to store a certain legal text, he can issue a legal text storage instruction to a preset terminal device through a human-computer interaction interface. The legal text storage instruction carries the address where the legal text is currently located, that is, The target address. The target address may be a certain storage address in the terminal device, or a certain storage address in the network or a designated database. The terminal device is the implementation subject of this embodiment. After receiving the legal text storage instruction, the terminal device can extract the target address from it, and obtain the target address from the local, network, or designated address according to the target address. The legal text is obtained from the database.

Step S102: Perform word segmentation processing on the legal text to obtain a set of words constituting the legal text.

In the process of storing the legal text, the terminal device will first perform word segmentation processing on it to obtain a set of words that constitute the legal text. Word segmentation refers to dividing the legal text into individual words. In this embodiment, the general dictionary and the legal dictionary can be combined to segment the legal text, that is, the legal dictionary is used to split the legal text. The legal text is segmented in the first round, and then the general dictionary is used to segment the remaining legal texts after the first round of segmentation. In this way, the legal-specific terms are firstly segmented, and then the general terms are segmented. , For legal texts that cannot be distinguished neither legal terms nor general terms, single words are separated.

Step S103: Select a core word subset from the word set.

The core word subset includes each word whose term density is greater than the preset first threshold and the uniformity is greater than the preset second threshold.

As shown in FIG. 2, step S103 may specifically include the following steps:

Step S1031, respectively calculate the entry density of each word in the word set.

Specifically, the entry density of each word in the word set can be calculated according to the following formula:

Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum _w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines of the legal text, and WdDensity _w is the entry density of the w-th word in the word set.

Step S1032: Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph.

FN is an integer greater than 1. The text paragraphs can be divided according to specific conditions. In a specific implementation of this embodiment, each KN line in the legal text can be regarded as a text paragraph, that is, the first line to the KN line in the legal text As the first text paragraph, take line KN+1 to line 2×KN in the legal text as the second text paragraph, and change line 2×KN+1 to line 3× in the legal text Line KN is used as the third text paragraph, and so on. Then there are:

Among them, Ceil is a round-up function. The value of KN can be set according to specific conditions, for example, it can be set to 3, 5, 10 or other values and so on.

Step S1033: Calculate the uniformity of each word in the word set respectively.

Specifically, the uniformity of each word in the word set can be calculated according to the following formula:

Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag _w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And

WdEqu _w is the uniformity of the wth word in the word set.

Step S1034: Select each word with a word density greater than the first threshold and a uniformity greater than the second threshold from the word set to form the core word subset.

The specific values of the first threshold and the second threshold may be set according to actual conditions.

In a specific implementation of this embodiment, the following entry density sequence can be constructed first according to the order of value from largest to smallest:

DensitySet={WdDensity ₁ , WdDensity ₂ , ..., WdDensity _w , ..., WdDensity _WN }

Wherein, DensitySet is the term density sequence.

Then, according to the preset first selection ratio, select several values ranked first from the term density sequence, and construct the selected values into the maximum term density sequence as shown below:

MaxDensitySet={MaxWdDensity ₁ , MaxWdDensity ₂ , ..., MaxWdDensity _nmax , ..., MaxWdDensity _MaxNum }

Wherein, MaxDensitySet is the maximum entry density sequence, MaxNum is the number of values in the maximum entry density sequence, and MaxNum=WN×η ₁ , η ₁ is the first selection ratio, which can be based on actual conditions Set it to 0.2, 0.3, 0.4 or other values, nmax is the value sequence number in the maximum entry density sequence, 1≤nmax≤MaxNum, MaxWdDensity _nmax is the nmax of the maximum entry density sequence value.

Then, according to a preset second selection ratio, select several values that are ranked in the end of the term density sequence, and construct the selected values into the minimum term density sequence as shown below:

MinDensitySet={MinWdDensity ₁ , MinWdDensity ₂ , ..., MinWdDensity _nmin , ..., MinWdDensity _MinNum }

Wherein, MinDensitySet is the minimum entry density sequence, MinNum is the number of values in the minimum entry density sequence, and MaxNum=WN×η ₂ , η ₂ is the second selection ratio, which can be based on actual conditions Set it to 0.2, 0.3, 0.4 or other values, nmin is the value sequence number in the minimum entry density sequence, 1≤nmin≤MinNum, and MinWdDensity _nmin is the nminth value of the minimum entry density sequence value.

Then construct the median term density sequence as shown below:

MidDensitySet={MidWdDensity ₁ , MidWdDensity ₂ , ..., MidWdDensity _nmid , ..., MidWdDensity _MidNum }

Wherein, MidDensitySet is the median term density sequence, and MidDensitySet=DensitySet-MaxDensitySet-MinDensitySet, MidNum is the number of values in the median term density sequence, and MidNum=WN×(1-η _1- η ₂ ), nmid is the value sequence number in the median entry density sequence, 1≤nmid≤MidNum, and MidWdDensity _nmid is the nmid value in the median entry density sequence.

Finally, calculate the first threshold according to the following formula:

Where λ is a preset coefficient, and λ>0, FstThresh is the first threshold.

The setting process of the second threshold is similar to the setting process of the first threshold. It is only necessary to replace the density of entries appearing therein with uniformity. For details, please refer to the above content, which will not be repeated here.

Step S104: Obtain each feature word set corresponding to each storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Words vector of words.

In this embodiment, all legal texts can be divided into multiple storage partitions according to actual conditions. Here, the total number of storage partitions is recorded as TN. For example, all legal texts can be divided into three storage areas: civil, criminal, and administrative, that is, TN=3.

For each storage partition, the corresponding feature word set can be set in advance. For example, the feature word set corresponding to the civil storage partition can be set as: {civil, company, contract, liability, loan, compensation, interest, accident, insurance }, set the characteristic word set corresponding to the criminal storage partition as: {criminal, criminal, fixed-term imprisonment, life imprisonment, victim, sentence}, and set the characteristic word set corresponding to the administrative storage partition as: {administration, government, procedure , Trademark, property}. It should be noted that the above is only a specific example of the feature word set setting. In practical applications, other feature word sets can also be set according to the actual situation, which is not specifically limited in this embodiment.

Any word vector database is a database that records the correspondence between words and word vectors. The word vector may be a corresponding word vector obtained by training the word according to the word2vec model. That is, the probability of occurrence of the word is expressed according to the context information of the word. The training of word vectors is still based on the idea of word2vec. First, each word is represented as a 0-1 vector (one-hot) form, and then the word2vec model is trained with the word vector, and n-1 words are used to predict the nth word , The intermediate process obtained after the neural network model prediction is used as the word vector. Specifically, for example, the one-hot vector of "celebration" is assumed to be [1,0,0,0,...,0], and the one-hot vector of "meeting" is [0,1,0,0,... …,0], the one-hot vector for "smooth" is [0,0,1,0,……,0], the vector for predicting "closing" [0,0,0,1,……,0], The model is trained to generate the coefficient matrix W of the hidden layer. The product of the one-hot vector of each word and the coefficient matrix is the word vector of the word. The final form will be similar to "Celebrate [-0.28,0.34,-0.02, …...,0.92]" such a multi-dimensional vector.

Many open-source word vector databases are provided in the prior art, but these word vector databases are commonly used in various fields and are not specially set up for legal texts. Therefore, if they are used directly, the accuracy of the final classification results will be reduced, and if they are used according to the word2vec model Retraining a word vector database specifically for legal texts requires a lot of computational time. In this embodiment, the legal text is used to update the existing open source word vector database (referred to as the second word vector database here) to obtain a word vector database for the legal text (referred to here as the first word vector database). Vector database) method, the specific process is shown in Figure 3:

Step S1041, perform word segmentation processing on each piece of legal text in the preset legal text database, to obtain each word that composes the legal text database.

The legal text database contains as many legal texts as possible in a certain statistical time period. The statistical time period can be set according to the actual situation, for example, it can be set to a time period within a week, a month, a quarter, or a year from the current moment.

The process of word segmentation is similar to the process in step S101. For details, please refer to the description in step S101, which will not be repeated here.

Step S1042: Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word.

The target word is any word that composes the legal text database. The related words are words whose intervals with the target words in the legal text database are less than a preset interval threshold. The interval threshold can be set according to actual conditions. For example, it can be set to 3 words, 5 words, 3 lines of text, 5 lines of text, 1 paragraph, 2 paragraphs or other values, etc. It should be noted that the target word may appear multiple times in the legal text database. As long as the interval between a word and the target word at any one time is less than the interval threshold, it can be regarded as the target word. Related words.

After each related word of the target word is determined, the first degree of relevance between the target word and each related word can be calculated according to the following formula:

Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of _{the target word, and ConNum c} is the effective frequency of the c-th related word of the target word, Assuming that the number of occurrences of the c-th related word in the legal text database is Num, and the interval between Num1 times and the closest target word is less than the interval threshold, the effective frequency of the c-th related word is Num1 , The remaining number (Num-Num1) is the invalid frequency, and FtConnect _c is the first degree of relevance between the target word and the c-th related word.

Step S1043: Query the word vector of the target word and the word vector of each related word in the preset second word vector database.

Step S1044: According to the first degree of relevance between the target word and each related word, and the word vector of each related word, update the word vector of the target word to obtain the updated word vector of the target word.

Specifically, the second degree of relevance between the target word and each related word may be calculated first according to the following formula:

Among them, d is the dimension number of the word vector, 1≤d≤DN, DN is the total number of dimensions of the word vector, TgtElm _d is the value of the word vector of the target word in the dth dimension, and CntElm _c,d is the value of the word vector. State the value of the word vector of the c-th related word of the target word in the d-th dimension, and SdConnect _c is the second degree of relevance between the target word and the c-th related word;

Then, calculate the correlation error between the target word and each related word according to the following formula:

ErrElm _c =SdConnect _c —FtConnect _c

Wherein, ErrElm _c is the relevance error between the target word and the c-th related word;

Finally, update the word vector of the target word according to the following formula:

Among them, λ is a preset update coefficient, and its value can be set according to the actual situation, for example, it can be set to 0.01, 0.001 or other values, etc., NwTgtElm _d is the update word vector of the target word in The value on the dth dimension.

Step S1045: Add the updated word vector of the target word into the first word vector database.

In this way, all the words in the legal text database are traversed, the word vectors of each word are updated to obtain the corresponding updated word vectors, and finally the first word vector is constructed from the updated word vectors of all words database.

Step S105: Calculate the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set.

Specifically, the vector distance between the core word subset and each feature word set can be calculated separately according to the following formula:

Where k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the sequence number of each storage partition, 1≤t≤TN, and e is each feature The word sequence number in the word set, 1≤e≤EN _t , EN _t is the total number of words in the t-th feature word set, and the t-th feature word set is the feature word set corresponding to the t-th storage partition, KeyElm _{k, d} is the value of the word vector of the k-th word in the core word subset in the d-th dimension, and EigElm _t,e,d is the word vector of the e-th word in the t-th feature word set. Values in d dimensions, Dis _t is the vector distance between the core word subset and the t-th feature word set.

Step S106: Store the legal text in the preferred storage partition.

The preferred storage partition is the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset. Specifically, the preferred storage partition to which the legal text belongs can be selected according to the following formula:

TgtLawDom=Argmin(DisSq)

=Argmin(Dis ₁ ,Dis ₂ ,......,Dis _t ,......,Dis _TN )

Among them, Argmin is the smallest independent variable function, DisSq is the vector distance sequence of the core word subset, and: DisSq=(Dis ₁ ,Dis ₂ ,......,Dis _t ,......, Dis _TN ), TgtLawDom is the serial number of the preferred storage partition to which the legal text belongs.

To sum up, in the embodiments of this application, the actual core content of the legal text is stored, and the legal texts with similar content will be stored in the same storage partition. When the user needs to query related materials, he only needs to store it in the corresponding storage. The search can be performed in the partition, which saves the labor cost and greatly improves the work efficiency.

It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.

Corresponding to a legal text storage method described in the above embodiment, FIG. 4 shows a structural diagram of an embodiment of a legal text storage device provided in an embodiment of the present application.

In this embodiment, a legal text storage device may include:

The legal text obtaining module 401 is configured to receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address;

The first word segmentation processing module 402 is configured to perform word segmentation processing on the legal text to obtain a set of words that make up the legal text;

The core word subset selection module 403 is configured to select a core word subset from the word set. The core word subset includes those whose term density is greater than a preset first threshold and their uniformity is greater than a preset second threshold Various words

The first word vector query module 404 is configured to obtain each feature word set corresponding to each storage partition, and respectively query the word vector of each word in the core word subset in the preset first word vector database, and The word vector of each word in each feature word set;

The vector distance calculation module 405 is configured to calculate the distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set. The vector distance;

The partition storage module 406 is configured to store the legal text in a preferred storage partition, which is the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.

Further, the legal text storage device may further include:

The second word segmentation processing module is used to perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that constitutes the legal text database;

The first degree of relevance calculation module is used to determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, and the target word is any one that composes the legal text database Words

The second word vector query module is used to query the word vector of the target word and the word vector of each related word in the preset second word vector database;

The update calculation module is used to update and calculate the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the update of the target word Word vector

The vector adding module is used to add the updated word vector of the target word to the first word vector database.

Further, the update calculation module may include:

The first calculation unit is configured to calculate the second degree of relevance between the target word and each related word;

The second calculation unit is used to calculate the correlation error between the target word and each related word respectively;

The third calculation unit is used to update and calculate the word vector of the target word.

Further, the core word subset selection module may include:

The term density calculation unit is used to calculate the term density of each word in the word set;

A uniformity calculation unit for calculating the uniformity of each word in the word set;

The core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.

Those skilled in the art can clearly understand that, for the convenience and conciseness of the description, the specific working processes of the above described devices, modules and units can refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail or recorded in an embodiment, reference may be made to related descriptions of other embodiments.

FIG. 5 shows a schematic block diagram of a terminal device provided by an embodiment of the present application. For ease of description, only parts related to the embodiment of the present application are shown.

In this embodiment, the terminal device 5 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The terminal device 5 may include: a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and running on the processor 50, for example, a computer-readable instruction that executes the foregoing legal text storage method instruction. When the processor 50 executes the computer-readable instructions 52, the steps in the foregoing legal text storage method embodiments are implemented, for example, steps S101 to S106 shown in FIG. 1. Alternatively, when the processor 50 executes the computer-readable instructions 52, the functions of the modules/units in the foregoing device embodiments, such as the functions of the modules 401 to 406 shown in FIG. 4, are implemented.

Exemplarily, the computer-readable instructions 52 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 51 and executed by the processor 50, To complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the terminal device 5.

The processor 50 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.

The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk equipped on the terminal device 5, a smart memory card (Smart Media Card, SMC), and a Secure Digital (SD) Card, Flash Card, etc. Further, the memory 51 may also include both an internal storage unit of the terminal device 5 and an external storage device. The memory 51 is used to store the computer-readable instructions and other instructions and data required by the terminal device 5. The memory 51 can also be used to temporarily store data that has been output or will be output.

A person of ordinary skill in the art can understand that all or part of the processes in the method of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer. In a readable storage medium, when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that it can still implement the foregoing The technical solutions recorded in the examples are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for storing legal text, which is characterized in that it includes:

Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;

Perform word segmentation processing on the legal text to obtain a collection of words that constitute the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;

Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector

Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;

The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
The legal text storage method according to claim 1, wherein the setting process of the first word vector database comprises:

Perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that composes the legal text database;

Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, where the target word is any word that composes the legal text database;

Respectively query the word vector of the target word and the word vector of each related word in the preset second word vector database;

Update the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the updated word vector of the target word;

The updated word vector of the target word is added to the first word vector database.
The legal text storage method according to claim 2, wherein said updating the word vector of the target word to obtain the updated word vector of the target word comprises:

Calculate the second degree of relevance between the target word and each related word according to the following formula:

Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;

Calculate the correlation error between the target word and each related word according to the following formula:

ErrElm c =SdConnect c —FtConnect c

Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;

The word vector of the target word is updated and calculated according to the following formula:

Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
The legal text storage method according to claim 1, wherein the calculating the vector distance between the core word subset and each feature word set respectively comprises:

Calculate the vector distance between the core word subset and each feature word set according to the following formula:

Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
The legal text storage method according to any one of claims 1 to 4, wherein selecting a core word subset from the word set comprises:

Calculate the entry density of each word in the word set according to the following formula:

Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;

Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, where FN is an integer greater than 1.

Calculate the uniformity of each word in the word set according to the following formula:

Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
WdEqu w is the uniformity of the wth word in the word set;

Each word with a word density greater than the first threshold and a uniformity greater than the second threshold is selected from the word set to form the core word subset.
A legal text storage device, characterized in that it comprises:

A legal text acquisition module, configured to receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address;

The first word segmentation processing module is used to perform word segmentation processing on the legal text to obtain a set of words that constitute the legal text;

The core word subset selection module is configured to select a core word subset from the word set. The core word subset includes each item whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold Words

The first word vector query module is used to obtain each feature word set corresponding to each storage partition, and respectively query the word vector of each word in the core word subset in the preset first word vector database, and each The word vector of each word in the feature word set;

The vector distance calculation module is used to calculate the difference between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set. Vector distance

The partition storage module is configured to store the legal text in a preferred storage partition, the preferred storage partition being the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
The legal text storage device according to claim 6, further comprising:

The second word segmentation processing module is used to perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that constitutes the legal text database;

The first degree of relevance calculation module is used to determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, and the target word is any one that composes the legal text database Words

The second word vector query module is used to query the word vector of the target word and the word vector of each related word in the preset second word vector database;

The update calculation module is used to update and calculate the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the update of the target word Word vector

The vector adding module is used to add the updated word vector of the target word into the first word vector database.
The legal text storage device according to claim 7, wherein the update calculation module comprises:

The first calculation unit is configured to calculate the second degree of relevance between the target word and each related word according to the following formula:

Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d-th dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d-th dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;

The second calculation unit is configured to calculate the correlation error between the target word and each related word according to the following formula:

ErrElm c =SdConnect c —FtConnect c

Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;

The third calculation unit is used to update and calculate the word vector of the target word according to the following formula:

Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
The legal text storage device according to claim 6, wherein the vector distance calculation module is specifically configured to calculate the vector distance between the core word subset and each feature word set according to the following formula:

Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
The legal text storage device according to any one of claims 6 to 9, wherein the core word subset selection module comprises:

The term density calculation unit is used to calculate the term density of each word in the word set according to the following formula:

Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;

The text paragraph dividing unit is used to divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, and FN is an integer greater than one;

The uniformity calculation unit is used to calculate the uniformity of each word in the word set according to the following formula:

Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
WdEqu w is the uniformity of the wth word in the word set;

The core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.
A computer non-volatile readable storage medium, the computer non-volatile readable storage medium storing computer readable instructions, characterized in that the computer readable instructions are executed by a processor to implement the following steps:

Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;

Perform word segmentation processing on the legal text to obtain a collection of words that make up the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;

Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector

Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;

The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
The computer non-volatile readable storage medium according to claim 11, wherein the setting process of the first word vector database comprises:

Perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that composes the legal text database;

Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, where the target word is any word that composes the legal text database;

Respectively query the word vector of the target word and the word vector of each related word in the preset second word vector database;

Update the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the updated word vector of the target word;

The updated word vector of the target word is added to the first word vector database.
The computer non-volatile readable storage medium according to claim 12, wherein said updating the word vector of the target word to obtain the updated word vector of the target word comprises:

Calculate the second degree of relevance between the target word and each related word according to the following formula:

Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d-th dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d-th dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;

Calculate the correlation error between the target word and each related word according to the following formula:

ErrElm c =SdConnect c —FtConnect c

Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;

The word vector of the target word is updated and calculated according to the following formula:

Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
The computer non-volatile readable storage medium according to claim 11, wherein said calculating the vector distance between the core word subset and each feature word set respectively comprises:

Calculate the vector distance between the core word subset and each feature word set according to the following formula:

Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
The computer non-volatile readable storage medium according to any one of claims 11 to 14, wherein selecting a core word subset from the word set comprises:

Calculate the entry density of each word in the word set according to the following formula:

Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;

Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, where FN is an integer greater than 1.

Calculate the uniformity of each word in the word set according to the following formula:

Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
WdEqu w is the uniformity of the wth word in the word set;

Each word with a word density greater than the first threshold and a uniformity greater than the second threshold is selected from the word set to form the core word subset.
A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and running on the processor, wherein the processor executes the computer-readable instructions as follows step:

Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;

Perform word segmentation processing on the legal text to obtain a collection of words that make up the legal text;

Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;

Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector

Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;

The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
The terminal device according to claim 16, wherein the setting process of the first word vector database comprises:

Perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that composes the legal text database;

Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, where the target word is any word that composes the legal text database;

Respectively query the word vector of the target word and the word vector of each related word in the preset second word vector database;

Update the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the updated word vector of the target word;

The updated word vector of the target word is added to the first word vector database.
The terminal device according to claim 17, wherein the updating and calculating the word vector of the target word to obtain the updated word vector of the target word comprises:

Calculate the second degree of relevance between the target word and each related word according to the following formula:

Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d-th dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d-th dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;

Calculate the correlation error between the target word and each related word according to the following formula:

ErrElm c =SdConnect c —FtConnect c

Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;

The word vector of the target word is updated and calculated according to the following formula:

Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
The terminal device according to claim 16, wherein the calculating the vector distance between the core word subset and each feature word set respectively comprises:

Calculate the vector distance between the core word subset and each feature word set according to the following formula:

Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
The terminal device according to any one of claims 16 to 19, wherein selecting a core word subset from the word set comprises:

Calculate the entry density of each word in the word set according to the following formula:

Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;

Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, where FN is an integer greater than 1.

Calculate the uniformity of each word in the word set according to the following formula:

Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
WdEqu w is the uniformity of the wth word in the word set;

Each word with a word density greater than the first threshold and a uniformity greater than the second threshold is selected from the word set to form the core word subset.