WO2021042511A1 - Legal text storage method and device, readable storage medium and terminal device - Google Patents

Legal text storage method and device, readable storage medium and terminal device Download PDF

Info

Publication number
WO2021042511A1
WO2021042511A1 PCT/CN2019/116635 CN2019116635W WO2021042511A1 WO 2021042511 A1 WO2021042511 A1 WO 2021042511A1 CN 2019116635 W CN2019116635 W CN 2019116635W WO 2021042511 A1 WO2021042511 A1 WO 2021042511A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
vector
target
subset
core
Prior art date
Application number
PCT/CN2019/116635
Other languages
French (fr)
Chinese (zh)
Inventor
周剀
周萌
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021042511A1 publication Critical patent/WO2021042511A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures

Definitions

  • This application belongs to the field of computer technology, and in particular relates to a legal text storage method, device, computer non-volatile readable storage medium, and terminal equipment.
  • the embodiments of the present application provide a legal text storage method, device, computer non-volatile readable storage medium, and terminal equipment to solve the problem that the existing legal text storage is inconvenient for users to query.
  • the first aspect of the embodiments of the present application provides a legal text storage method, which may include:
  • the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;
  • Word vector Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database.
  • the legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
  • the second aspect of the embodiments of the present application provides a legal text storage device, which may include a module for implementing the steps of the foregoing legal text storage method.
  • a third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium, the computer non-volatile readable storage medium stores computer readable instructions, and the computer readable instructions are executed by a processor When realizing the steps of the above legal text storage method.
  • the fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and running on the processor, and the processor executes the computer
  • the steps of the above legal text storage method are realized when the instructions are readable.
  • the actual core content of the legal text is stored, and legal texts with similar content will be stored in the same storage partition.
  • the user needs to query related materials, he only needs to search in the corresponding storage partition.
  • the labor cost is saved, and the work efficiency is greatly improved.
  • FIG. 1 is a flowchart of an embodiment of a method for storing legal text in an embodiment of this application
  • Figure 2 is a schematic flow chart of selecting a core word subset from a word set
  • FIG. 3 is a schematic flowchart of the setting process of the first word vector database
  • FIG. 4 is a structural diagram of an embodiment of a legal text storage device in an embodiment of the application.
  • Fig. 5 is a schematic block diagram of a terminal device in an embodiment of the application.
  • an embodiment of a method for storing legal text in an embodiment of the present application may include:
  • Step S101 Receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address.
  • the legal texts include, but are not limited to, texts in legal provisions, legal essays, legal reports, legal analysis articles, indictments, rulings, and other legal-related materials.
  • the legal text storage instruction carries the address where the legal text is currently located, that is, The target address.
  • the target address may be a certain storage address in the terminal device, or a certain storage address in the network or a designated database.
  • the terminal device is the implementation subject of this embodiment. After receiving the legal text storage instruction, the terminal device can extract the target address from it, and obtain the target address from the local, network, or designated address according to the target address.
  • the legal text is obtained from the database.
  • Step S102 Perform word segmentation processing on the legal text to obtain a set of words constituting the legal text.
  • the terminal device will first perform word segmentation processing on it to obtain a set of words that constitute the legal text.
  • Word segmentation refers to dividing the legal text into individual words.
  • the general dictionary and the legal dictionary can be combined to segment the legal text, that is, the legal dictionary is used to split the legal text.
  • the legal text is segmented in the first round, and then the general dictionary is used to segment the remaining legal texts after the first round of segmentation.
  • the legal-specific terms are firstly segmented, and then the general terms are segmented.
  • single words are separated.
  • Step S103 Select a core word subset from the word set.
  • the core word subset includes each word whose term density is greater than the preset first threshold and the uniformity is greater than the preset second threshold.
  • step S103 may specifically include the following steps:
  • Step S1031 respectively calculate the entry density of each word in the word set.
  • the entry density of each word in the word set can be calculated according to the following formula:
  • WdNum w is the wth word in the word set in the legal text
  • LineNum is the total number of lines of the legal text
  • WdDensity w is the entry density of the w-th word in the word set.
  • Step S1032 Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph.
  • each KN line in the legal text can be regarded as a text paragraph, that is, the first line to the KN line in the legal text As the first text paragraph, take line KN+1 to line 2 ⁇ KN in the legal text as the second text paragraph, and change line 2 ⁇ KN+1 to line 3 ⁇ in the legal text Line KN is used as the third text paragraph, and so on.
  • Ceil is a round-up function.
  • the value of KN can be set according to specific conditions, for example, it can be set to 3, 5, 10 or other values and so on.
  • Step S1033 Calculate the uniformity of each word in the word set respectively.
  • the uniformity of each word in the word set can be calculated according to the following formula:
  • f is the serial number of each text paragraph of the legal text, 1 ⁇ f ⁇ FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And WdEqu w is the uniformity of the wth word in the word set.
  • Step S1034 Select each word with a word density greater than the first threshold and a uniformity greater than the second threshold from the word set to form the core word subset.
  • the specific values of the first threshold and the second threshold may be set according to actual conditions.
  • the following entry density sequence can be constructed first according to the order of value from largest to smallest:
  • DensitySet ⁇ WdDensity 1 , WdDensity 2 , ..., WdDensity w , ..., WdDensity WN ⁇
  • DensitySet is the term density sequence.
  • MaxDensitySet ⁇ MaxWdDensity 1 , MaxWdDensity 2 , ..., MaxWdDensity nmax , ..., MaxWdDensity MaxNum ⁇
  • MaxDensitySet is the maximum entry density sequence
  • MaxNum is the number of values in the maximum entry density sequence
  • MaxNum WN ⁇ 1
  • ⁇ 1 is the first selection ratio, which can be based on actual conditions Set it to 0.2, 0.3, 0.4 or other values
  • nmax is the value sequence number in the maximum entry density sequence
  • MaxWdDensity nmax is the nmax of the maximum entry density sequence value.
  • MinDensitySet ⁇ MinWdDensity 1 , MinWdDensity 2 , ..., MinWdDensity nmin , ..., MinWdDensity MinNum ⁇
  • MinDensitySet is the minimum entry density sequence
  • MinNum is the number of values in the minimum entry density sequence
  • MaxNum WN ⁇ 2
  • ⁇ 2 is the second selection ratio, which can be based on actual conditions Set it to 0.2, 0.3, 0.4 or other values
  • nmin is the value sequence number in the minimum entry density sequence
  • MinWdDensity nmin is the nminth value of the minimum entry density sequence value.
  • MidDensitySet ⁇ MidWdDensity 1 , MidWdDensity 2 , ..., MidWdDensity nmid , ..., MidWdDensity MidNum ⁇
  • MidDensitySet is the median term density sequence
  • MidDensitySet DensitySet-MaxDensitySet-MinDensitySet
  • MidNum is the number of values in the median term density sequence
  • MidNum WN ⁇ (1- ⁇ 1- ⁇ 2 )
  • nmid is the value sequence number in the median entry density sequence, 1 ⁇ nmid ⁇ MidNum
  • MidWdDensity nmid is the nmid value in the median entry density sequence.
  • the setting process of the second threshold is similar to the setting process of the first threshold. It is only necessary to replace the density of entries appearing therein with uniformity. For details, please refer to the above content, which will not be repeated here.
  • Step S104 Obtain each feature word set corresponding to each storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Words vector of words.
  • all legal texts can be divided into multiple storage partitions according to actual conditions.
  • the total number of storage partitions is recorded as TN.
  • the corresponding feature word set can be set in advance.
  • the feature word set corresponding to the civil storage partition can be set as: ⁇ civil, company, contract, liability, loan, compensation, interest, accident, insurance ⁇
  • set the characteristic word set corresponding to the criminal storage partition as: ⁇ criminal, criminal, fixed-term imprisonment, life imprisonment, victim, sentence ⁇
  • set the characteristic word set corresponding to the administrative storage partition as: ⁇ administration, government, procedure , Trademark, property ⁇ .
  • Any word vector database is a database that records the correspondence between words and word vectors.
  • the word vector may be a corresponding word vector obtained by training the word according to the word2vec model. That is, the probability of occurrence of the word is expressed according to the context information of the word.
  • the training of word vectors is still based on the idea of word2vec. First, each word is represented as a 0-1 vector (one-hot) form, and then the word2vec model is trained with the word vector, and n-1 words are used to predict the nth word , The intermediate process obtained after the neural network model prediction is used as the word vector.
  • the one-hot vector of "celebration” is assumed to be [1,0,0,0,...,0] and the one-hot vector of "meeting” is [0,1,0,0,... ...,0], the one-hot vector for "smooth” is [0,0,1,0, whil,0], the vector for predicting "closing” [0,0,0,1,>,0],
  • the model is trained to generate the coefficient matrix W of the hidden layer.
  • the product of the one-hot vector of each word and the coefficient matrix is the word vector of the word.
  • the final form will be similar to "Celebrate [-0.28,0.34,-0.02, ......,0.92]" such a multi-dimensional vector.
  • the legal text is used to update the existing open source word vector database (referred to as the second word vector database here) to obtain a word vector database for the legal text (referred to here as the first word vector database).
  • Vector database method, the specific process is shown in Figure 3:
  • Step S1041 perform word segmentation processing on each piece of legal text in the preset legal text database, to obtain each word that composes the legal text database.
  • the legal text database contains as many legal texts as possible in a certain statistical time period.
  • the statistical time period can be set according to the actual situation, for example, it can be set to a time period within a week, a month, a quarter, or a year from the current moment.
  • step S101 The process of word segmentation is similar to the process in step S101.
  • steps S101 please refer to the description in step S101, which will not be repeated here.
  • Step S1042 Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word.
  • the target word is any word that composes the legal text database.
  • the related words are words whose intervals with the target words in the legal text database are less than a preset interval threshold.
  • the interval threshold can be set according to actual conditions. For example, it can be set to 3 words, 5 words, 3 lines of text, 5 lines of text, 1 paragraph, 2 paragraphs or other values, etc. It should be noted that the target word may appear multiple times in the legal text database. As long as the interval between a word and the target word at any one time is less than the interval threshold, it can be regarded as the target word.
  • the target word may appear multiple times in the legal text database. As long as the interval between a word and the target word at any one time is less than the interval threshold, it can be regarded as the target word.
  • the first degree of relevance between the target word and each related word can be calculated according to the following formula:
  • c is the serial number of each related word of the target word, 1 ⁇ c ⁇ CN, CN is the total number of related words of the target word, and ConNum c is the effective frequency of the c-th related word of the target word, Assuming that the number of occurrences of the c-th related word in the legal text database is Num, and the interval between Num1 times and the closest target word is less than the interval threshold, the effective frequency of the c-th related word is Num1 , The remaining number (Num-Num1) is the invalid frequency, and FtConnect c is the first degree of relevance between the target word and the c-th related word.
  • Step S1043 Query the word vector of the target word and the word vector of each related word in the preset second word vector database.
  • Step S1044 According to the first degree of relevance between the target word and each related word, and the word vector of each related word, update the word vector of the target word to obtain the updated word vector of the target word.
  • the second degree of relevance between the target word and each related word may be calculated first according to the following formula:
  • d is the dimension number of the word vector, 1 ⁇ d ⁇ DN, DN is the total number of dimensions of the word vector, TgtElm d is the value of the word vector of the target word in the dth dimension, and CntElm c,d is the value of the word vector.
  • SdConnect c is the second degree of relevance between the target word and the c-th related word;
  • ErrElm c is the relevance error between the target word and the c-th related word
  • is a preset update coefficient, and its value can be set according to the actual situation, for example, it can be set to 0.01, 0.001 or other values, etc.
  • NwTgtElm d is the update word vector of the target word in The value on the dth dimension.
  • Step S1045 Add the updated word vector of the target word into the first word vector database.
  • the word vectors of each word are updated to obtain the corresponding updated word vectors, and finally the first word vector is constructed from the updated word vectors of all words database.
  • Step S105 Calculate the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set.
  • the vector distance between the core word subset and each feature word set can be calculated separately according to the following formula:
  • k is the word sequence number in the core word subset, 1 ⁇ k ⁇ KN, KN is the total number of words in the core word subset, t is the sequence number of each storage partition, 1 ⁇ t ⁇ TN, and e is each feature
  • the word sequence number in the word set, 1 ⁇ e ⁇ EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set is the feature word set corresponding to the t-th storage partition
  • KeyElm k d is the value of the word vector of the k-th word in the core word subset in the d-th dimension
  • EigElm t,e,d is the word vector of the e-th word in the t-th feature word set. Values in d dimensions
  • Dis t is the vector distance between the core word subset and the t-th feature word set.
  • Step S106 Store the legal text in the preferred storage partition.
  • the preferred storage partition is the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
  • the preferred storage partition to which the legal text belongs can be selected according to the following formula:
  • TgtLawDom Argmin(DisSq)
  • Argmin is the smallest independent variable function
  • DisSq is the vector distance sequence of the core word subset
  • DisSq (Dis 1 ,Dis 2 , whil,Dis t , whil, Dis TN )
  • TgtLawDom is the serial number of the preferred storage partition to which the legal text belongs.
  • the actual core content of the legal text is stored, and the legal texts with similar content will be stored in the same storage partition.
  • the user needs to query related materials, he only needs to store it in the corresponding storage.
  • the search can be performed in the partition, which saves the labor cost and greatly improves the work efficiency.
  • FIG. 4 shows a structural diagram of an embodiment of a legal text storage device provided in an embodiment of the present application.
  • a legal text storage device may include:
  • the legal text obtaining module 401 is configured to receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address;
  • the first word segmentation processing module 402 is configured to perform word segmentation processing on the legal text to obtain a set of words that make up the legal text;
  • the core word subset selection module 403 is configured to select a core word subset from the word set.
  • the core word subset includes those whose term density is greater than a preset first threshold and their uniformity is greater than a preset second threshold Various words
  • the first word vector query module 404 is configured to obtain each feature word set corresponding to each storage partition, and respectively query the word vector of each word in the core word subset in the preset first word vector database, and The word vector of each word in each feature word set;
  • the vector distance calculation module 405 is configured to calculate the distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set. The vector distance;
  • the partition storage module 406 is configured to store the legal text in a preferred storage partition, which is the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
  • the legal text storage device may further include:
  • the second word segmentation processing module is used to perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that constitutes the legal text database;
  • the first degree of relevance calculation module is used to determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, and the target word is any one that composes the legal text database Words
  • the second word vector query module is used to query the word vector of the target word and the word vector of each related word in the preset second word vector database;
  • the update calculation module is used to update and calculate the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the update of the target word Word vector
  • the vector adding module is used to add the updated word vector of the target word to the first word vector database.
  • update calculation module may include:
  • the first calculation unit is configured to calculate the second degree of relevance between the target word and each related word
  • the second calculation unit is used to calculate the correlation error between the target word and each related word respectively;
  • the third calculation unit is used to update and calculate the word vector of the target word.
  • the core word subset selection module may include:
  • the term density calculation unit is used to calculate the term density of each word in the word set
  • a uniformity calculation unit for calculating the uniformity of each word in the word set
  • the core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.
  • FIG. 5 shows a schematic block diagram of a terminal device provided by an embodiment of the present application. For ease of description, only parts related to the embodiment of the present application are shown.
  • the terminal device 5 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device 5 may include: a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and running on the processor 50, for example, a computer-readable instruction that executes the foregoing legal text storage method instruction.
  • the processor 50 executes the computer-readable instructions 52
  • the steps in the foregoing legal text storage method embodiments are implemented, for example, steps S101 to S106 shown in FIG. 1.
  • the processor 50 executes the computer-readable instructions 52
  • the functions of the modules/units in the foregoing device embodiments such as the functions of the modules 401 to 406 shown in FIG. 4, are implemented.
  • the computer-readable instructions 52 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 51 and executed by the processor 50, To complete this application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the terminal device 5.
  • the processor 50 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5.
  • the memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk equipped on the terminal device 5, a smart memory card (Smart Media Card, SMC), and a Secure Digital (SD) Card, Flash Card, etc. Further, the memory 51 may also include both an internal storage unit of the terminal device 5 and an external storage device.
  • the memory 51 is used to store the computer-readable instructions and other instructions and data required by the terminal device 5.
  • the memory 51 can also be used to temporarily store data that has been output or will be output.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

Abstract

Provided are a legal text storage method and device, a computer non-volatile readable storage medium and a terminal device. According to the method, after a related instruction is received, a legal text is automatically obtained, a core word subset that can effectively represent the core content of the legal text is automatically selected from the legal text through automatic text analysis, a vector distance between the core word subset and each feature word set is calculated by means of word vectors, the vector distance is taken as the basis for determining a storage partition in which the legal text should be stored, the storage partition corresponding to the feature word set with the minimum vector distance from the core word subset is selected as a preferred storage partition, and the legal text is stored in the preferred storage partition. When a user needs to query related material, the user only needs to search in the corresponding storage partition, which saves labor costs, and greatly improves the working efficiency.

Description

一种法律文本存储方法、装置、可读存储介质及终端设备Method, device, readable storage medium and terminal equipment for storing legal text
本申请要求于2019年9月3日提交中国专利局、申请号为201910826805.4、发明名称为“一种法律文本存储方法、装置、可读存储介质及终端设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on September 3, 2019, the application number is 201910826805.4, and the invention title is "a legal text storage method, device, readable storage medium, and terminal equipment". The entire content is incorporated into this application by reference.
技术领域Technical field
本申请属于计算机技术领域,尤其涉及一种法律文本存储方法、装置、计算机非易失性可读存储介质及终端设备。This application belongs to the field of computer technology, and in particular relates to a legal text storage method, device, computer non-volatile readable storage medium, and terminal equipment.
背景技术Background technique
法律从业人员在日常的法律工作中往往会积累大量的法律文本,现有技术中提供了多种对这些法律文本进行有序存储的方法,例如,可以按照时间、大小、名称等进行升序或降序的存储。这样的存储方法虽然可以使得这些法律文本看起来井然有序,但却并未考虑到这些法律文本内在的关联性,不便于用户进行查询,当用户需要从中查询相关的资料时,往往需要逐个进行查看,耗费大量的人力成本,效率极为低下。Legal practitioners tend to accumulate a large number of legal texts in their daily legal work. The prior art provides a variety of methods to store these legal texts in an orderly manner. For example, they can be sorted in ascending or descending order according to time, size, name, etc. Storage. Although this storage method can make these legal texts look orderly, it does not take into account the inherent relevance of these legal texts, which is not convenient for users to query. When users need to query related materials, they often need to do it one by one. Checking, consumes a lot of manpower costs, and is extremely inefficient.
技术问题technical problem
有鉴于此,本申请实施例提供了一种法律文本存储方法、装置、计算机非易失性可读存储介质及终端设备,以解决现有的法律文本存储不便于用户进行查询的问题。In view of this, the embodiments of the present application provide a legal text storage method, device, computer non-volatile readable storage medium, and terminal equipment to solve the problem that the existing legal text storage is inconvenient for users to query.
技术解决方案Technical solutions
本申请实施例的第一方面提供了一种法律文本存储方法,可以包括:The first aspect of the embodiments of the present application provides a legal text storage method, which may include:
接收法律文本存储指令,提取所述法律文本存储指令中的目标地址,并获取所述目标地址中的法律文本;Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;
对所述法律文本进行分词处理,得到组成所述法律文本的词语集合;Perform word segmentation processing on the legal text to obtain a collection of words that make up the legal text;
从所述词语集合中选取核心词子集,所述核心词子集中包括词条密度大于预设的第一阈值且均匀度大于预设的第二阈值的各个词语;Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;
分别获取与各个预设的存储分区对应的各个特征词集合,并在预设的第一词语向量数据库中分别查询所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量;Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector
根据所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量,分别计算所述核心词子集与各个特征词集合之间的向量距离;Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;
将所述法律文本存储入优选存储分区中,所述优选存储分区为与所述核心词子集之间的向量距离最小的特征词集合所对应的存储分区。The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
本申请实施例的第二方面提供了一种法律文本存储装置,可以包括用于实现上述法律文本存储方法的步骤的模块。The second aspect of the embodiments of the present application provides a legal text storage device, which may include a module for implementing the steps of the foregoing legal text storage method.
本申请实施例的第三方面提供了一种计算机非易失性可读存储介质,所述计算机非易失性可读存储介质存储有计算机可读指令,所述计算机可读指令被处理器执行时实现上述法律文本存储方法的步骤。A third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium, the computer non-volatile readable storage medium stores computer readable instructions, and the computer readable instructions are executed by a processor When realizing the steps of the above legal text storage method.
本申请实施例的第四方面提供了一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现上述法律文本存储方法的步骤。The fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and running on the processor, and the processor executes the computer The steps of the above legal text storage method are realized when the instructions are readable.
有益效果Beneficial effect
在本申请实施例中,按照法律文本的实际核心内容进行存储,内容相似的法律文本会被存储入同一存储分区中,当用户需要查询相关资料时,仅需在对应的存储分区中进行查找即可,节省了对于人力成本的耗费,大大提高了工作效率。In the embodiment of this application, the actual core content of the legal text is stored, and legal texts with similar content will be stored in the same storage partition. When the user needs to query related materials, he only needs to search in the corresponding storage partition. However, the labor cost is saved, and the work efficiency is greatly improved.
附图说明Description of the drawings
图1为本申请实施例中一种法律文本存储方法的一个实施例流程图;FIG. 1 is a flowchart of an embodiment of a method for storing legal text in an embodiment of this application;
图2为从词语集合中选取核心词子集的示意流程图;Figure 2 is a schematic flow chart of selecting a core word subset from a word set;
图3为第一词语向量数据库的设置过程的示意流程图;FIG. 3 is a schematic flowchart of the setting process of the first word vector database;
图4为本申请实施例中一种法律文本存储装置的一个实施例结构图;4 is a structural diagram of an embodiment of a legal text storage device in an embodiment of the application;
图5为本申请实施例中一种终端设备的示意框图。Fig. 5 is a schematic block diagram of a terminal device in an embodiment of the application.
本发明的实施方式Embodiments of the present invention
请参阅图1,本申请实施例中一种法律文本存储方法的一个实施例可以包括:Referring to FIG. 1, an embodiment of a method for storing legal text in an embodiment of the present application may include:
步骤S101、接收法律文本存储指令,提取所述法律文本存储指令中的目标地址,并获取所述目标地址中的法律文本。Step S101: Receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address.
所述法律文本包括但不限于法律条文、法律论文、法律报道、法律分析文章以及法院的起诉书、裁决书等等与法律相关的材料中的文本。The legal texts include, but are not limited to, texts in legal provisions, legal essays, legal reports, legal analysis articles, indictments, rulings, and other legal-related materials.
当用户需要对某一法律文本进行存储时,可以通过人机交互界面向预设的终端设备下发法律文本存储指令,在所述法律文本存储指令中携带着法律文本当前所在的地址,也即所述目标地址。所述目标地址可以是所述终端设备中的某一存储地址,也可以是网络中或者指定的数据库中的某一存储地址。所述终端设备即为本实施例的实施主体,在接收到所述法律文本存储指令之后,所述终端设备可以从中提取出所述目标地址,并根据所述目标地址从本地、网络或者指定的数据库中获取到法律文本。When a user needs to store a certain legal text, he can issue a legal text storage instruction to a preset terminal device through a human-computer interaction interface. The legal text storage instruction carries the address where the legal text is currently located, that is, The target address. The target address may be a certain storage address in the terminal device, or a certain storage address in the network or a designated database. The terminal device is the implementation subject of this embodiment. After receiving the legal text storage instruction, the terminal device can extract the target address from it, and obtain the target address from the local, network, or designated address according to the target address. The legal text is obtained from the database.
步骤S102、对所述法律文本进行分词处理,得到组成所述法律文本的词语集合。Step S102: Perform word segmentation processing on the legal text to obtain a set of words constituting the legal text.
在进行法律文本存储的过程中,所述终端设备首先会对会对其进行分词处理,得到组成所述法律文本的词语集合。分词处理是指将所述法律文本切分成一个一个单独的词语,在本实施例中,可以采用通用词典与法律专用词典相结合的方式对所述法律 文本进行切分,即使用法律专用词典对所述法律文本进行第一轮切分,再使用通用词典对第一轮切分后剩下的法律文本进行切分,通过这样的方式,优先切分出法律专用词语,再切分出通用词语,对于既无法切分出法律专用词语又无法切分出通用词语的法律文本,则切分出单字。In the process of storing the legal text, the terminal device will first perform word segmentation processing on it to obtain a set of words that constitute the legal text. Word segmentation refers to dividing the legal text into individual words. In this embodiment, the general dictionary and the legal dictionary can be combined to segment the legal text, that is, the legal dictionary is used to split the legal text. The legal text is segmented in the first round, and then the general dictionary is used to segment the remaining legal texts after the first round of segmentation. In this way, the legal-specific terms are firstly segmented, and then the general terms are segmented. , For legal texts that cannot be distinguished neither legal terms nor general terms, single words are separated.
步骤S103、从所述词语集合中选取核心词子集。Step S103: Select a core word subset from the word set.
所述核心词子集中包括词条密度大于预设的第一阈值且均匀度大于预设的第二阈值的各个词语。The core word subset includes each word whose term density is greater than the preset first threshold and the uniformity is greater than the preset second threshold.
如图2所示,步骤S103具体可以包括如下步骤:As shown in FIG. 2, step S103 may specifically include the following steps:
步骤S1031、分别计算所述词语集合中的各个词语的词条密度。Step S1031, respectively calculate the entry density of each word in the word set.
具体地,可以根据下式分别计算所述词语集合中的各个词语的词条密度:Specifically, the entry density of each word in the word set can be calculated according to the following formula:
Figure PCTCN2019116635-appb-000001
Figure PCTCN2019116635-appb-000001
其中,w为所述词语集合中的各个词语的序号,1≤w≤WN,WN为所述词语集合中的词语数目,WdNum w为所述词语集合中的第w个词语在所述法律文本中出现的次数,LineNum为所述法律文本的总行数,WdDensity w为所述词语集合中的第w个词语的词条密度。 Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines of the legal text, and WdDensity w is the entry density of the w-th word in the word set.
步骤S1032、将所述法律文本划分为FN个文本段落,并分别统计所述词语集合中的各个词语在各个文本段落中的出现情况。Step S1032: Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph.
FN为大于1的整数。文本段落可以根据具体情况来划分,在本实施例的一种具体实现中,可以将所述法律文本中的每KN行作为一个文本段落,即将所述法律文本中的第1行至第KN行作为第一个文本段落,将所述法律文本中的第KN+1行至第2×KN行作为第二个文本段落,将所述法律文本中的第2×KN+1行至第3×KN行作为第三个文本段落,以此类推。则有:
Figure PCTCN2019116635-appb-000002
其中,Ceil为向上取整函数。KN的取值可以根据具体情况进行设置,例如,可以将其设置为3、5、10或者其它取值等等。
FN is an integer greater than 1. The text paragraphs can be divided according to specific conditions. In a specific implementation of this embodiment, each KN line in the legal text can be regarded as a text paragraph, that is, the first line to the KN line in the legal text As the first text paragraph, take line KN+1 to line 2×KN in the legal text as the second text paragraph, and change line 2×KN+1 to line 3× in the legal text Line KN is used as the third text paragraph, and so on. Then there are:
Figure PCTCN2019116635-appb-000002
Among them, Ceil is a round-up function. The value of KN can be set according to specific conditions, for example, it can be set to 3, 5, 10 or other values and so on.
步骤S1033、分别计算所述词语集合中的各个词语的均匀度。Step S1033: Calculate the uniformity of each word in the word set respectively.
具体地,可以根据下式分别计算所述词语集合中的各个词语的均匀度:Specifically, the uniformity of each word in the word set can be calculated according to the following formula:
Figure PCTCN2019116635-appb-000003
Figure PCTCN2019116635-appb-000003
其中,f为所述法律文本的各个文本段落的序号,1≤f≤FN,Flag w,f为所述词语集 合中的第w个词语在第f个文本段落中的出现情况的标志位,且
Figure PCTCN2019116635-appb-000004
WdEqu w为所述词语集合中的第w个词语的均匀度。
Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
Figure PCTCN2019116635-appb-000004
WdEqu w is the uniformity of the wth word in the word set.
步骤S1034、从所述词语集合中选取词条密度大于所述第一阈值且均匀度大于所述第二阈值的各个词语组成所述核心词子集。Step S1034: Select each word with a word density greater than the first threshold and a uniformity greater than the second threshold from the word set to form the core word subset.
所述第一阈值和所述第二阈值的具体取值可以根据实际情况进行设置。The specific values of the first threshold and the second threshold may be set according to actual conditions.
在本实施例的一种具体实现中,可以首先按照取值从大到小的顺序构造如下所示的词条密度序列:In a specific implementation of this embodiment, the following entry density sequence can be constructed first according to the order of value from largest to smallest:
DensitySet={WdDensity 1、WdDensity 2、……、WdDensity w、……、WdDensity WN} DensitySet={WdDensity 1 , WdDensity 2 , ..., WdDensity w , ..., WdDensity WN }
其中,DensitySet即为所述词条密度序列。Wherein, DensitySet is the term density sequence.
然后,按照预设的第一选取比例从所述词条密度序列中选取排序在前的若干个取值,并将选取的取值构造为如下所示的最大词条密度序列:Then, according to the preset first selection ratio, select several values ranked first from the term density sequence, and construct the selected values into the maximum term density sequence as shown below:
MaxDensitySet={MaxWdDensity 1、MaxWdDensity 2、……、MaxWdDensity nmax、……、MaxWdDensity MaxNum} MaxDensitySet={MaxWdDensity 1 , MaxWdDensity 2 , ..., MaxWdDensity nmax , ..., MaxWdDensity MaxNum }
其中,MaxDensitySet为所述最大词条密度序列,MaxNum为所述最大词条密度序列中的取值个数,且MaxNum=WN×η 1,η 1为所述第一选取比例,可以根据实际情况将其设置为0.2、0.3、0.4或者其它取值,nmax为所述最大词条密度序列中的取值序号,1≤nmax≤MaxNum,MaxWdDensity nmax为所述最大词条密度序列的第nmax个取值。 Wherein, MaxDensitySet is the maximum entry density sequence, MaxNum is the number of values in the maximum entry density sequence, and MaxNum=WN×η 1 , η 1 is the first selection ratio, which can be based on actual conditions Set it to 0.2, 0.3, 0.4 or other values, nmax is the value sequence number in the maximum entry density sequence, 1≤nmax≤MaxNum, MaxWdDensity nmax is the nmax of the maximum entry density sequence value.
接着,按照预设的第二选取比例从所述词条密度序列中选取排序在后的若干个取值,并将选取的取值构造为如下所示的最小词条密度序列:Then, according to a preset second selection ratio, select several values that are ranked in the end of the term density sequence, and construct the selected values into the minimum term density sequence as shown below:
MinDensitySet={MinWdDensity 1、MinWdDensity 2、……、MinWdDensity nmin、……、MinWdDensity MinNum} MinDensitySet={MinWdDensity 1 , MinWdDensity 2 , ..., MinWdDensity nmin , ..., MinWdDensity MinNum }
其中,MinDensitySet为所述最小词条密度序列,MinNum为所述最小词条密度序列中的取值个数,且MaxNum=WN×η 2,η 2为所述第二选取比例,可以根据实际情况将其设置为0.2、0.3、0.4或者其它取值,nmin为所述最小词条密度序列中的取值序号,1≤nmin≤MinNum,MinWdDensity nmin为所述最小词条密度序列的第nmin个取值。 Wherein, MinDensitySet is the minimum entry density sequence, MinNum is the number of values in the minimum entry density sequence, and MaxNum=WN×η 2 , η 2 is the second selection ratio, which can be based on actual conditions Set it to 0.2, 0.3, 0.4 or other values, nmin is the value sequence number in the minimum entry density sequence, 1≤nmin≤MinNum, and MinWdDensity nmin is the nminth value of the minimum entry density sequence value.
再构造如下所示的中值词条密度序列:Then construct the median term density sequence as shown below:
MidDensitySet={MidWdDensity 1、MidWdDensity 2、……、MidWdDensity nmid、……、MidWdDensity MidNum} MidDensitySet={MidWdDensity 1 , MidWdDensity 2 , ..., MidWdDensity nmid , ..., MidWdDensity MidNum }
其中,MidDensitySet为所述中值词条密度序列,且MidDensitySet=DensitySet-MaxDensitySet-MinDensitySet,MidNum为所述中值词条密度序列中的取值个数,且MidNum=WN×(1-η 12),nmid为所述中值词条密度序列中的取值序号,1≤nmid≤MidNum,MidWdDensity nmid为所述中值词条密度序列的第nmid个取值。 Wherein, MidDensitySet is the median term density sequence, and MidDensitySet=DensitySet-MaxDensitySet-MinDensitySet, MidNum is the number of values in the median term density sequence, and MidNum=WN×(1-η 1- η 2 ), nmid is the value sequence number in the median entry density sequence, 1≤nmid≤MidNum, and MidWdDensity nmid is the nmid value in the median entry density sequence.
最后,根据下式计算所述第一阈值:Finally, calculate the first threshold according to the following formula:
Figure PCTCN2019116635-appb-000005
Figure PCTCN2019116635-appb-000005
其中,λ为预设的系数,且λ>0,FstThresh为所述第一阈值。Where λ is a preset coefficient, and λ>0, FstThresh is the first threshold.
所述第二阈值的设置过程与所述第一阈值的设置过程类似,仅需将其中出现的词条密度替换为均匀度即可,具体可参照上述内容,此处不再赘述。The setting process of the second threshold is similar to the setting process of the first threshold. It is only necessary to replace the density of entries appearing therein with uniformity. For details, please refer to the above content, which will not be repeated here.
步骤S104、分别获取与各个存储分区对应的各个特征词集合,并在预设的第一词语向量数据库中分别查询所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量。Step S104: Obtain each feature word set corresponding to each storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Words vector of words.
在本实施例中,可以根据实际情况将所有的法律文本划分为多个存储分区,此处将存储分区的总数记为TN。例如,可以将所有的法律文本划分为民事、刑事、行政这三个存储分区,即TN=3。In this embodiment, all legal texts can be divided into multiple storage partitions according to actual conditions. Here, the total number of storage partitions is recorded as TN. For example, all legal texts can be divided into three storage areas: civil, criminal, and administrative, that is, TN=3.
对于每一个存储分区,均可预先设置对应的特征词集合,例如,可以将与民事存储分区对应的特征词集合设置为:{民事、公司、合同、责任、借款、赔偿、利息、事故、保险},将与刑事存储分区对应的特征词集合设置为:{刑事、罪犯、有期徒刑、无期徒刑、被害人、刑期},将与行政存储分区对应的特征词集合设置为:{行政、政府、程序、商标、财产},需要注意的是,以上仅为特征词集合设置的一个具体示例,实际应用中还可以根据实际情况设置其它的特征词集合,本实施例对此不做具体限定。For each storage partition, the corresponding feature word set can be set in advance. For example, the feature word set corresponding to the civil storage partition can be set as: {civil, company, contract, liability, loan, compensation, interest, accident, insurance }, set the characteristic word set corresponding to the criminal storage partition as: {criminal, criminal, fixed-term imprisonment, life imprisonment, victim, sentence}, and set the characteristic word set corresponding to the administrative storage partition as: {administration, government, procedure , Trademark, property}. It should be noted that the above is only a specific example of the feature word set setting. In practical applications, other feature word sets can also be set according to the actual situation, which is not specifically limited in this embodiment.
任意一个词语向量数据库为记录词语与词语向量之间的对应关系的数据库。所述词语向量可以是根据word2vec模型训练词语所得到对应的词语向量。即根据词语的上下文信息来表示该词出现的概率。词语向量的训练依然按照word2vec的思想,先将每个词表示成一个0-1向量(one-hot)形式,再用词语向量进行word2vec模型训练,用 n-1个词来预测第n个词,神经网络模型预测后得到的中间过程作为词语向量。具体地,如“庆祝”的one-hot向量假设定为[1,0,0,0,……,0],“大会”的one-hot向量为[0,1,0,0,……,0],“顺利”的one-hot向量为[0,0,1,0,……,0],预测“闭幕”的向量[0,0,0,1,……,0],模型经过训练会生成隐藏层的系数矩阵W,每个词的one-hot向量和系数矩阵的乘积为该词的词语向量,最后的形式将是类似于“庆祝[-0.28,0.34,-0.02,…...,0.92]”这样的一个多维向量。Any word vector database is a database that records the correspondence between words and word vectors. The word vector may be a corresponding word vector obtained by training the word according to the word2vec model. That is, the probability of occurrence of the word is expressed according to the context information of the word. The training of word vectors is still based on the idea of word2vec. First, each word is represented as a 0-1 vector (one-hot) form, and then the word2vec model is trained with the word vector, and n-1 words are used to predict the nth word , The intermediate process obtained after the neural network model prediction is used as the word vector. Specifically, for example, the one-hot vector of "celebration" is assumed to be [1,0,0,0,...,0], and the one-hot vector of "meeting" is [0,1,0,0,... …,0], the one-hot vector for "smooth" is [0,0,1,0,……,0], the vector for predicting "closing" [0,0,0,1,……,0], The model is trained to generate the coefficient matrix W of the hidden layer. The product of the one-hot vector of each word and the coefficient matrix is the word vector of the word. The final form will be similar to "Celebrate [-0.28,0.34,-0.02, …...,0.92]" such a multi-dimensional vector.
现有技术中提供了很多开源的词语向量数据库,但这些词语向量数据库是通用于各个领域的,并非为法律文本专门设置,因此如果直接使用会降低最终分类结果的准确率,而如果根据word2vec模型重新训练一个专门针对法律文本的词语向量数据库,又需要耗费大量的计算时间。在本实施例中,采取了使用法律文本对已有的开源的词语向量数据库(此处记为第二词语向量数据库)进行更新,得到针对法律文本的词语向量数据库(此处记为第一词语向量数据库)的方法,具体过程如图3所示:Many open-source word vector databases are provided in the prior art, but these word vector databases are commonly used in various fields and are not specially set up for legal texts. Therefore, if they are used directly, the accuracy of the final classification results will be reduced, and if they are used according to the word2vec model Retraining a word vector database specifically for legal texts requires a lot of computational time. In this embodiment, the legal text is used to update the existing open source word vector database (referred to as the second word vector database here) to obtain a word vector database for the legal text (referred to here as the first word vector database). Vector database) method, the specific process is shown in Figure 3:
步骤S1041、对预设的法律文本库中的各条法律文本进行分词处理,得到组成所述法律文本库的各个词语。Step S1041, perform word segmentation processing on each piece of legal text in the preset legal text database, to obtain each word that composes the legal text database.
在所述法律文本库中尽可能多的包含某一统计时间段内获取的所有法律文本。该统计时间段可以根据实际情况进行设置,例如,可以将其设置为距离当前时刻一周、一个月、一个季度或者一年内的时间段。The legal text database contains as many legal texts as possible in a certain statistical time period. The statistical time period can be set according to the actual situation, for example, it can be set to a time period within a week, a month, a quarter, or a year from the current moment.
分词处理的过程与步骤S101中的过程类似,具体可参照步骤S101中的叙述,此处不再赘述。The process of word segmentation is similar to the process in step S101. For details, please refer to the description in step S101, which will not be repeated here.
步骤S1042、确定目标词语的各个关联词语,并分别计算所述目标词语与各个关联词语之间的第一关联度。Step S1042: Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word.
所述目标词语为组成所述法律文本库的任意一个词语。所述关联词语为在所述法律文本库中与所述目标词语的间隔小于预设的间隔阈值的词语,所述间隔阈值可以根据实际情况进行设置,例如,可以将其设置为3个词语、5个词语、3行文本、5行文本、1个段落、2个段落或者其它取值等等。需要注意的是,所述目标词语可能会在所述法律文本库中多次出现,某一词语只要在任意一次与所述目标词语的间隔小于所述间隔阈值,即可作为所述目标词语的关联词语。The target word is any word that composes the legal text database. The related words are words whose intervals with the target words in the legal text database are less than a preset interval threshold. The interval threshold can be set according to actual conditions. For example, it can be set to 3 words, 5 words, 3 lines of text, 5 lines of text, 1 paragraph, 2 paragraphs or other values, etc. It should be noted that the target word may appear multiple times in the legal text database. As long as the interval between a word and the target word at any one time is less than the interval threshold, it can be regarded as the target word. Related words.
在确定出所述目标词语的各个关联词语之后,可以根据下式分别计算所述目标词语与各个关联词语之间的第一关联度:After each related word of the target word is determined, the first degree of relevance between the target word and each related word can be calculated according to the following formula:
Figure PCTCN2019116635-appb-000006
Figure PCTCN2019116635-appb-000006
其中,c为所述目标词语的各个关联词语的序号,1≤c≤CN,CN为所述目标词语 的关联词语的总数,ConNum c为所述目标词语的第c个关联词语的有效频次,假设第c个关联词语在所述法律文本库出现的次数为Num,其中有Num1次与最接近的所述目标词语的间隔小于所述间隔阈值,则第c个关联词语的有效频次即为Num1,剩余的次数(Num-Num1)为无效频次,FtConnect c为所述目标词语与第c个关联词语之间的第一关联度。 Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, and ConNum c is the effective frequency of the c-th related word of the target word, Assuming that the number of occurrences of the c-th related word in the legal text database is Num, and the interval between Num1 times and the closest target word is less than the interval threshold, the effective frequency of the c-th related word is Num1 , The remaining number (Num-Num1) is the invalid frequency, and FtConnect c is the first degree of relevance between the target word and the c-th related word.
步骤S1043、在预设的第二词语向量数据库中分别查询所述目标词语的词语向量,以及各个关联词语的词语向量。Step S1043: Query the word vector of the target word and the word vector of each related word in the preset second word vector database.
步骤S1044、根据所述目标词语与各个关联词语之间的第一关联度,以及各个关联词语的词语向量,对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量。Step S1044: According to the first degree of relevance between the target word and each related word, and the word vector of each related word, update the word vector of the target word to obtain the updated word vector of the target word.
具体地,可以首先根据下式分别计算所述目标词语与各个关联词语之间的第二关联度:Specifically, the second degree of relevance between the target word and each related word may be calculated first according to the following formula:
Figure PCTCN2019116635-appb-000007
Figure PCTCN2019116635-appb-000007
其中,d为词语向量的维度序号,1≤d≤DN,DN为词语向量的维度总数,TgtElm d为所述目标词语的词语向量在第d个维度上的取值,CntElm c,d为所述目标词语的第c个关联词语的词语向量在第d个维度上的取值,SdConnect c为所述目标词语与第c个关联词语之间的第二关联度; Among them, d is the dimension number of the word vector, 1≤d≤DN, DN is the total number of dimensions of the word vector, TgtElm d is the value of the word vector of the target word in the dth dimension, and CntElm c,d is the value of the word vector. State the value of the word vector of the c-th related word of the target word in the d-th dimension, and SdConnect c is the second degree of relevance between the target word and the c-th related word;
然后,根据下式分别计算所述目标词语与各个关联词语之间的关联度误差:Then, calculate the correlation error between the target word and each related word according to the following formula:
ErrElm c=SdConnect c—FtConnect c ErrElm c =SdConnect c —FtConnect c
其中,ErrElm c为所述目标词语与第c个关联词语之间的关联度误差; Wherein, ErrElm c is the relevance error between the target word and the c-th related word;
最后,根据下式对所述目标词语的词语向量进行更新计算:Finally, update the word vector of the target word according to the following formula:
Figure PCTCN2019116635-appb-000008
Figure PCTCN2019116635-appb-000008
其中,λ为预设的更新系数,可以根据实际情况对其取值进行设置,例如,可以将其设置为0.01、0.001或者其它取值等等,NwTgtElm d为所述目标词语的更新词语向 量在第d个维度上的取值。 Among them, λ is a preset update coefficient, and its value can be set according to the actual situation, for example, it can be set to 0.01, 0.001 or other values, etc., NwTgtElm d is the update word vector of the target word in The value on the dth dimension.
步骤S1045、将所述目标词语的更新词语向量添加入所述第一词语向量数据库中。Step S1045: Add the updated word vector of the target word into the first word vector database.
按照这样的方式,遍历所述法律文本库的所有词语,对各个词语的词语向量均进行更新,得到对应的更新词语向量,最后将所有词语的更新词语向量构造出最终的所述第一词语向量数据库。In this way, all the words in the legal text database are traversed, the word vectors of each word are updated to obtain the corresponding updated word vectors, and finally the first word vector is constructed from the updated word vectors of all words database.
步骤S105、根据所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量,分别计算所述核心词子集与各个特征词集合之间的向量距离。Step S105: Calculate the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set.
具体地,可以根据下式分别计算所述核心词子集与各个特征词集合之间的向量距离:Specifically, the vector distance between the core word subset and each feature word set can be calculated separately according to the following formula:
Figure PCTCN2019116635-appb-000009
Figure PCTCN2019116635-appb-000009
其中,k为所述核心词子集中的词语序号,1≤k≤KN,KN为所述核心词子集中的词语总数,t为各个存储分区的序号,1≤t≤TN,e为各个特征词集合中的词语序号,1≤e≤EN t,EN t为第t个特征词集合中的词语总数,第t个特征词集合为与第t个存储分区对应的特征词集合,KeyElm k,d为所述核心词子集中的第k个词语的词语向量在第d个维度上的取值,EigElm t,e,d为第t个特征词集合中的第e个词语的词语向量在第d个维度上的取值,Dis t为所述核心词子集与第t个特征词集合之间的向量距离。 Where k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the sequence number of each storage partition, 1≤t≤TN, and e is each feature The word sequence number in the word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set is the feature word set corresponding to the t-th storage partition, KeyElm k, d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, and EigElm t,e,d is the word vector of the e-th word in the t-th feature word set. Values in d dimensions, Dis t is the vector distance between the core word subset and the t-th feature word set.
步骤S106、将所述法律文本存储入优选存储分区中。Step S106: Store the legal text in the preferred storage partition.
所述优选存储分区为与所述核心词子集之间的向量距离最小的特征词集合所对应的存储分区。具体地,可以根据下式选取出所述法律文本所属的优选存储分区:The preferred storage partition is the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset. Specifically, the preferred storage partition to which the legal text belongs can be selected according to the following formula:
TgtLawDom=Argmin(DisSq)TgtLawDom=Argmin(DisSq)
=Argmin(Dis 1,Dis 2,......,Dis t,......,Dis TN) =Argmin(Dis 1 ,Dis 2 ,......,Dis t ,......,Dis TN )
其中,Argmin为最小自变量函数,DisSq为所述核心词子集的向量距离序列,且:DisSq=(Dis 1,Dis 2,......,Dis t,......,Dis TN),TgtLawDom为所述法律文本所属的优选存储分区的序号。 Among them, Argmin is the smallest independent variable function, DisSq is the vector distance sequence of the core word subset, and: DisSq=(Dis 1 ,Dis 2 ,......,Dis t ,......, Dis TN ), TgtLawDom is the serial number of the preferred storage partition to which the legal text belongs.
综上所述,在本申请实施例中,按照法律文本的实际核心内容进行存储,内容相似的法律文本会被存储入同一存储分区中,当用户需要查询相关资料时,仅需在对应的存储分区中进行查找即可,节省了对于人力成本的耗费,大大提高了工作效率。To sum up, in the embodiments of this application, the actual core content of the legal text is stored, and the legal texts with similar content will be stored in the same storage partition. When the user needs to query related materials, he only needs to store it in the corresponding storage. The search can be performed in the partition, which saves the labor cost and greatly improves the work efficiency.
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.
对应于上文实施例所述的一种法律文本存储方法,图4示出了本申请实施例提供的一种法律文本存储装置的一个实施例结构图。Corresponding to a legal text storage method described in the above embodiment, FIG. 4 shows a structural diagram of an embodiment of a legal text storage device provided in an embodiment of the present application.
本实施例中,一种法律文本存储装置可以包括:In this embodiment, a legal text storage device may include:
法律文本获取模块401,用于接收法律文本存储指令,提取所述法律文本存储指令中的目标地址,并获取所述目标地址中的法律文本;The legal text obtaining module 401 is configured to receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address;
第一分词处理模块402,用于对所述法律文本进行分词处理,得到组成所述法律文本的词语集合;The first word segmentation processing module 402 is configured to perform word segmentation processing on the legal text to obtain a set of words that make up the legal text;
核心词子集选取模块403,用于从所述词语集合中选取核心词子集,所述核心词子集中包括词条密度大于预设的第一阈值且均匀度大于预设的第二阈值的各个词语;The core word subset selection module 403 is configured to select a core word subset from the word set. The core word subset includes those whose term density is greater than a preset first threshold and their uniformity is greater than a preset second threshold Various words
第一词语向量查询模块404,用于分别获取与各个存储分区对应的各个特征词集合,并在预设的第一词语向量数据库中分别查询所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量;The first word vector query module 404 is configured to obtain each feature word set corresponding to each storage partition, and respectively query the word vector of each word in the core word subset in the preset first word vector database, and The word vector of each word in each feature word set;
向量距离计算模块405,用于根据所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量,分别计算所述核心词子集与各个特征词集合之间的向量距离;The vector distance calculation module 405 is configured to calculate the distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set. The vector distance;
分区存储模块406,用于将所述法律文本存储入优选存储分区中,所述优选存储分区为与所述核心词子集之间的向量距离最小的特征词集合所对应的存储分区。The partition storage module 406 is configured to store the legal text in a preferred storage partition, which is the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
进一步地,所述法律文本存储装置还可以包括:Further, the legal text storage device may further include:
第二分词处理模块,用于对预设的法律文本库中的各条法律文本进行分词处理,得到组成所述法律文本库的各个词语;The second word segmentation processing module is used to perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that constitutes the legal text database;
第一关联度计算模块,用于确定目标词语的各个关联词语,并分别计算所述目标词语与各个关联词语之间的第一关联度,所述目标词语为组成所述法律文本库的任意一个词语;The first degree of relevance calculation module is used to determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, and the target word is any one that composes the legal text database Words
第二词语向量查询模块,用于在预设的第二词语向量数据库中分别查询所述目标词语的词语向量,以及各个关联词语的词语向量;The second word vector query module is used to query the word vector of the target word and the word vector of each related word in the preset second word vector database;
更新计算模块,用于根据所述目标词语与各个关联词语之间的第一关联度,以及各个关联词语的词语向量,对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量;The update calculation module is used to update and calculate the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the update of the target word Word vector
向量添加模块,用于将所述目标词语的更新词语向量添加入所述第一词语向量数 据库中。The vector adding module is used to add the updated word vector of the target word to the first word vector database.
进一步地,所述更新计算模块可以包括:Further, the update calculation module may include:
第一计算单元,用于分别计算所述目标词语与各个关联词语之间的第二关联度;The first calculation unit is configured to calculate the second degree of relevance between the target word and each related word;
第二计算单元,用于分别计算所述目标词语与各个关联词语之间的关联度误差;The second calculation unit is used to calculate the correlation error between the target word and each related word respectively;
第三计算单元,用于对所述目标词语的词语向量进行更新计算。The third calculation unit is used to update and calculate the word vector of the target word.
进一步地,所述核心词子集选取模块可以包括:Further, the core word subset selection module may include:
词条密度计算单元,用于分别计算所述词语集合中的各个词语的词条密度;The term density calculation unit is used to calculate the term density of each word in the word set;
均匀度计算单元,用于分别计算所述词语集合中的各个词语的均匀度;A uniformity calculation unit for calculating the uniformity of each word in the word set;
核心词子集选取单元,用于从所述词语集合中选取词条密度大于所述第一阈值且均匀度大于所述第二阈值的各个词语组成所述核心词子集。The core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的装置,模块和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and conciseness of the description, the specific working processes of the above described devices, modules and units can refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail or recorded in an embodiment, reference may be made to related descriptions of other embodiments.
图5示出了本申请实施例提供的一种终端设备的示意框图,为了便于说明,仅示出了与本申请实施例相关的部分。FIG. 5 shows a schematic block diagram of a terminal device provided by an embodiment of the present application. For ease of description, only parts related to the embodiment of the present application are shown.
在本实施例中,所述终端设备5可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。该终端设备5可包括:处理器50、存储器51以及存储在所述存储器51中并可在所述处理器50上运行的计算机可读指令52,例如执行上述的法律文本存储方法的计算机可读指令。所述处理器50执行所述计算机可读指令52时实现上述各个法律文本存储方法实施例中的步骤,例如图1所示的步骤S101至S106。或者,所述处理器50执行所述计算机可读指令52时实现上述各装置实施例中各模块/单元的功能,例如图4所示模块401至406的功能。In this embodiment, the terminal device 5 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The terminal device 5 may include: a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and running on the processor 50, for example, a computer-readable instruction that executes the foregoing legal text storage method instruction. When the processor 50 executes the computer-readable instructions 52, the steps in the foregoing legal text storage method embodiments are implemented, for example, steps S101 to S106 shown in FIG. 1. Alternatively, when the processor 50 executes the computer-readable instructions 52, the functions of the modules/units in the foregoing device embodiments, such as the functions of the modules 401 to 406 shown in FIG. 4, are implemented.
示例性的,所述计算机可读指令52可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器51中,并由所述处理器50执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机可读指令段,该指令段用于描述所述计算机可读指令52在所述终端设备5中的执行过程。Exemplarily, the computer-readable instructions 52 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 51 and executed by the processor 50, To complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the terminal device 5.
所述处理器50可以是中央处理单元(Central Processing Unit,CPU),还可以是其它通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其它可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 50 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
所述存储器51可以是所述终端设备5的内部存储单元,例如终端设备5的硬盘或内存。所述存储器51也可以是所述终端设备5的外部存储设备,例如所述终端设备5上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器51还可以既包括所述终端设备5的内部存储单元也包括外部存储设备。所述存储器51用于存储所述计算机可读指令以及所述终端设备5所需的其它指令和数据。所述存储器51还可以用于暂时地存储已经输出或者将要输出的数据。The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk equipped on the terminal device 5, a smart memory card (Smart Media Card, SMC), and a Secure Digital (SD) Card, Flash Card, etc. Further, the memory 51 may also include both an internal storage unit of the terminal device 5 and an external storage device. The memory 51 is used to store the computer-readable instructions and other instructions and data required by the terminal device 5. The memory 51 can also be used to temporarily store data that has been output or will be output.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一计算机非易失性可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。A person of ordinary skill in the art can understand that all or part of the processes in the method of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer. In a readable storage medium, when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that it can still implement the foregoing The technical solutions recorded in the examples are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. 一种法律文本存储方法,其特征在于,包括:A method for storing legal text, which is characterized in that it includes:
    接收法律文本存储指令,提取所述法律文本存储指令中的目标地址,并获取所述目标地址中的法律文本;Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;
    对所述法律文本进行分词处理,得到组成所述法律文本的词语集合;Perform word segmentation processing on the legal text to obtain a collection of words that constitute the legal text;
    从所述词语集合中选取核心词子集,所述核心词子集中包括词条密度大于预设的第一阈值且均匀度大于预设的第二阈值的各个词语;Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;
    分别获取与各个预设的存储分区对应的各个特征词集合,并在预设的第一词语向量数据库中分别查询所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量;Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector
    根据所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量,分别计算所述核心词子集与各个特征词集合之间的向量距离;Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;
    将所述法律文本存储入优选存储分区中,所述优选存储分区为与所述核心词子集之间的向量距离最小的特征词集合所对应的存储分区。The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
  2. 根据权利要求1所述的法律文本存储方法,其特征在于,所述第一词语向量数据库的设置过程包括:The legal text storage method according to claim 1, wherein the setting process of the first word vector database comprises:
    对预设的法律文本库中的各条法律文本进行分词处理,得到组成所述法律文本库的各个词语;Perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that composes the legal text database;
    确定目标词语的各个关联词语,并分别计算所述目标词语与各个关联词语之间的第一关联度,所述目标词语为组成所述法律文本库的任意一个词语;Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, where the target word is any word that composes the legal text database;
    在预设的第二词语向量数据库中分别查询所述目标词语的词语向量,以及各个关联词语的词语向量;Respectively query the word vector of the target word and the word vector of each related word in the preset second word vector database;
    根据所述目标词语与各个关联词语之间的第一关联度,以及各个关联词语的词语向量,对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量;Update the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the updated word vector of the target word;
    将所述目标词语的更新词语向量添加入所述第一词语向量数据库中。The updated word vector of the target word is added to the first word vector database.
  3. 根据权利要求2所述的法律文本存储方法,其特征在于,所述对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量包括:The legal text storage method according to claim 2, wherein said updating the word vector of the target word to obtain the updated word vector of the target word comprises:
    根据下式分别计算所述目标词语与各个关联词语之间的第二关联度:Calculate the second degree of relevance between the target word and each related word according to the following formula:
    Figure PCTCN2019116635-appb-100001
    Figure PCTCN2019116635-appb-100001
    其中,c为所述目标词语的各个关联词语的序号,1≤c≤CN,CN为所述目标词语的关联词语的总数,d为词语向量的维度序号,1≤d≤DN,DN为词语向量的维度总数,TgtElm d为所述目标词语的词语向量在第d个维度上的取值,CntElm c,d为所述目标词语的第c个关联词语的词语向量在第d个维度上的取值,SdConnect c为所述目标词语与第c个关联词语之间的第二关联度; Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;
    根据下式分别计算所述目标词语与各个关联词语之间的关联度误差:Calculate the correlation error between the target word and each related word according to the following formula:
    ErrElm c=SdConnect c—FtConnect c ErrElm c =SdConnect c —FtConnect c
    其中,FtConnect c为所述目标词语与第c个关联词语之间的第一关联度,ErrElm c为所述目标词语与第c个关联词语之间的关联度误差; Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;
    根据下式对所述目标词语的词语向量进行更新计算:The word vector of the target word is updated and calculated according to the following formula:
    Figure PCTCN2019116635-appb-100002
    Figure PCTCN2019116635-appb-100002
    其中,λ为预设的更新系数,NwTgtElm d为所述目标词语的更新词语向量在第d个维度上的取值。 Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
  4. 根据权利要求1所述的法律文本存储方法,其特征在于,所述分别计算所述核心词子集与各个特征词集合之间的向量距离包括:The legal text storage method according to claim 1, wherein the calculating the vector distance between the core word subset and each feature word set respectively comprises:
    根据下式分别计算所述核心词子集与各个特征词集合之间的向量距离:Calculate the vector distance between the core word subset and each feature word set according to the following formula:
    Figure PCTCN2019116635-appb-100003
    Figure PCTCN2019116635-appb-100003
    其中,k为所述核心词子集中的词语序号,1≤k≤KN,KN为所述核心词子集中的词语总数,t为各个存储分区的序号,1≤t≤TN,TN为存储分区的总数,e为各个特征词集合中的词语序号,1≤e≤EN t,EN t为第t个特征词集合中的词语总数,第t个特征词集合为与第t个存储分区对应的特征词集合,KeyElm k,d为所述核心词子集中的第k个词语的词语向量在第d个维度上的取值,EigElm t,e,d为第t个特征词集合中的第e个词语的词语向量在第d个维度上的取值,Dis t为所述核心词子集与第t个特征词集合之间的向量距离。 Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
  5. 根据权利要求1至4中任一项所述的法律文本存储方法,其特征在于,从所述 词语集合中选取核心词子集包括:The legal text storage method according to any one of claims 1 to 4, wherein selecting a core word subset from the word set comprises:
    根据下式分别计算所述词语集合中的各个词语的词条密度:Calculate the entry density of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100004
    Figure PCTCN2019116635-appb-100004
    其中,w为所述词语集合中的各个词语的序号,1≤w≤WN,WN为所述词语集合中的词语数目,WdNum w为所述词语集合中的第w个词语在所述法律文本中出现的次数,LineNum为所述法律文本的总行数,WdDensity w为所述词语集合中的第w个词语的词条密度; Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;
    将所述法律文本划分为FN个文本段落,并分别统计所述词语集合中的各个词语在各个文本段落中的出现情况,FN为大于1的整数;Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, where FN is an integer greater than 1.
    根据下式分别计算所述词语集合中的各个词语的均匀度:Calculate the uniformity of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100005
    Figure PCTCN2019116635-appb-100005
    其中,f为所述法律文本的各个文本段落的序号,1≤f≤FN,Flag w,f为所述词语集合中的第w个词语在第f个文本段落中的出现情况的标志位,且
    Figure PCTCN2019116635-appb-100006
    WdEqu w为所述词语集合中的第w个词语的均匀度;
    Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
    Figure PCTCN2019116635-appb-100006
    WdEqu w is the uniformity of the wth word in the word set;
    从所述词语集合中选取词条密度大于所述第一阈值且均匀度大于所述第二阈值的各个词语组成所述核心词子集。Each word with a word density greater than the first threshold and a uniformity greater than the second threshold is selected from the word set to form the core word subset.
  6. 一种法律文本存储装置,其特征在于,包括:A legal text storage device, characterized in that it comprises:
    法律文本获取模块,用于接收法律文本存储指令,提取所述法律文本存储指令中的目标地址,并获取所述目标地址中的法律文本;A legal text acquisition module, configured to receive a legal text storage instruction, extract a target address in the legal text storage instruction, and obtain a legal text in the target address;
    第一分词处理模块,用于对所述法律文本进行分词处理,得到组成所述法律文本的词语集合;The first word segmentation processing module is used to perform word segmentation processing on the legal text to obtain a set of words that constitute the legal text;
    核心词子集选取模块,用于从所述词语集合中选取核心词子集,所述核心词子集中包括词条密度大于预设的第一阈值且均匀度大于预设的第二阈值的各个词语;The core word subset selection module is configured to select a core word subset from the word set. The core word subset includes each item whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold Words
    第一词语向量查询模块,用于分别获取与各个存储分区对应的各个特征词集合,并在预设的第一词语向量数据库中分别查询所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量;The first word vector query module is used to obtain each feature word set corresponding to each storage partition, and respectively query the word vector of each word in the core word subset in the preset first word vector database, and each The word vector of each word in the feature word set;
    向量距离计算模块,用于根据所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量,分别计算所述核心词子集与各个特征词集合之间的向量距离;The vector distance calculation module is used to calculate the difference between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set. Vector distance
    分区存储模块,用于将所述法律文本存储入优选存储分区中,所述优选存储分区为与所述核心词子集之间的向量距离最小的特征词集合所对应的存储分区。The partition storage module is configured to store the legal text in a preferred storage partition, the preferred storage partition being the storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
  7. 根据权利要求6所述的法律文本存储装置,其特征在于,还包括:The legal text storage device according to claim 6, further comprising:
    第二分词处理模块,用于对预设的法律文本库中的各条法律文本进行分词处理,得到组成所述法律文本库的各个词语;The second word segmentation processing module is used to perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that constitutes the legal text database;
    第一关联度计算模块,用于确定目标词语的各个关联词语,并分别计算所述目标词语与各个关联词语之间的第一关联度,所述目标词语为组成所述法律文本库的任意一个词语;The first degree of relevance calculation module is used to determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, and the target word is any one that composes the legal text database Words
    第二词语向量查询模块,用于在预设的第二词语向量数据库中分别查询所述目标词语的词语向量,以及各个关联词语的词语向量;The second word vector query module is used to query the word vector of the target word and the word vector of each related word in the preset second word vector database;
    更新计算模块,用于根据所述目标词语与各个关联词语之间的第一关联度,以及各个关联词语的词语向量,对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量;The update calculation module is used to update and calculate the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the update of the target word Word vector
    向量添加模块,用于将所述目标词语的更新词语向量添加入所述第一词语向量数据库中。The vector adding module is used to add the updated word vector of the target word into the first word vector database.
  8. 根据权利要求7所述的法律文本存储装置,其特征在于,所述更新计算模块包括:The legal text storage device according to claim 7, wherein the update calculation module comprises:
    第一计算单元,用于根据下式分别计算所述目标词语与各个关联词语之间的第二关联度:The first calculation unit is configured to calculate the second degree of relevance between the target word and each related word according to the following formula:
    Figure PCTCN2019116635-appb-100007
    Figure PCTCN2019116635-appb-100007
    其中,c为所述目标词语的各个关联词语的序号,1≤c≤CN,CN为所述目标词语的关联词语的总数,d为词语向量的维度序号,1≤d≤DN,DN为词语向量的维度总数,TgtElm d为所述目标词语的词语向量在第d个维度上的取值,CntElm c,d为所述目标词语的第c个关联词语的词语向量在第d个维度上的取值,SdConnect c为所述目标词语与第c个关联词语之间的第二关联度; Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d-th dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d-th dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;
    第二计算单元,用于根据下式分别计算所述目标词语与各个关联词语之间的关联度误差:The second calculation unit is configured to calculate the correlation error between the target word and each related word according to the following formula:
    ErrElm c=SdConnect c—FtConnect c ErrElm c =SdConnect c —FtConnect c
    其中,FtConnect c为所述目标词语与第c个关联词语之间的第一关联度,ErrElm c为所述目标词语与第c个关联词语之间的关联度误差; Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;
    第三计算单元,用于根据下式对所述目标词语的词语向量进行更新计算:The third calculation unit is used to update and calculate the word vector of the target word according to the following formula:
    Figure PCTCN2019116635-appb-100008
    Figure PCTCN2019116635-appb-100008
    其中,λ为预设的更新系数,NwTgtElm d为所述目标词语的更新词语向量在第d个维度上的取值。 Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
  9. 根据权利要求6所述的法律文本存储装置,其特征在于,所述向量距离计算模块具体用于根据下式分别计算所述核心词子集与各个特征词集合之间的向量距离:The legal text storage device according to claim 6, wherein the vector distance calculation module is specifically configured to calculate the vector distance between the core word subset and each feature word set according to the following formula:
    Figure PCTCN2019116635-appb-100009
    Figure PCTCN2019116635-appb-100009
    其中,k为所述核心词子集中的词语序号,1≤k≤KN,KN为所述核心词子集中的词语总数,t为各个存储分区的序号,1≤t≤TN,TN为存储分区的总数,e为各个特征词集合中的词语序号,1≤e≤EN t,EN t为第t个特征词集合中的词语总数,第t个特征词集合为与第t个存储分区对应的特征词集合,KeyElm k,d为所述核心词子集中的第k个词语的词语向量在第d个维度上的取值,EigElm t,e,d为第t个特征词集合中的第e个词语的词语向量在第d个维度上的取值,Dis t为所述核心词子集与第t个特征词集合之间的向量距离。 Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
  10. 根据权利要求6至9中任一项所述的法律文本存储装置,其特征在于,所述核心词子集选取模块包括:The legal text storage device according to any one of claims 6 to 9, wherein the core word subset selection module comprises:
    词条密度计算单元,用于根据下式分别计算所述词语集合中的各个词语的词条密度:The term density calculation unit is used to calculate the term density of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100010
    Figure PCTCN2019116635-appb-100010
    其中,w为所述词语集合中的各个词语的序号,1≤w≤WN,WN为所述词语集合 中的词语数目,WdNum w为所述词语集合中的第w个词语在所述法律文本中出现的次数,LineNum为所述法律文本的总行数,WdDensity w为所述词语集合中的第w个词语的词条密度; Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;
    文本段落划分单元,用于将所述法律文本划分为FN个文本段落,并分别统计所述词语集合中的各个词语在各个文本段落中的出现情况,FN为大于1的整数;The text paragraph dividing unit is used to divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, and FN is an integer greater than one;
    均匀度计算单元,用于根据下式分别计算所述词语集合中的各个词语的均匀度:The uniformity calculation unit is used to calculate the uniformity of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100011
    Figure PCTCN2019116635-appb-100011
    其中,f为所述法律文本的各个文本段落的序号,1≤f≤FN,Flag w,f为所述词语集合中的第w个词语在第f个文本段落中的出现情况的标志位,且
    Figure PCTCN2019116635-appb-100012
    WdEqu w为所述词语集合中的第w个词语的均匀度;
    Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
    Figure PCTCN2019116635-appb-100012
    WdEqu w is the uniformity of the wth word in the word set;
    核心词子集选取单元,用于从所述词语集合中选取词条密度大于所述第一阈值且均匀度大于所述第二阈值的各个词语组成所述核心词子集。The core word subset selection unit is configured to select, from the word set, each word whose term density is greater than the first threshold and the uniformity is greater than the second threshold to form the core word subset.
  11. 一种计算机非易失性可读存储介质,所述计算机非易失性可读存储介质存储有计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现如下步骤:A computer non-volatile readable storage medium, the computer non-volatile readable storage medium storing computer readable instructions, characterized in that the computer readable instructions are executed by a processor to implement the following steps:
    接收法律文本存储指令,提取所述法律文本存储指令中的目标地址,并获取所述目标地址中的法律文本;Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;
    对所述法律文本进行分词处理,得到组成所述法律文本的词语集合;Perform word segmentation processing on the legal text to obtain a collection of words that make up the legal text;
    从所述词语集合中选取核心词子集,所述核心词子集中包括词条密度大于预设的第一阈值且均匀度大于预设的第二阈值的各个词语;Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;
    分别获取与各个预设的存储分区对应的各个特征词集合,并在预设的第一词语向量数据库中分别查询所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量;Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector
    根据所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量,分别计算所述核心词子集与各个特征词集合之间的向量距离;Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;
    将所述法律文本存储入优选存储分区中,所述优选存储分区为与所述核心词子集之间的向量距离最小的特征词集合所对应的存储分区。The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
  12. 根据权利要求11所述的计算机非易失性可读存储介质,其特征在于,所述第一词语向量数据库的设置过程包括:The computer non-volatile readable storage medium according to claim 11, wherein the setting process of the first word vector database comprises:
    对预设的法律文本库中的各条法律文本进行分词处理,得到组成所述法律文本库的各个词语;Perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that composes the legal text database;
    确定目标词语的各个关联词语,并分别计算所述目标词语与各个关联词语之间的第一关联度,所述目标词语为组成所述法律文本库的任意一个词语;Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, where the target word is any word that composes the legal text database;
    在预设的第二词语向量数据库中分别查询所述目标词语的词语向量,以及各个关联词语的词语向量;Respectively query the word vector of the target word and the word vector of each related word in the preset second word vector database;
    根据所述目标词语与各个关联词语之间的第一关联度,以及各个关联词语的词语向量,对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量;Update the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the updated word vector of the target word;
    将所述目标词语的更新词语向量添加入所述第一词语向量数据库中。The updated word vector of the target word is added to the first word vector database.
  13. 根据权利要求12所述的计算机非易失性可读存储介质,其特征在于,所述对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量包括:The computer non-volatile readable storage medium according to claim 12, wherein said updating the word vector of the target word to obtain the updated word vector of the target word comprises:
    根据下式分别计算所述目标词语与各个关联词语之间的第二关联度:Calculate the second degree of relevance between the target word and each related word according to the following formula:
    Figure PCTCN2019116635-appb-100013
    Figure PCTCN2019116635-appb-100013
    其中,c为所述目标词语的各个关联词语的序号,1≤c≤CN,CN为所述目标词语的关联词语的总数,d为词语向量的维度序号,1≤d≤DN,DN为词语向量的维度总数,TgtElm d为所述目标词语的词语向量在第d个维度上的取值,CntElm c,d为所述目标词语的第c个关联词语的词语向量在第d个维度上的取值,SdConnect c为所述目标词语与第c个关联词语之间的第二关联度; Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d-th dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d-th dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;
    根据下式分别计算所述目标词语与各个关联词语之间的关联度误差:Calculate the correlation error between the target word and each related word according to the following formula:
    ErrElm c=SdConnect c—FtConnect c ErrElm c =SdConnect c —FtConnect c
    其中,FtConnect c为所述目标词语与第c个关联词语之间的第一关联度,ErrElm c为所述目标词语与第c个关联词语之间的关联度误差; Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;
    根据下式对所述目标词语的词语向量进行更新计算:The word vector of the target word is updated and calculated according to the following formula:
    Figure PCTCN2019116635-appb-100014
    Figure PCTCN2019116635-appb-100014
    其中,λ为预设的更新系数,NwTgtElm d为所述目标词语的更新词语向量在第d个维度上的取值。 Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
  14. 根据权利要求11所述的计算机非易失性可读存储介质,其特征在于,所述分别计算所述核心词子集与各个特征词集合之间的向量距离包括:The computer non-volatile readable storage medium according to claim 11, wherein said calculating the vector distance between the core word subset and each feature word set respectively comprises:
    根据下式分别计算所述核心词子集与各个特征词集合之间的向量距离:Calculate the vector distance between the core word subset and each feature word set according to the following formula:
    Figure PCTCN2019116635-appb-100015
    Figure PCTCN2019116635-appb-100015
    其中,k为所述核心词子集中的词语序号,1≤k≤KN,KN为所述核心词子集中的词语总数,t为各个存储分区的序号,1≤t≤TN,TN为存储分区的总数,e为各个特征词集合中的词语序号,1≤e≤EN t,EN t为第t个特征词集合中的词语总数,第t个特征词集合为与第t个存储分区对应的特征词集合,KeyElm k,d为所述核心词子集中的第k个词语的词语向量在第d个维度上的取值,EigElm t,e,d为第t个特征词集合中的第e个词语的词语向量在第d个维度上的取值,Dis t为所述核心词子集与第t个特征词集合之间的向量距离。 Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
  15. 根据权利要求11至14中任一项所述的计算机非易失性可读存储介质,其特征在于,从所述词语集合中选取核心词子集包括:The computer non-volatile readable storage medium according to any one of claims 11 to 14, wherein selecting a core word subset from the word set comprises:
    根据下式分别计算所述词语集合中的各个词语的词条密度:Calculate the entry density of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100016
    Figure PCTCN2019116635-appb-100016
    其中,w为所述词语集合中的各个词语的序号,1≤w≤WN,WN为所述词语集合中的词语数目,WdNum w为所述词语集合中的第w个词语在所述法律文本中出现的次数,LineNum为所述法律文本的总行数,WdDensity w为所述词语集合中的第w个词语的词条密度; Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;
    将所述法律文本划分为FN个文本段落,并分别统计所述词语集合中的各个词语在各个文本段落中的出现情况,FN为大于1的整数;Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, where FN is an integer greater than 1.
    根据下式分别计算所述词语集合中的各个词语的均匀度:Calculate the uniformity of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100017
    Figure PCTCN2019116635-appb-100017
    其中,f为所述法律文本的各个文本段落的序号,1≤f≤FN,Flag w,f为所述词语集合中的第w个词语在第f个文本段落中的出现情况的标志位,且
    Figure PCTCN2019116635-appb-100018
    WdEqu w为所述词语集合中的第w个词语的均匀度;
    Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
    Figure PCTCN2019116635-appb-100018
    WdEqu w is the uniformity of the wth word in the word set;
    从所述词语集合中选取词条密度大于所述第一阈值且均匀度大于所述第二阈值的各个词语组成所述核心词子集。Each word with a word density greater than the first threshold and a uniformity greater than the second threshold is selected from the word set to form the core word subset.
  16. 一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and running on the processor, wherein the processor executes the computer-readable instructions as follows step:
    接收法律文本存储指令,提取所述法律文本存储指令中的目标地址,并获取所述目标地址中的法律文本;Receiving a legal text storage instruction, extracting the target address in the legal text storing instruction, and obtaining the legal text in the target address;
    对所述法律文本进行分词处理,得到组成所述法律文本的词语集合;Perform word segmentation processing on the legal text to obtain a collection of words that make up the legal text;
    从所述词语集合中选取核心词子集,所述核心词子集中包括词条密度大于预设的第一阈值且均匀度大于预设的第二阈值的各个词语;Selecting a core word subset from the word set, the core word subset including each word whose term density is greater than a preset first threshold and evenness is greater than a preset second threshold;
    分别获取与各个预设的存储分区对应的各个特征词集合,并在预设的第一词语向量数据库中分别查询所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量;Obtain each feature word set corresponding to each preset storage partition, and query the word vector of each word in the core word subset and each feature word set in the preset first word vector database. Word vector
    根据所述核心词子集中的各个词语的词语向量,以及各个特征词集合中的各个词语的词语向量,分别计算所述核心词子集与各个特征词集合之间的向量距离;Respectively calculating the vector distance between the core word subset and each feature word set according to the word vector of each word in the core word subset and the word vector of each word in each feature word set;
    将所述法律文本存储入优选存储分区中,所述优选存储分区为与所述核心词子集之间的向量距离最小的特征词集合所对应的存储分区。The legal text is stored in a preferred storage partition, which is a storage partition corresponding to the feature word set with the smallest vector distance between the core word subset.
  17. 根据权利要求16所述的终端设备,其特征在于,所述第一词语向量数据库的设置过程包括:The terminal device according to claim 16, wherein the setting process of the first word vector database comprises:
    对预设的法律文本库中的各条法律文本进行分词处理,得到组成所述法律文本库的各个词语;Perform word segmentation processing on each piece of legal text in the preset legal text database to obtain each word that composes the legal text database;
    确定目标词语的各个关联词语,并分别计算所述目标词语与各个关联词语之间的第一关联度,所述目标词语为组成所述法律文本库的任意一个词语;Determine each related word of the target word, and respectively calculate the first degree of relevance between the target word and each related word, where the target word is any word that composes the legal text database;
    在预设的第二词语向量数据库中分别查询所述目标词语的词语向量,以及各个关联词语的词语向量;Respectively query the word vector of the target word and the word vector of each related word in the preset second word vector database;
    根据所述目标词语与各个关联词语之间的第一关联度,以及各个关联词语的词语 向量,对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量;Update the word vector of the target word according to the first degree of relevance between the target word and each related word and the word vector of each related word to obtain the updated word vector of the target word;
    将所述目标词语的更新词语向量添加入所述第一词语向量数据库中。The updated word vector of the target word is added to the first word vector database.
  18. 根据权利要求17所述的终端设备,其特征在于,所述对所述目标词语的词语向量进行更新计算,得到所述目标词语的更新词语向量包括:The terminal device according to claim 17, wherein the updating and calculating the word vector of the target word to obtain the updated word vector of the target word comprises:
    根据下式分别计算所述目标词语与各个关联词语之间的第二关联度:Calculate the second degree of relevance between the target word and each related word according to the following formula:
    Figure PCTCN2019116635-appb-100019
    Figure PCTCN2019116635-appb-100019
    其中,c为所述目标词语的各个关联词语的序号,1≤c≤CN,CN为所述目标词语的关联词语的总数,d为词语向量的维度序号,1≤d≤DN,DN为词语向量的维度总数,TgtElm d为所述目标词语的词语向量在第d个维度上的取值,CntElm c,d为所述目标词语的第c个关联词语的词语向量在第d个维度上的取值,SdConnect c为所述目标词语与第c个关联词语之间的第二关联度; Where c is the serial number of each related word of the target word, 1≤c≤CN, CN is the total number of related words of the target word, d is the dimensional serial number of the word vector, 1≤d≤DN, and DN is the word The total number of dimensions of the vector, TgtElm d is the value of the word vector of the target word in the d-th dimension, and CntElm c,d is the word vector of the c-th related word of the target word in the d-th dimension Value, SdConnect c is the second degree of relevance between the target word and the c-th related word;
    根据下式分别计算所述目标词语与各个关联词语之间的关联度误差:Calculate the correlation error between the target word and each related word according to the following formula:
    ErrElm c=SdConnect c—FtConnect c ErrElm c =SdConnect c —FtConnect c
    其中,FtConnect c为所述目标词语与第c个关联词语之间的第一关联度,ErrElm c为所述目标词语与第c个关联词语之间的关联度误差; Wherein, FtConnect c is the first degree of relevance between the target word and the c-th related word, and ErrElm c is the degree of relevance error between the target word and the c-th related word;
    根据下式对所述目标词语的词语向量进行更新计算:The word vector of the target word is updated and calculated according to the following formula:
    Figure PCTCN2019116635-appb-100020
    Figure PCTCN2019116635-appb-100020
    其中,λ为预设的更新系数,NwTgtElm d为所述目标词语的更新词语向量在第d个维度上的取值。 Where λ is a preset update coefficient, and NwTgtElm d is the value of the update word vector of the target word in the dth dimension.
  19. 根据权利要求16所述的终端设备,其特征在于,所述分别计算所述核心词子集与各个特征词集合之间的向量距离包括:The terminal device according to claim 16, wherein the calculating the vector distance between the core word subset and each feature word set respectively comprises:
    根据下式分别计算所述核心词子集与各个特征词集合之间的向量距离:Calculate the vector distance between the core word subset and each feature word set according to the following formula:
    Figure PCTCN2019116635-appb-100021
    Figure PCTCN2019116635-appb-100021
    其中,k为所述核心词子集中的词语序号,1≤k≤KN,KN为所述核心词子集中的词语总数,t为各个存储分区的序号,1≤t≤TN,TN为存储分区的总数,e为各个特征词集合中的词语序号,1≤e≤EN t,EN t为第t个特征词集合中的词语总数,第t个特征词集合为与第t个存储分区对应的特征词集合,KeyElm k,d为所述核心词子集中的第k个词语的词语向量在第d个维度上的取值,EigElm t,e,d为第t个特征词集合中的第e个词语的词语向量在第d个维度上的取值,Dis t为所述核心词子集与第t个特征词集合之间的向量距离。 Wherein, k is the word sequence number in the core word subset, 1≤k≤KN, KN is the total number of words in the core word subset, t is the serial number of each storage partition, 1≤t≤TN, and TN is the storage partition E is the sequence number of each feature word set, 1≤e≤EN t , EN t is the total number of words in the t-th feature word set, and the t-th feature word set corresponds to the t-th storage partition Feature word set, KeyElm k,d is the value of the word vector of the k-th word in the core word subset in the d-th dimension, EigElm t,e,d is the e-th in the t-th characteristic word set The value of the word vector of each word in the d-th dimension, Dis t is the vector distance between the core word subset and the t-th feature word set.
  20. 根据权利要求16至19中任一项所述的终端设备,其特征在于,从所述词语集合中选取核心词子集包括:The terminal device according to any one of claims 16 to 19, wherein selecting a core word subset from the word set comprises:
    根据下式分别计算所述词语集合中的各个词语的词条密度:Calculate the entry density of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100022
    Figure PCTCN2019116635-appb-100022
    其中,w为所述词语集合中的各个词语的序号,1≤w≤WN,WN为所述词语集合中的词语数目,WdNum w为所述词语集合中的第w个词语在所述法律文本中出现的次数,LineNum为所述法律文本的总行数,WdDensity w为所述词语集合中的第w个词语的词条密度; Where w is the serial number of each word in the word set, 1≤w≤WN, WN is the number of words in the word set, and WdNum w is the wth word in the word set in the legal text The number of occurrences in, LineNum is the total number of lines in the legal text, and WdDensity w is the entry density of the w-th word in the word set;
    将所述法律文本划分为FN个文本段落,并分别统计所述词语集合中的各个词语在各个文本段落中的出现情况,FN为大于1的整数;Divide the legal text into FN text paragraphs, and respectively count the occurrence of each word in the word set in each text paragraph, where FN is an integer greater than 1.
    根据下式分别计算所述词语集合中的各个词语的均匀度:Calculate the uniformity of each word in the word set according to the following formula:
    Figure PCTCN2019116635-appb-100023
    Figure PCTCN2019116635-appb-100023
    其中,f为所述法律文本的各个文本段落的序号,1≤f≤FN,Flag w,f为所述词语集合中的第w个词语在第f个文本段落中的出现情况的标志位,且
    Figure PCTCN2019116635-appb-100024
    WdEqu w为所述词语集合中的第w个词语的均匀度;
    Where, f is the serial number of each text paragraph of the legal text, 1≤f≤FN, and Flag w,f is the flag bit of the occurrence of the wth word in the word set in the fth text paragraph, And
    Figure PCTCN2019116635-appb-100024
    WdEqu w is the uniformity of the wth word in the word set;
    从所述词语集合中选取词条密度大于所述第一阈值且均匀度大于所述第二阈值的各个词语组成所述核心词子集。Each word with a word density greater than the first threshold and a uniformity greater than the second threshold is selected from the word set to form the core word subset.
PCT/CN2019/116635 2019-09-03 2019-11-08 Legal text storage method and device, readable storage medium and terminal device WO2021042511A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910826805.4 2019-09-03
CN201910826805.4A CN110765230B (en) 2019-09-03 2019-09-03 Legal text storage method and device, readable storage medium and terminal equipment

Publications (1)

Publication Number Publication Date
WO2021042511A1 true WO2021042511A1 (en) 2021-03-11

Family

ID=69329300

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116635 WO2021042511A1 (en) 2019-09-03 2019-11-08 Legal text storage method and device, readable storage medium and terminal device

Country Status (2)

Country Link
CN (1) CN110765230B (en)
WO (1) WO2021042511A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113495954A (en) * 2020-03-20 2021-10-12 北京沃东天骏信息技术有限公司 Text data determination method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262966A1 (en) * 2012-04-02 2013-10-03 Industrial Technology Research Institute Digital content reordering method and digital content aggregator
US20140181109A1 (en) * 2012-12-22 2014-06-26 Industrial Technology Research Institute System and method for analysing text stream message thereof
CN107885749A (en) * 2016-09-30 2018-04-06 南京理工大学 Ontology extends the process knowledge search method with collaborative filtering Weighted Fusion
CN108804617A (en) * 2018-05-30 2018-11-13 广州杰赛科技股份有限公司 Field term abstracting method, device, terminal device and storage medium
CN109388712A (en) * 2018-09-21 2019-02-26 平安科技(深圳)有限公司 A kind of trade classification method and terminal device based on machine learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH096799A (en) * 1995-06-19 1997-01-10 Sharp Corp Document sorting device and document retrieving device
US6772149B1 (en) * 1999-09-23 2004-08-03 Lexis-Nexis Group System and method for identifying facts and legal discussion in court case law documents
US20150113388A1 (en) * 2013-10-22 2015-04-23 Qualcomm Incorporated Method and apparatus for performing topic-relevance highlighting of electronic text
CN106407442B (en) * 2016-09-28 2019-11-29 中国银行股份有限公司 A kind of mass text data processing method and device
CN111611798B (en) * 2017-01-22 2023-05-16 创新先进技术有限公司 Word vector processing method and device
US20190108276A1 (en) * 2017-10-10 2019-04-11 NEGENTROPICS Mesterséges Intelligencia Kutató és Fejlesztõ Kft Methods and system for semantic search in large databases
CN108334605B (en) * 2018-02-01 2020-06-16 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN109408636A (en) * 2018-09-29 2019-03-01 新华三大数据技术有限公司 File classification method and device
CN109408639B (en) * 2018-10-31 2022-05-31 广州虎牙科技有限公司 Bullet screen classification method, bullet screen classification device, bullet screen classification equipment and storage medium
CN109840051B (en) * 2018-12-27 2020-08-07 华为技术有限公司 Data storage method and device of storage system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130262966A1 (en) * 2012-04-02 2013-10-03 Industrial Technology Research Institute Digital content reordering method and digital content aggregator
US20140181109A1 (en) * 2012-12-22 2014-06-26 Industrial Technology Research Institute System and method for analysing text stream message thereof
CN107885749A (en) * 2016-09-30 2018-04-06 南京理工大学 Ontology extends the process knowledge search method with collaborative filtering Weighted Fusion
CN108804617A (en) * 2018-05-30 2018-11-13 广州杰赛科技股份有限公司 Field term abstracting method, device, terminal device and storage medium
CN109388712A (en) * 2018-09-21 2019-02-26 平安科技(深圳)有限公司 A kind of trade classification method and terminal device based on machine learning

Also Published As

Publication number Publication date
CN110765230A (en) 2020-02-07
CN110765230B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN107944480B (en) Enterprise industry classification method
WO2020114022A1 (en) Knowledge base alignment method and apparatus, computer device and storage medium
WO2020207074A1 (en) Information pushing method and device
WO2022142027A1 (en) Knowledge graph-based fuzzy matching method and apparatus, computer device, and storage medium
CN112256874A (en) Model training method, text classification method, device, computer equipment and medium
EP3179387A1 (en) Analytic systems, methods, and computer-readable media for structured, semi-structured, and unstructured documents
WO2022126963A1 (en) Customer profiling method based on customer response corpora, and device related thereto
CN110569289B (en) Column data processing method, equipment and medium based on big data
WO2020248379A1 (en) Method for searching for similar network pages, and apparatus
CN111291177A (en) Information processing method and device and computer storage medium
CN111159546A (en) Event pushing method and device, computer readable storage medium and computer equipment
CN111125366B (en) Text classification method and device
WO2021042511A1 (en) Legal text storage method and device, readable storage medium and terminal device
CN117453852B (en) File updating management method based on cloud storage
WO2023151576A1 (en) Search recommendation method, search recommendation system, computer device and storage medium
WO2022257455A1 (en) Determination metod and apparatus for similar text, and terminal device and storage medium
CN113792131B (en) Keyword extraction method and device, electronic equipment and storage medium
Pérez-Sancho et al. Style recognition through statistical event models
CN114691835A (en) Audit plan data generation method, device and equipment based on text mining
CN114547257A (en) Class matching method and device, computer equipment and storage medium
CN115587231A (en) Data combination processing and rapid storage and retrieval method based on cloud computing platform
CN114637846A (en) Video data processing method, video data processing device, computer equipment and storage medium
Faraz A comparison of text Categorization methods
CN109783816B (en) Short text clustering method and terminal equipment
WO2021042554A1 (en) Method and apparatus for archiving legal text, readable storage medium, and terminal device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19944376

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19944376

Country of ref document: EP

Kind code of ref document: A1