CN113535936A - Deep learning-based regulation and regulation retrieval method and system - Google Patents
Deep learning-based regulation and regulation retrieval method and system
- Publication number
- CN113535936A (application CN202110686425.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- regulation
- model
- chinese
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a deep learning-based rules and regulations retrieval method and system. The method comprises the following steps: 1. acquiring a query text input by a user; 2. obtaining the target word segments of the query text and their attributes; 3. constructing a rules and regulations database; 4. searching the database according to the target word segments and their attributes, and calculating a word-segmentation-based matching degree X_n; 5. calculating a semantics-based matching degree Y_n; 6. calculating a composite matching degree Z_n from X_n and Y_n; 7. according to the composite matching degree Z_n, the attributes of the target word segments of the query text, and the specific hierarchical relations within the rules and regulations, finally obtaining a number of retrieval results sorted in descending order. On the basis of deep learning, the method implements a Chinese text word segmentation model, a Chinese text dependency parsing model, an OCR character recognition model and an ESIM text similarity calculation model, and achieves fast and accurate retrieval of rules and regulations.
Description
Technical Field
The invention relates to the technical field of computers, and in particular to a deep learning-based rules and regulations retrieval method and system.
Background
Current rules and regulations (national laws and regulations, provincial regulations and enterprise rules) are so numerous that it is difficult for an ordinary person to become familiar with them or to apply them quickly when a situation requires it. Existing general-purpose search engines are not specifically optimized for retrieving rules and regulations: their semantic analysis is biased and their retrieval results are poor. Specifically, they lack both a professional, comprehensive rules and regulations database and retrieval matching at the semantic level. Developing an intelligent retrieval method and system that handles a search term or sentence on the basis of an existing rules and regulations library and deep learning therefore has great practical significance and application value.
Disclosure of Invention
In view of the above, the present invention provides a deep learning-based rules and regulations retrieval method and system, aiming to solve the technical problems that it is difficult to accurately obtain the specific content of the relevant rules and regulations from keywords through a general-purpose search engine, and that the results found are poorly relevant.
To achieve the above purpose, in a first aspect, the invention provides a deep learning-based rules and regulations retrieval method, which comprises the following specific steps:
S1, acquiring a query text provided by a user, and inputting the query text into a Chinese text word segmentation model to obtain each target word segment in the query text; inputting each target word segment into a Chinese text dependency parsing model to obtain the part of speech and the attribute of each target word segment; and screening the target word segments according to their parts of speech and attributes.
S2, searching the rules and regulations database to obtain a plurality of search results, calculating the word-segmentation-based matching degree X_n of each search result, and screening out the N search results that meet the requirements.
2-1, retrieving a plurality of preliminary search results according to the original query text and the target word segments screened out in step S1. Each preliminary search result includes a document-content part and a document-title part. The document-content part is the specific content of the search result. The document-title part is the title or subtitle of the paragraph to which the search result belongs. The preliminary search results are input into the Chinese text word segmentation model and the Chinese text dependency parsing model described in step S1 to obtain the target word segments in each preliminary search result together with their parts of speech and attributes.
2-2, inputting the target word segments of the query text screened out in step S1 and the target word segments extracted from the document-content part of each preliminary search result into an unsupervised matching algorithm to obtain the basic matching degree A_n between the query text and each preliminary search result;
inputting the target word segments of the query text screened out in step S1 and the target word segments extracted from the document-title part of each preliminary search result into a Jaccard similarity matching algorithm to obtain the additional matching degree B_n between the query text and each preliminary search result.
2-3, calculating the word-segmentation-based matching degree between the query text and each preliminary search result as X_n = c·A_n + (0.5-c)·B_n, where c is a first weight coefficient with a value range of 0 to 0.5. A plurality of word-segmentation-based search results are then screened out according to X_n.
S3, using a Bert-ESIM text similarity calculation model to calculate the matching degree Y_n, based on complete semantics, between the query text and each word-segmentation-based search result screened out in step S2. The Bert-ESIM text similarity calculation model comprises an improved ESIM network, which uses a cosine similarity calculator in place of the Softmax component and a Bert Chinese text feature extractor in place of the input encoder.
S4, calculating the composite matching degree between each of the N search results and the query text as Z_n = d·X_n + (0.5-d)·Y_n, where d is a second weight coefficient with a value range of 0 to 0.5. The N search results are sorted by Z_n in descending order and output.
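The score combination in steps 2-2, 2-3 and S4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`jaccard`, `weighted`, `rank`), the dictionary keys and the example weights c = 0.3 and d = 0.2 are assumptions made for illustration. A_n and Y_n are taken as already computed and scaled to [0, 1], and the intermediate screening of N results by X_n is omitted for brevity.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention: two empty sets have similarity 0
    return len(set_a & set_b) / len(union)


def weighted(first, second, weight):
    """Return weight*first + (0.5 - weight)*second, with weight in [0, 0.5]."""
    if not 0.0 <= weight <= 0.5:
        raise ValueError("weight must lie in [0, 0.5]")
    return weight * first + (0.5 - weight) * second


def rank(candidates, query_tokens, c=0.3, d=0.2):
    """Rank candidates by composite matching degree Z_n, largest first.

    Each candidate is a dict with hypothetical keys:
      'id'           : identifier of the search result
      'a'            : basic matching degree A_n (e.g. from BM25 or TF-IDF)
      'title_tokens' : word segments of the document-title part
      'y'            : semantic matching degree Y_n
    """
    ranked = []
    for cand in candidates:
        b_n = jaccard(query_tokens, cand["title_tokens"])  # additional degree B_n
        x_n = weighted(cand["a"], b_n, c)                  # X_n = c*A_n + (0.5-c)*B_n
        z_n = weighted(x_n, cand["y"], d)                  # Z_n = d*X_n + (0.5-d)*Y_n
        ranked.append((cand["id"], z_n))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Because the two weights in each formula sum to 0.5 rather than 1, X_n and Z_n are at most half the larger of their inputs; this affects only the scale of the scores, not the ordering of results.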
Preferably, the attributes of a target word segment include subject, predicate, object, and complement. The parts of speech of a target word segment include noun, verb, adjective, adverb, conjunction, entity word, preposition, quantifier, person name, place name and time;
preferably, in step S1, target word segments that are a subject, a predicate, an object, an entity word, a time, a place, or a quantifier are retained.
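The screening rule of step S1 can be sketched as a simple filter. The token representation below (a dict with `text`, `pos` and `role` keys) and the label strings are hypothetical; the source does not specify the output format of the word segmentation and dependency parsing models.

```python
# Syntactic attributes (roles) and parts of speech retained in step S1.
# The label strings are illustrative assumptions, not the models' real tag set.
KEPT_ROLES = {"subject", "predicate", "object"}
KEPT_POS = {"entity", "time", "place", "quantifier"}


def screen_tokens(tokens):
    """Keep a token if its role or its part of speech is on the retained list."""
    return [
        tok for tok in tokens
        if tok.get("role") in KEPT_ROLES or tok.get("pos") in KEPT_POS
    ]
```

A token such as an adverb acting as a modifier is dropped, while a noun acting as the object of the sentence is kept.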
Preferably, the Chinese text word segmentation model adopts a combined network of a multi-layer Bi-GRU network and a CRF network. The model is trained on Chinese word segmentation data sets comprising cwb2-data, the People's Daily data set, SIGHAN Bakeoff2005 and the MSRA (Microsoft Research Asia) data set. Its input is a Chinese text, and its output is each target word segment in the Chinese text together with the attribute and part of speech of each target word segment.
Preferably, the Chinese text dependency parsing model adopts a combined network of a Bi-layer LSTM network and an MLP network. The model is trained on Chinese dependency parsing data sets comprising SemEval-2016, CoNLL, Penn Treebank and the Baidu open-source data set; its input is the target word segments, and its output is the part of speech and the attribute of each target word segment in the query text.
Preferably, in step 2-1, those target word segments extracted from each preliminary search result that are prepositions, function words or pronouns are filtered out.
Preferably, the rules and regulations database described in step S2 includes rules and regulations data obtained by scanning physical rules and regulations books, and laws and regulations obtained by a web crawler. Scanning the local physical books yields unstructured picture data, which an OCR character recognition model converts into structured rules and regulations data. The OCR character recognition model consists of a text detection model and a text recognition model. The backbone network of the text detection model is MobileNet-small-50. The text recognition model adopts a combined network of a Bi-layer LSTM network and a CTC network. The OCR character recognition model uses ICDAR2019-LSVT, ICDAR2017-RCTW-17, Chinese street view character recognition, Chinese document character recognition and ICDAR2019-ArT as its training set and test set; its input is a picture, and its output is the text content in the picture and the coordinates of the text.
Preferably, the Bert-ESIM text similarity calculation model uses Chinese text matching data sets including CCKS2018, Chinese SNLI MultiNLI, LCQMC, OCNLI and XNLI as its training set and test set.
Preferably, the Bert-ESIM text similarity calculation model includes a Transformer model, a Bert model, an ESIM model, and a cosine similarity calculator. Its input is a text pair, and its output is the matching degree Y_n of the text pair based on complete semantics.
The Bert-ESIM text similarity calculation model is obtained as follows:
First, retrieving the weight parameters of each layer of the Bert model.
Second, initializing each layer of the Bert model with these weight parameters to obtain the Bert Chinese text feature extractor.
Third, replacing the input encoder in the ESIM network with the Bert Chinese text feature extractor.
Fourth, replacing the Softmax component in the ESIM network with a cosine similarity calculator to obtain the Bert-ESIM semi-pre-trained network.
Fifth, fine-tuning, training and testing the Bert-ESIM semi-pre-trained network with the training set and the test set to obtain the Bert-ESIM text similarity calculation model.
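The cosine similarity calculator that takes the place of the Softmax component scores two vectors by the cosine of the angle between them. A minimal sketch in plain Python follows, assuming the two texts have already been encoded as equal-length feature vectors; the function name and the convention of returning 0.0 for a zero vector are illustrative assumptions, not details from the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, in the range [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)
```

The score depends only on direction, not magnitude, so a vector and any positive multiple of it have similarity 1.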
In a second aspect, the invention provides a deep learning-based rules and regulations retrieval system, which comprises a query text receiving module, a rules and regulations document uploading and processing module, a rules and regulations text splitting and warehousing module, a crawler module, an algorithm module, and a rules and regulations retrieval and display module.
The query text receiving module is used for receiving the query text input by the user and performing basic processing on it. The basic processing comprises segmenting the query text into words and obtaining the part of speech and the attribute of each word segment.
The rules and regulations document uploading and processing module is used for receiving and processing rules and regulations documents of different structures uploaded by a user.
The rules and regulations text splitting and warehousing module is used for splitting structured rules and regulations texts into chapters and sections, integrating the text information of each natural paragraph, and storing the standardized text in the database.
The crawler module is used for collecting the texts of laws and regulations published on the Internet.
The algorithm module is used for analyzing the query text, acquiring detailed information about the retrieved text, and converting unstructured data into structured text. The algorithm module comprises a Chinese text word segmentation algorithm, a Chinese text dependency parsing algorithm, an OCR character recognition algorithm, a Bert-ESIM text similarity calculation algorithm, a BM25 algorithm and a TF-IDF algorithm.
The rules and regulations retrieval and display module is used for integrating the query text receiving module and the algorithm module to obtain the required retrieval results, and for displaying the results to the user, sorted in descending order, as a Web page.
The invention has the following beneficial effects:
1. The invention introduces an ESIM network for calculating text similarity. In the improved ESIM network, a cosine similarity calculator replaces the Softmax component, and a Bert Chinese text feature extractor replaces the input encoder of the original network. The core idea of the cosine similarity calculator is to measure the similarity of two vectors by the cosine of the angle θ between them. The Softmax component, by contrast, is usually used for multi-class classification and requires the two vectors to be fused into one input vector, which weakens the differences between them; the cosine similarity calculator can therefore compute text similarity better. Compared with the input encoder of the original network, the Bert Chinese text feature extractor has the advantage that model parameters trained on a large amount of Chinese text data can be used, through transfer learning, as its initial parameters, after which the whole Bert-ESIM model is fine-tuned on a self-built rules and regulations data set. In this way, adopting a more complex encoder (Bert) improves feature extraction while keeping the training time and computation time of the model under control.
2. The invention provides data support for the retrieval of rules and regulations through a self-built rules and regulations database, and also provides a document upload interface, making it easy for users (enterprises, public institutions and the like) to upload their internal rules and regulations and improving the pertinence and matching rate of the retrieval. Based on deep learning, a Chinese text word segmentation model, a Chinese text dependency parsing model, an OCR character recognition model and a Bert-ESIM text similarity calculation model are implemented. These provide a way to convert unstructured data into structured text and supply detailed information about the query text for the subsequent retrieval. Matching between the query text and the retrieval results is performed on the basis of both word segmentation and semantics, which improves the matching quality and ultimately achieves intelligent retrieval of rules and regulations.
Drawings
Fig. 1 is a schematic flow chart of the rules and regulations retrieval method provided in embodiment 1 of the present invention;
Fig. 2 is a schematic block diagram of the structure of the rules and regulations retrieval system provided in embodiment 2 of the present invention.
Detailed Description
In order to make the purpose, technical solution and system structure of the present invention more clearly understood, the present invention will be further described in detail with reference to the accompanying drawings and embodiments. The specific embodiments described herein are merely illustrative of the invention and the scope of the invention is not limited to the following.
Example 1
As shown in Fig. 1, the present embodiment provides a deep learning-based rules and regulations retrieval method, which aims to solve the technical problem that it is difficult to accurately obtain the specific content of the relevant rules and regulations from keywords using a general-purpose search engine. The method specifically includes the following steps:
S1, acquiring a query text provided by a user, and inputting the query text into a Chinese text word segmentation model and a Chinese text dependency parsing model to obtain each target word segment in the query text together with its part of speech and attribute. The attributes of a target word segment include subject, predicate, object, and complement. The parts of speech include noun, verb, adjective, adverb, conjunction, entity word, preposition, quantifier, person name, place name and time. The target word segments are screened according to their parts of speech and attributes, retaining subjects, predicates, objects, entity words, times, places and quantifiers.
The Chinese text word segmentation model adopts a combined network of a multi-layer (three-layer) Bi-GRU network and a CRF network. The model is trained on Chinese word segmentation data sets comprising cwb2-data, the People's Daily data set, SIGHAN Bakeoff2005 and the MSRA (Microsoft Research Asia) data set. Its input is an ordinary Chinese text, and its output is each target word segment in the Chinese text together with the attribute and part of speech (namely noun, verb, adjective, adverb, conjunction, entity word, preposition, quantifier, person name, place name or time) of each target word segment.
The Chinese text dependency parsing model adopts a combined network of a Bi-layer LSTM network and an MLP network. The model is trained on Chinese dependency parsing data sets comprising SemEval-2016, CoNLL, Penn Treebank and the Baidu open-source data set. Its input is the target word segments (obtained by segmenting the Chinese text with the Chinese text word segmentation model), and its output is the part of speech and the attribute of each target word segment within the sentence.
S2, searching a pre-built, self-constructed rules and regulations database according to the original query text and the target word segments screened out in step S1 to obtain N search results and the word-segmentation-based matching degree X_n between each search result and the query text with its target word segments, where N is less than or equal to 100. The specific process is as follows:
2-1, retrieving a plurality of preliminary search results according to the original query text and the target word segments screened out in step S1. Each preliminary search result includes a document-content part (i.e., the specific content of the retrieved document) and a document-title part (i.e., the subtitle of the paragraph to which the retrieved document belongs).
The document-content part and the document-title part of each preliminary search result are each input into the Chinese text word segmentation model described in step S1 to obtain the target word segments in each preliminary search result together with their parts of speech and attributes. Among the target word segments extracted from each preliminary search result, prepositions, function words and pronouns are filtered out.
2-2, inputting the target word segments of the query text (query) screened out in step S1 and the target word segments extracted from the document-content part of each preliminary search result into a traditional unsupervised matching algorithm, BM25 or TF-IDF (the vocabulary of the TF-IDF algorithm is obtained from the self-constructed rules and regulations database), to obtain the basic matching degree A_n between the query and each preliminary search result (document);
inputting the target word segments of the query screened out in step S1 and the target word segments extracted from the document-title part of each preliminary search result into a Jaccard similarity matching algorithm to obtain the additional matching degree B_n between the query and each preliminary search result (document).
2-3, from the basic matching degree A_n and the additional matching degree B_n calculated in step 2-2, calculating by weighted allocation the word-segmentation-based matching degree between the query and each preliminary search result (document) as X_n = c·A_n + (0.5-c)·B_n, where c is a first weight coefficient with a value range of 0 to 0.5; its specific value is determined according to the emphasis and requirements of the actual retrieval. The preliminary results are sorted by X_n in descending order, and the N best word-segmentation-based search results (documents) are screened out, where N is less than or equal to 100.
The self-constructed rules and regulations database includes rules and regulations data obtained by scanning local physical rules and regulations books, and published laws and regulations obtained through a web crawler. Scanning the local physical books yields a large amount of unstructured picture data, which an OCR character recognition model converts into structured rules and regulations data. The OCR character recognition model consists of a text detection model and a text recognition model. The backbone network of the text detection model is MobileNet-small-50. The text recognition model adopts a combined network of a Bi-layer LSTM network and a CTC network. The OCR character recognition model is obtained using ICDAR2019-LSVT, ICDAR2017-RCTW-17, Chinese street view character recognition, Chinese document character recognition, ICDAR2019-ArT and some synthesized data as its training set and test set; its input is a picture, and its output is the text content in the picture and the coordinates of the text.
S3, using the Bert-ESIM model to calculate the text similarity (short text to long text) between the original query text (query) input by the user and the document-content part of each of the N word-segmentation-based search results screened out in step S2, thereby obtaining the matching degree Y_n, based on complete semantics, between each of the N search results and the original query text. The specific process is as follows:
The main network of the Bert-ESIM text similarity calculation model consists of a Transformer model, a Bert model, an ESIM model and a cosine similarity calculator. The model uses open-source Chinese text matching data sets comprising CCKS2018, Chinese SNLI MultiNLI, LCQMC, OCNLI and XNLI as its training set and test set; after training and testing, a usable model is obtained. The input of the model is a text pair, specifically the pair formed by the query and the document-content part of a search result, and the output is the matching degree Y_n, based on complete semantics, between the query and the document-content part of that search result.
The Bert-ESIM text similarity calculation model is obtained as follows:
First, retrieving the weight parameters of each layer of a Bert model trained on a large amount of Chinese text.
Second, initializing each layer of the Bert model with these weight parameters to obtain the Bert Chinese text feature extractor.
Third, replacing the input encoding part of the ESIM network with the Bert Chinese text feature extractor.
Fourth, replacing the Softmax component in the ESIM network with a cosine similarity calculator to obtain the Bert-ESIM semi-pre-trained network.
Fifth, fine-tuning, training and testing the Bert-ESIM semi-pre-trained network with the training set and the test set to obtain the Bert-ESIM text similarity calculation model.
In the Bert-ESIM text similarity calculation model, the basic component for feature extraction is the Transformer component (mainly an Encoder-Decoder structure). Twelve Transformer layers form the Bert Chinese text feature extractor, which then replaces the input encoding part of the original ESIM; finally, a cosine similarity calculator replaces the Softmax component. Because the Bert network is complex and has many parameters, transfer learning is adopted in the specific implementation, and the whole Bert-ESIM network is then fine-tuned and trained on the Chinese text matching data sets to achieve the best effect.
S4, calculating the composite matching degree between each of the N word-segmentation-based search results (documents) and the query text as Z_n = d·X_n + (0.5-d)·Y_n, where d is a second weight coefficient with a value range of 0 to 0.5; its specific value is determined according to the emphasis and requirements of the actual retrieval. The N search results are sorted by Z_n in descending order, returned to the Web front end in that order, and displayed to the user.
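As an illustration of the basic matching degree A_n in step 2-2, the following is a minimal sketch of one common BM25 variant. The parameter values k1 = 1.5 and b = 0.75 and the smoothed inverse document frequency are widely used defaults chosen here as assumptions; the source does not state which variant or parameters are used, and the raw scores would still need to be normalized before being combined with B_n.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query with BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores
```

A document containing none of the query terms scores exactly 0, and rare query terms contribute more than common ones.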
Example 2
As shown in fig. 2, a deep learning-based regulation retrieval system comprises:
A query text receiving module: used for receiving the query text input by a user and performing basic processing on it. The basic processing comprises performing word segmentation and dependency syntax analysis on the query text to obtain the target words of the query text and the part of speech and attributes of each target word.
A regulation document uploading and processing module: used for receiving and processing regulation documents of different structures (TXT, PDF, pictures and the like) uploaded by a user. The module also converts unstructured data (such as pictures) into structured text data by using an OCR character recognition interface.
A regulation text splitting and warehousing module: used for splitting structured regulation text into chapters and sections, integrating the text information of each natural paragraph (the content of the paragraph, the chapter to which it belongs, and the section or chapter subtitle closest to it), and finally storing the standardized text in the database.
A crawler module: used for collecting texts of laws and regulations published on the Internet, providing a data source for building the regulation database. The module mainly collects data from certain specific websites to obtain the corresponding laws and regulations data.
An algorithm module: used for analyzing the query text, acquiring detailed information about the retrieved text, and converting unstructured data into structured text. It comprises a Chinese text word segmentation algorithm, a Chinese text dependency syntax analysis algorithm, an OCR character recognition algorithm, a Bert-ESIM text similarity calculation algorithm, a BM25 algorithm and a TF-IDF algorithm.
A regulation retrieval and display module: used for integrating the query text receiving module and the algorithm module to obtain the required retrieval results, and displaying the retrieval results, sorted in descending order, to the user as a Web page.
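The work of the text splitting and warehousing module described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the heading patterns (lines beginning with "第…章" for chapters and "第…节" for sections), the `ParagraphRecord` class and the function name are all assumptions introduced for the example, and the database write is omitted.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed heading conventions; real regulation documents may differ.
CHAPTER_RE = re.compile(r"^第[一二三四五六七八九十百零\d]+章")
SECTION_RE = re.compile(r"^第[一二三四五六七八九十百零\d]+节")


@dataclass
class ParagraphRecord:
    """Hypothetical record stored for one natural paragraph."""
    content: str          # text of the paragraph
    chapter: str          # chapter the paragraph belongs to
    nearest_heading: str  # closest preceding section or chapter subtitle


def split_regulation(text: str) -> List[ParagraphRecord]:
    """Split structured regulation text into per-paragraph records."""
    records: List[ParagraphRecord] = []
    chapter = ""
    nearest = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if CHAPTER_RE.match(line):
            # A new chapter is also the nearest heading until a section appears.
            chapter = nearest = line
        elif SECTION_RE.match(line):
            nearest = line
        else:
            records.append(ParagraphRecord(line, chapter, nearest))
    return records
```

Keeping the chapter and nearest subtitle with every paragraph is what later allows a retrieval result to carry both a content part and a title part.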
The foregoing describes preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications, and environments may be used within the scope of the concept disclosed herein, whether as described above or as apparent to those skilled in the relevant art. Modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A deep learning-based regulation retrieval method, characterized by comprising: S1, acquiring a query text provided by a user, and inputting the query text into a Chinese text word segmentation model to obtain each target participle in the query text; inputting each target participle into a Chinese text dependency syntax analysis model to obtain the part of speech and the attributes of each target participle; screening the target participles according to the part of speech and the attributes of each target participle;
S2, searching in the regulation database to obtain a plurality of retrieval results, calculating the word-segmentation-based matching degree $X_n$ of each retrieval result, and then screening out N retrieval results meeting the requirements;
2-1, retrieving a plurality of preliminary retrieval results according to the original query text and the target participles screened in step S1; each preliminary retrieval result comprises a document-content part and a document-title part; the document-content part is the specific content of the retrieval result; the document-title part is the title or subtitle of the paragraph to which the retrieval result belongs; inputting each preliminary retrieval result into the Chinese text word segmentation model and the Chinese text dependency syntax analysis model of step S1 to obtain the target participles in each preliminary retrieval result and the part of speech and attributes of those target participles;
2-2, respectively inputting the target participles of the query text screened in step S1 and the target participles extracted from the document-content part of each preliminary retrieval result into an unsupervised matching algorithm to obtain the basic matching degree $A_n$ between the query text and each preliminary retrieval result;
respectively inputting the target participles of the query text screened in step S1 and the target participles extracted from the document-title part of each preliminary retrieval result into a Jaccard similarity matching algorithm to obtain an additional matching degree $B_n$ between the query text and each preliminary retrieval result;
2-3, respectively calculating the word-segmentation-based matching degree $X_n = c \cdot A_n + (0.5 - c) \cdot B_n$ between the query text and each preliminary retrieval result, wherein c is a first weight coefficient with a value range of 0 to 0.5; screening out a plurality of retrieval results according to the word-segmentation-based matching degree $X_n$;
S3, using a Bert-ESIM text similarity calculation model, respectively calculating the matching degree $Y_n$ based on complete semantics between the query text and each retrieval result screened out in step S2; the Bert-ESIM text similarity calculation model comprises an improved ESIM network, in which the Softmax component is replaced with a cosine similarity calculator and the input encoder is replaced with a Bert Chinese text feature extractor;
S4, respectively calculating the composite matching degree $Z_n = d \cdot X_n + (0.5 - d) \cdot Y_n$ between each of the N retrieval results and the query text, wherein d is a second weight coefficient with a value range of 0 to 0.5; sorting the N retrieval results by composite matching degree $Z_n$ from largest to smallest and outputting them.
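By way of illustration only, steps 2-2 and 2-3 of claim 1 can be sketched in Python as below. The function names are hypothetical, and the basic matching degree $A_n$ is assumed to be supplied by whatever unsupervised matching algorithm is used; the sketch only shows the Jaccard computation of $B_n$ and the weighted combination into $X_n$.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # convention chosen here for two empty sets
    return len(sa & sb) / len(sa | sb)


def segmentation_score(
    a_n: float,
    query_tokens: Iterable[str],
    title_tokens: Iterable[str],
    c: float,
) -> float:
    """X_n = c * A_n + (0.5 - c) * B_n, with c restricted to [0, 0.5].

    a_n is the basic matching degree from the document-content part;
    B_n is computed here from the document-title part.
    """
    if not 0.0 <= c <= 0.5:
        raise ValueError("c must lie in the range 0 to 0.5")
    b_n = jaccard(query_tokens, title_tokens)
    return c * a_n + (0.5 - c) * b_n
```

Because Jaccard similarity ignores word order and frequency, it suits short title strings, where a handful of shared participles is already a strong signal.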
2. The deep learning-based regulation retrieval method of claim 1, wherein: the attributes of the target participles comprise subjects, predicates, objects, attributives, adverbials and complements; the part of speech of the target participles comprises nouns, verbs, adjectives, adverbs, conjunctions, entity words, prepositions, quantifiers, names of people, place names and time.
3. The deep learning-based regulation retrieval method of claim 1, wherein: in step S1, target participles belonging to the subject, predicate, object, entity, time, place, or quantifier are retained.
4. The deep learning-based regulation retrieval method of claim 1, wherein: the Chinese text word segmentation model adopts a combined network of a multi-layer Bi-GRU network and a CRF network; the Chinese text word segmentation model is trained on Chinese word segmentation datasets including cwb2-data, the People's Daily dataset, SIGHAN Bakeoff 2005 and the MSRA (Microsoft Research Asia) dataset; the input of the Chinese text word segmentation model is a Chinese text, and the output is each target participle in the Chinese text together with the attributes and part of speech of each target participle.
5. The deep learning-based regulation retrieval method of claim 1, wherein: the Chinese text dependency syntax analysis model adopts a combined network of a double-layer Bi-LSTM network and an MLP network; the Chinese text dependency syntax analysis model is trained on Chinese dependency syntax analysis datasets including SemEval-2016, CoNLL, Penn Treebank and a Baidu open-source dataset; the input of the Chinese text dependency syntax analysis model is the target participles, and the output is the part of speech and attributes of the target participles in the query text.
6. The deep learning-based regulation retrieval method of claim 1, wherein: in step 2-1, prepositions, function words and pronouns are screened out of the target participles extracted from each preliminary retrieval result.
7. The deep learning-based regulation retrieval method of claim 1, wherein: the regulation database described in step S2 includes regulation data obtained by scanning physical regulation books and laws and regulations obtained by a web crawler; scanning the local physical regulation books yields unstructured picture data, which is converted into structured regulation data by an OCR character recognition model; the OCR character recognition model consists of a text detection model and a text recognition model, wherein the backbone network of the text detection model adopts MobileNet-small-50; the text recognition model adopts a combined network of a double-layer Bi-LSTM network and a CTC network; the OCR character recognition model uses ICDAR2019-LSVT, ICDAR2017-RCTW-17, Chinese street view character recognition, Chinese document character recognition and ICDAR2019-ArT as training and test sets; the input of the OCR character recognition model is a picture, and the output is the text content in the picture and the coordinates of the text.
8. The deep learning-based regulation retrieval method of claim 1, wherein: the Bert-ESIM text similarity calculation model uses Chinese text matching datasets including CCKS2018, Chinese SNLI MultiNLI, LCQMC, OCNLI and XNLI as training and test sets.
9. The deep learning-based regulation retrieval method of claim 1, wherein: the Bert-ESIM text similarity calculation model comprises a Transformer model, a Bert model, an ESIM model and a cosine similarity calculator; the input of the Bert-ESIM text similarity calculation model is a text pair, and the output is the matching degree $Y_n$ of the text pair based on complete semantics;
the Bert-ESIM text similarity calculation model is obtained through the following steps:
loading the weight parameters of each layer of the Bert model;
initializing the weight parameters of each layer in the Bert model to obtain a Bert Chinese text feature extractor;
replacing the input encoder in the ESIM network with the Bert Chinese text feature extractor;
replacing the Softmax component in the ESIM network with a cosine similarity calculator to obtain a Bert-ESIM semi-pre-trained network;
and fine-tuning, training and testing the Bert-ESIM semi-pre-trained network with the training set and the test set to obtain the Bert-ESIM text similarity calculation model.
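The cosine similarity calculator that takes the place of the Softmax component can be illustrated with a minimal Python sketch. It assumes that each text of the pair has already been reduced to a fixed-length feature vector; how the Bert feature extractor and the ESIM layers produce those vectors is not shown, and the function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for a zero vector
    return dot / (norm_u * norm_v)
```

Unlike a Softmax classifier, which outputs class probabilities, a cosine output is a single continuous score, which is the form needed for a matching degree such as $Y_n$.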
10. A deep learning-based regulation retrieval system, comprising a query text receiving module, a regulation document uploading and processing module, a regulation text splitting and warehousing module, a crawler module, an algorithm module and a regulation retrieval and display module, characterized in that: the query text receiving module is used for receiving a query text input by a user and performing basic processing on the query text; the basic processing comprises segmenting the query text into words and acquiring the part of speech and attributes of each participle;
the regulation document uploading and processing module is used for receiving and processing regulation documents of different structures uploaded by the user;
the regulation text splitting and warehousing module is used for splitting structured regulation texts into chapters and sections, integrating the text information of each natural paragraph, and finally storing the standardized texts in the database;
the crawler module is used for collecting texts of laws and regulations published on the Internet;
the algorithm module is used for analyzing the query text, acquiring detailed information about the retrieved text, and converting unstructured data into structured text; the algorithm module comprises a Chinese text word segmentation algorithm, a Chinese text dependency syntax analysis algorithm, an OCR character recognition algorithm, a Bert-ESIM text similarity calculation algorithm, a BM25 algorithm and a TF-IDF algorithm;
the regulation retrieval and display module is used for integrating the query text receiving module and the algorithm module to obtain the required retrieval results, and displaying the retrieval results, sorted in descending order, to the user as a Web page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110686425.2A CN113535936B (en) | 2021-06-21 | 2021-06-21 | Deep learning-based regulation system retrieval method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113535936A (en) | 2021-10-22 |
CN113535936B (en) | 2024-02-13 |
Family
ID=78125503
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202110686425.2A (CN113535936B) | Active | 2021-06-21 | 2021-06-21 | Deep learning-based regulation system retrieval method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113535936B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114676306A (en) * | 2022-03-28 | 2022-06-28 | 河南经贸职业学院 | Computer information sieving mechanism based on artificial intelligence |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020052871A1 (en) * | 2000-11-02 | 2002-05-02 | Simpleact Incorporated | Chinese natural language query system and method |
WO2015080561A1 (en) * | 2013-11-27 | 2015-06-04 | Mimos Berhad | A method and system for automated relation discovery from texts |
CN109271626A (en) * | 2018-08-31 | 2019-01-25 | 北京工业大学 | Text semantic analysis method |
CN109871468A (en) * | 2019-02-01 | 2019-06-11 | 国网四川省电力公司广元供电公司 | Non-structured document management and rules and regulations entry management integration system |
Non-Patent Citations (2)
Title |
---|
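刘玉林;郭雅娟;陈锦铭;陈昊: "基于自然语言处理技术的电网招标资料查重系统研制" [Development of a duplicate-checking system for power grid bidding documents based on natural language processing technology], 电力信息与通信技术, no. 05, 15 May 2018 (2018-05-15) * |
张达夫: "基于依存关系匹配的长难查询处理" [Processing of long and difficult queries based on dependency relation matching], 电脑知识与技术, no. 19, 5 July 2012 (2012-07-05) * |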
Also Published As
Publication number | Publication date |
---|---|
CN113535936B (en) | 2024-02-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||