CN113535936A - Deep learning-based rules and regulations retrieval method and system

Info

Publication number
CN113535936A
Authority
CN
China
Prior art keywords
text
regulation
model
chinese
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110686425.2A
Other languages
Chinese (zh)
Other versions
CN113535936B (en)
Inventor
彭艳宏
杨攀
柯旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Chuling Data Technology Co ltd
Original Assignee
Hangzhou Chuling Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Chuling Data Technology Co ltd
Priority to CN202110686425.2A
Publication of CN113535936A
Application granted
Publication of CN113535936B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a deep learning-based rules and regulations retrieval method and system. The method comprises the following steps: 1. acquiring a query text input by a user; 2. obtaining the target words of the query text and their attributes; 3. constructing a rules and regulations database; 4. searching the rules and regulations database according to the target words and their attributes, and calculating a word-based matching degree X_n; 5. calculating a semantics-based matching degree Y_n; 6. calculating a composite matching degree Z_n from X_n and Y_n; 7. obtaining a plurality of retrieval results sorted in descending order according to the composite matching degree Z_n, the target-word attributes of the query text, and the specific hierarchical relations within the rules and regulations. On the basis of deep learning, the method implements a Chinese text word segmentation model, a Chinese text dependency syntax analysis model, an OCR character recognition model, and an ESIM text similarity calculation model, and achieves fast and accurate retrieval of rules and regulations.

Description

Deep learning-based rules and regulations retrieval method and system
Technical Field
The invention relates to the field of computer technology, and in particular to a deep learning-based rules and regulations retrieval method and system.
Background
Current rules and regulations (national laws and regulations, provincial regulations, and enterprise rules) are so numerous that it is difficult for an ordinary person to become familiar with them or to apply them quickly when needed. Existing general-purpose search engines are not specifically optimized for regulation retrieval: their semantic analysis is biased and their retrieval quality is poor. Specifically, they lack a professional, comprehensive regulation database and retrieval matching at the semantic level. Developing an intelligent retrieval method and system that handles a query word or sentence on the basis of an existing regulation library and deep learning therefore has high practical significance and application value.
Disclosure of Invention
In view of the above, the present invention provides a deep learning-based rules and regulations retrieval method and system, which aims to solve the technical problems that it is difficult to accurately obtain the specific content of the relevant regulation from keywords through a general-purpose search engine and that the retrieved results are poorly relevant.
To achieve the above purpose, in a first aspect, the invention provides a deep learning-based rules and regulations retrieval method comprising the following steps:
S1. Acquire a query text provided by a user and input it into a Chinese text word segmentation model to obtain each target word (segmented token) in the query text; input each target word into a Chinese text dependency syntax analysis model to obtain its part of speech and attribute. Screen the target words according to the part of speech and attribute of each.
S2. Search the rules and regulations database to obtain a plurality of search results, calculate the word-based matching degree X_n of each search result, and screen out the N search results that meet the requirements.
2-1. Retrieve a plurality of preliminary search results according to the original query text and the target words screened in step S1. Each preliminary search result includes a document-content part and a document-title part. The document-content part is the specific content of the search result. The document-title part is the title or subtitle of the paragraph to which the search result belongs. Input the preliminary search results into the Chinese text word segmentation model and the Chinese text dependency syntax analysis model described in step S1 to obtain the target words in each preliminary search result together with their parts of speech and attributes.
2-2. Input the target words of the query text screened in step S1 and the target words extracted from the document-content part of each preliminary search result into an unsupervised matching algorithm to obtain the basic matching degree A_n between the query text and each preliminary search result.
Input the target words of the query text screened in step S1 and the target words extracted from the document-title part of each preliminary search result into a Jaccard similarity matching algorithm to obtain the additional matching degree B_n between the query text and each preliminary search result.
2-3. Calculate the word-based matching degree between the query text and each preliminary search result as X_n = c·A_n + (0.5 - c)·B_n, where c is a first weight coefficient with a value range of 0 to 0.5. Screen out a plurality of word-based search results according to X_n.
S3. Using the Bert-ESIM text similarity calculation model, calculate the complete-semantics-based matching degree Y_n between the query text and each word-based search result screened in step S2. The Bert-ESIM text similarity calculation model comprises a modified ESIM network, which uses a cosine similarity calculator in place of the Softmax component and a Bert Chinese text feature extractor in place of the input encoder.
S4. Calculate the composite matching degree between each of the N search results and the query text as Z_n = d·X_n + (0.5 - d)·Y_n, where d is a second weight coefficient with a value range of 0 to 0.5. Sort the N search results by Z_n in descending order and output them.
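Substituting the word-based score into the composite score makes the effective weight of each signal explicit. The derivation below is a sketch: the numeric values are illustrative only, and it assumes that A_n, B_n and Y_n have been scaled to the interval [0, 1], which the text does not state (BM25 scores, for example, are not bounded in that way).

```latex
\begin{aligned}
X_n &= c\,A_n + (0.5 - c)\,B_n, \qquad 0 \le c \le 0.5,\\
Z_n &= d\,X_n + (0.5 - d)\,Y_n, \qquad 0 \le d \le 0.5,\\
Z_n &= dc\,A_n + d\,(0.5 - c)\,B_n + (0.5 - d)\,Y_n .
\end{aligned}

% Illustrative values: c = 0.3, d = 0.25, A_n = 0.8, B_n = 0.5, Y_n = 0.6
\begin{aligned}
X_n &= 0.3 \cdot 0.8 + 0.2 \cdot 0.5 = 0.34,\\
Z_n &= 0.25 \cdot 0.34 + 0.25 \cdot 0.6 = 0.235 .
\end{aligned}
```

Because each pair of weights sums to 0.5 rather than 1, Z_n stays at or below 0.5 under the [0, 1] assumption; only the relative order of the Z_n values matters for ranking.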
Preferably, the attributes of a target word include subject, predicate, object, and complement. The parts of speech of a target word include noun, verb, adjective, adverb, conjunction, entity word, preposition, quantifier, person name, place name, and time.
Preferably, in step S1, target words that are a subject, predicate, object, entity word, time, place, or quantifier are retained.
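The screening rule in step S1 can be sketched as a simple filter. This is a minimal illustration, not the patented models: the `TargetWord` type, the label strings, and the sample sentence are hypothetical, and a real segmentation and dependency-parsing model would emit its own tag set.

```python
from dataclasses import dataclass

# Hypothetical label names; the text lists the categories but not the tag strings.
KEPT_ATTRIBUTES = {"subject", "predicate", "object"}
KEPT_POS = {"entity", "time", "place", "quantifier"}


@dataclass(frozen=True)
class TargetWord:
    text: str
    pos: str        # part of speech, e.g. "noun", "verb", "quantifier"
    attribute: str  # syntactic role, e.g. "subject", "predicate"


def screen_target_words(words):
    """Keep words whose syntactic role or part of speech is in the retained sets."""
    return [
        w for w in words
        if w.attribute in KEPT_ATTRIBUTES or w.pos in KEPT_POS
    ]


# Hand-labelled sample standing in for the output of the two models.
words = [
    TargetWord("员工", "noun", "subject"),
    TargetWord("在", "preposition", "other"),
    TargetWord("请假", "verb", "predicate"),
    TargetWord("三天", "quantifier", "complement"),
]
kept = screen_target_words(words)
```

Here the preposition is dropped, while the quantifier survives on its part of speech even though its syntactic role (complement) is not in the retained set.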
Preferably, the Chinese text word segmentation model uses a combined network of a multi-layer Bi-GRU network and a CRF network. It is trained on Chinese word segmentation datasets including cwb2-data, the People's Daily dataset, SIGHAN Bakeoff 2005, and the MSRA (Microsoft Research Asia) dataset. Its input is a Chinese text, and its output is each target word in the text together with the attribute and part of speech of each target word.
Preferably, the Chinese text dependency syntax analysis model uses a combined network of a Bi-layer LSTM network and an MLP network. It is trained on Chinese dependency parsing datasets including SemEval-2016, CoNLL, Penn Treebank, and the Baidu open-source dataset. Its input is the target words, and its output is the part of speech and attribute of each target word in the query text.
Preferably, in step 2-1, the target words extracted from each preliminary search result that are prepositions, function words, or pronouns are filtered out.
Preferably, the rules and regulations database described in step S2 includes rules and regulations data obtained by scanning physical rule books and laws and regulations obtained by web crawlers. Scanning the local physical rule books yields unstructured picture data, which an OCR character recognition model converts into structured rules and regulations data. The OCR character recognition model consists of a text detection model and a text recognition model. The backbone network of the text detection model is MobileNet-small-50. The text recognition model uses a combined network of a Bi-layer LSTM network and a CTC network. The OCR character recognition model uses ICDAR2019-LSVT, ICDAR2017-RCTW-17, Chinese street view character recognition, Chinese document character recognition, and ICDAR2019-ArT as its training and test sets. Its input is a picture, and its output is the text content of the picture and the coordinates of the text.
Preferably, the Bert-ESIM text similarity calculation model uses Chinese text matching datasets including CCKS2018, Chinese SNLI MultiNLI, LCQMC, OCNLI, and XNLI as its training and test sets.
Preferably, the Bert-ESIM text similarity calculation model includes a Transformer model, a Bert model, an ESIM model, and a cosine similarity calculator. Its input is a text pair, and its output is the complete-semantics-based matching degree Y_n of the text pair.
The Bert-ESIM text similarity calculation model is obtained as follows:
(1) Load the per-layer weight parameters of the Bert model.
(2) Initialize the weight parameters of each layer of the Bert model with the loaded parameters to obtain the Bert Chinese text feature extractor.
(3) Replace the input encoder of the ESIM network with the Bert Chinese text feature extractor.
(4) Replace the Softmax component of the ESIM network with a cosine similarity calculator to obtain the Bert-ESIM semi-pre-trained network.
(5) Fine-tune, train, and test the Bert-ESIM semi-pre-trained network with the training set and the test set to obtain the Bert-ESIM text similarity calculation model.
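The scoring head that replaces the Softmax component can be illustrated in isolation. The sketch below is not the Bert-ESIM network: the token vectors are made-up two-dimensional stand-ins for encoder outputs, and mean pooling is an assumption introduced here because the text does not say how each text is reduced to a single vector.

```python
import math


def mean_pool(token_vectors):
    """Average a non-empty list of equal-length token vectors into one vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "encoder outputs" for a query (two tokens) and a document (one token).
query_vec = mean_pool([[1.0, 0.0], [0.0, 1.0]])
doc_vec = mean_pool([[2.0, 2.0]])
y_n = cosine_similarity(query_vec, doc_vec)
```

Unlike a Softmax classifier, this head keeps the two text representations separate until the final comparison, and its output lies in [-1, 1].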
In a second aspect, the invention provides a deep learning-based rules and regulations retrieval system, which comprises a query text receiving module, a regulation document uploading and processing module, a regulation text splitting and warehousing module, a crawler module, an algorithm module, and a regulation retrieval and display module.
The query text receiving module receives the query text input by the user and performs basic processing on it. The basic processing comprises segmenting the query text into words and obtaining the part of speech and attribute of each word.
The regulation document uploading and processing module receives and processes regulation documents of different structures uploaded by users.
The regulation text splitting and warehousing module splits structured regulation text into chapters and sections, integrates the text information of each natural paragraph, and stores the standardized text in the database.
The crawler module collects the texts of laws and regulations published on the Internet.
The algorithm module analyzes the query text, obtains detailed information about the retrieved text, and converts unstructured data into structured text. It comprises a Chinese text word segmentation algorithm, a Chinese text dependency syntax analysis algorithm, an OCR character recognition algorithm, a Bert-ESIM text similarity calculation algorithm, a BM25 algorithm, and a TF-IDF algorithm.
The regulation retrieval and display module integrates the query text receiving module and the algorithm module to obtain the required retrieval results and displays the results, sorted in descending order, to the user as a Web page.
The invention has the following beneficial effects:
1. The invention introduces an ESIM network for calculating text similarity. In the modified ESIM network, a cosine similarity calculator replaces the Softmax component, and a Bert Chinese text feature extractor replaces the input encoder of the original network. The cosine similarity calculator measures the similarity of two vectors by the cosine of the angle θ between them. The Softmax component, by contrast, is usually used for multi-class classification and requires the two vectors to be fused into a single input vector, which weakens the difference between them; the cosine similarity calculator therefore computes text similarity better. Compared with the input encoder of the original network, the Bert Chinese text feature extractor can use model parameters trained on a large amount of Chinese text as its initial parameters through transfer learning, after which the whole Bert-ESIM model is fine-tuned on a self-built rules and regulations dataset. As a result, the more complex encoder (Bert) improves feature extraction while the training time and computation time of the model remain under control.
2. The invention provides data support for rules and regulations retrieval through a self-built rules and regulations database, and also provides a document uploading interface so that users (enterprises, public institutions, and the like) can upload their internal rules and regulations, which improves the pertinence and matching rate of retrieval. Based on deep learning, the invention implements a Chinese text word segmentation model, a Chinese text dependency syntax analysis model, an OCR character recognition model, and a Bert-ESIM text similarity calculation model. These provide a way to convert unstructured data into structured text and supply detailed information about the query text for subsequent retrieval. The query text is matched against the retrieval results at both the word level and the semantic level, which improves matching quality and ultimately achieves intelligent retrieval of rules and regulations.
Drawings
Fig. 1 is a schematic flow chart of the rules and regulations retrieval method provided in embodiment 1 of the present invention;
Fig. 2 is a schematic block diagram of the structure of the rules and regulations retrieval system provided in embodiment 2 of the present invention.
Detailed Description
To make the purpose, technical solution, and system structure of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described herein merely illustrate the invention and do not limit its scope.
Example 1
As shown in Fig. 1, this embodiment provides a deep learning-based rules and regulations retrieval method, which aims to solve the technical problem that it is difficult to accurately obtain the specific content of the relevant regulation from keywords using a general-purpose search engine. The method comprises the following steps:
S1. Acquire a query text provided by a user and input it into the Chinese text word segmentation model and the Chinese text dependency syntax analysis model to obtain each target word in the query text together with its part of speech and attribute. The attributes of a target word include subject, predicate, object, and complement. The parts of speech include noun, verb, adjective, adverb, conjunction, entity word, preposition, quantifier, person name, place name, and time. Screen the target words according to the part of speech and attribute of each, retaining subjects, predicates, objects, entity words, times, places, and quantifiers.
The Chinese text word segmentation model uses a combined network of a multi-layer (three-layer) Bi-GRU network and a CRF network. It is trained on Chinese word segmentation datasets including cwb2-data, the People's Daily dataset, SIGHAN Bakeoff 2005, and the MSRA (Microsoft Research Asia) dataset. Its input is ordinary Chinese text, and its output is each target word in the text together with the attribute and part of speech of each target word (that is, noun, verb, adjective, adverb, conjunction, entity word, preposition, quantifier, person name, place name, or time).
The Chinese text dependency syntax analysis model uses a combined network of a Bi-layer LSTM network and an MLP network. It is trained on Chinese dependency parsing datasets including SemEval-2016, CoNLL, Penn Treebank, and the Baidu open-source dataset. Its input is the target words (obtained by segmenting the Chinese text with the Chinese text word segmentation model), and its output is the part of speech and attribute of each target word within the sentence.
S2. Search a pre-built, self-constructed rules and regulations database according to the original query text and the target words screened in step S1 to obtain N search results and the word-based matching degree X_n between each search result and the query text and its target words, where N ≤ 100. The specific process is as follows:
2-1. Retrieve a plurality of preliminary search results according to the original query text and the target words screened in step S1. Each preliminary search result includes a document-content part (the specific content of the retrieved document) and a document-title part (the subtitle of the paragraph to which the retrieved document belongs).
The document-content part and the document-title part of each preliminary search result are respectively input into the Chinese text word segmentation model described in step S1 to obtain the target words in each preliminary search result together with their parts of speech and attributes. Target words that are prepositions, function words, or pronouns are then filtered out.
2-2. Input the target words of the query text (query) screened in step S1 and the target words extracted from the document-content part of each preliminary search result into a traditional unsupervised matching algorithm, BM25 or TF-IDF (the vocabulary of the TF-IDF algorithm is obtained from the self-built rules and regulations database), to obtain the basic matching degree A_n between the query and each preliminary search result (document).
Input the target words of the query screened in step S1 and the target words extracted from the document-title part of each preliminary search result into a Jaccard similarity matching algorithm to obtain the additional matching degree B_n between the query and each preliminary search result document.
2-3. From the basic matching degree A_n and the additional matching degree B_n calculated in step 2-2, use a weighted combination to calculate the word-based matching degree between the query and each preliminary search result document as X_n = c·A_n + (0.5 - c)·B_n, where c is a first weight coefficient with a value range of 0 to 0.5; its specific value is determined by the emphasis and requirements of the actual retrieval. Sort by X_n in descending order and screen out the N best word-based search result documents, where N ≤ 100.
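Steps 2-2 and 2-3 can be sketched in plain Python. The sketch assumes the Okapi BM25 variant with the common defaults k1 = 1.5 and b = 0.75 and a smoothed IDF, none of which the text specifies; the tokenized sample documents and the weight c = 0.3 are likewise made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document in docs against the query tokens."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def jaccard(a, b):
    """Jaccard similarity of two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def word_based_match(a_n, b_n, c=0.3):
    """X_n = c*A_n + (0.5 - c)*B_n with 0 <= c <= 0.5."""
    if not 0.0 <= c <= 0.5:
        raise ValueError("c must lie in [0, 0.5]")
    return c * a_n + (0.5 - c) * b_n


query = ["员工", "请假"]
contents = [["员工", "请假", "须", "提前", "申请"], ["公司", "报销", "流程"]]
titles = [["请假", "管理"], ["报销", "管理"]]

a = bm25_scores(query, contents)            # basic matching degree A_n
b_scores = [jaccard(query, t) for t in titles]  # additional matching degree B_n
x = [word_based_match(a_i, b_i) for a_i, b_i in zip(a, b_scores)]
```

BM25 scores are unbounded while Jaccard lies in [0, 1], so in practice the two may need to be brought onto a comparable scale before weighting; the text leaves that choice open.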
The self-built rules and regulations database includes rules and regulations data obtained by scanning local physical rule books and public laws and regulations obtained through a web crawler. Scanning the local physical rule books yields a large amount of unstructured picture data, which an OCR character recognition model converts into structured rules and regulations data. The OCR character recognition model consists of a text detection model and a text recognition model. The backbone network of the text detection model is MobileNet-small-50. The text recognition model uses a combined network of a Bi-layer LSTM network and a CTC network. The OCR character recognition model is trained and tested on ICDAR2019-LSVT, ICDAR2017-RCTW-17, Chinese street view character recognition, Chinese document character recognition, ICDAR2019-ArT, and some synthetic data. Its input is a picture, and its output is the text content of the picture and the coordinates of the text.
S3. Using the Bert-ESIM model, calculate the text similarity (short text to long text) between the original query text (query) input by the user and the document-content part of each of the N word-based search results screened in step S2, obtaining the complete-semantics-based matching degree Y_n between each of the N word-based search results and the original query text. The specific process is as follows:
The main network of the Bert-ESIM text similarity calculation model consists of a Transformer model, a Bert model, an ESIM model, and a cosine similarity calculator. The model uses open-source Chinese text matching datasets including CCKS2018, Chinese SNLI MultiNLI, LCQMC, OCNLI, and XNLI as its training and test sets; a usable model is obtained after training and testing. Its input is a text pair, specifically the pair formed by the query text (query) and the document-content part of a search result, and its output is the complete-semantics-based matching degree Y_n between the two.
The Bert-ESIM text similarity calculation model is obtained as follows:
(1) Load the per-layer weight parameters of a Bert model trained on a large amount of Chinese text.
(2) Initialize the weight parameters of each layer of the Bert model with the loaded parameters to obtain the Bert Chinese text feature extractor.
(3) Replace the input encoding part of the ESIM network with the Bert Chinese text feature extractor.
(4) Replace the Softmax component of the ESIM network with a cosine similarity calculator to obtain the Bert-ESIM semi-pre-trained network.
(5) Fine-tune, train, and test the Bert-ESIM semi-pre-trained network with the training set and the test set to obtain the Bert-ESIM text similarity calculation model.
In the Bert-ESIM text similarity calculation model, the basic feature-extraction component is the Transformer component (the Transformer is mainly an Encoder-Decoder structure). Twelve Transformer layers form the Bert Chinese text feature extractor, which then replaces the input encoding part of the original ESIM; finally, a cosine similarity calculator replaces the Softmax component. Because the Bert network is complex and has many parameters, transfer learning is adopted in the specific implementation, and the whole Bert-ESIM network is then fine-tuned and trained on the Chinese text matching datasets to achieve the best results.
S4. Calculate the composite matching degree between each of the N word-based search result documents and the query text as Z_n = d·X_n + (0.5 - d)·Y_n, where d is a second weight coefficient with a value range of 0 to 0.5; its specific value is determined by the emphasis and requirements of the actual retrieval. Sort the N search results by Z_n in descending order, return them to the Web front end in that order, and display them to the user.
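The two-stage ranking of steps S2 to S4 can be sketched as a shortlist followed by a re-rank. The function name `rerank`, the document identifiers, and all scores below are hypothetical; `semantic_score` stands in for whatever model produces Y_n and is passed in as a plain callable.

```python
def rerank(candidates, semantic_score, d=0.25, top_n=100):
    """Shortlist by word-based score X_n, then sort by Z_n = d*X_n + (0.5 - d)*Y_n.

    candidates: list of (doc_id, x_n) pairs.
    semantic_score: callable mapping doc_id to the semantic score Y_n.
    """
    if not 0.0 <= d <= 0.5:
        raise ValueError("d must lie in [0, 0.5]")
    # Stage 1: keep only the top_n candidates by word-based matching degree.
    shortlist = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]
    # Stage 2: the (expensive) semantic model runs on the shortlist only.
    scored = [
        (doc_id, d * x_n + (0.5 - d) * semantic_score(doc_id))
        for doc_id, x_n in shortlist
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores: "c" has the best semantic score but misses the shortlist.
candidates = [("a", 0.40), ("b", 0.30), ("c", 0.10)]
y = {"a": 0.2, "b": 0.9, "c": 0.95}
ranked = rerank(candidates, y.get, d=0.25, top_n=2)
```

The example shows the trade-off of the design: the semantic model is evaluated on at most N documents, so a document with a weak word-level match never reaches it, however close it is semantically.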
Example 2
As shown in Fig. 2, a deep learning-based rules and regulations retrieval system comprises:
Query text receiving module: receives the query text input by the user and performs basic processing on it. The basic processing comprises word segmentation and dependency syntax analysis of the query text to obtain its target words and their parts of speech and attributes.
Regulation document uploading and processing module: receives and processes regulation documents of different structures (TXT, PDF, pictures, and the like) uploaded by users. The module also converts unstructured data (such as pictures) into structured text data through an OCR character recognition interface.
Regulation text splitting and warehousing module: splits structured regulation text into chapters and sections, integrates the text information of each natural paragraph (the content of the paragraph, the chapter to which it belongs, and the subtitle of the nearest section or chapter), and stores the standardized text in the database.
Crawler module: collects the texts of laws and regulations published on the Internet to provide a data source for building the rules and regulations database. The module mainly collects data from certain specific websites to obtain the corresponding laws and regulations data.
Algorithm module: analyzes the query text, obtains detailed information about the retrieved text, and converts unstructured data into structured text. It comprises a Chinese text word segmentation algorithm, a Chinese text dependency syntax analysis algorithm, an OCR character recognition algorithm, a Bert-ESIM text similarity calculation algorithm, a BM25 algorithm, and a TF-IDF algorithm.
Regulation retrieval and display module: integrates the query text receiving module and the algorithm module to obtain the required retrieval results and displays the results, sorted in descending order, to the user as a Web page.
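The splitting performed by the regulation text splitting and warehousing module can be sketched as a single pass over the lines of a document. The heading patterns below (第…章 for chapters, 第…节 for sections) and the record field names are assumptions about how a typical Chinese regulation is laid out; the text does not specify them.

```python
import re

# Assumed heading conventions; real documents may need additional patterns.
CHAPTER = re.compile(r"^第[一二三四五六七八九十百零\d]+章")
SECTION = re.compile(r"^第[一二三四五六七八九十百零\d]+节")


def split_regulation(text):
    """Return one record per natural paragraph with its chapter and nearest subtitle."""
    records = []
    chapter = None
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if CHAPTER.match(line):
            chapter, section = line, None   # a new chapter resets the section
        elif SECTION.match(line):
            section = line
        else:
            records.append({
                "content": line,
                "chapter": chapter,
                "title": section or chapter,  # nearest section, else chapter
            })
    return records


sample = (
    "第一章 总则\n"
    "第一条 为规范考勤管理，制定本制度。\n"
    "第二章 请假\n"
    "第一节 事假\n"
    "第五条 员工请事假须提前申请。\n"
)
records = split_regulation(sample)
```

Each record carries the fields that the retrieval steps treat as the document-content part (`content`) and the document-title part (`title`).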
The foregoing illustrates preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise forms disclosed herein; various other combinations, modifications, and environments may be used within the scope of the concept disclosed herein, whether as described above or as apparent to those skilled in the relevant art. Modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A deep learning-based rules and regulations retrieval method, characterized by comprising: S1, acquiring a query text provided by a user, and inputting the query text into a Chinese text word segmentation model to obtain each target segmented word in the query text; inputting each target segmented word into a Chinese text dependency syntax analysis model to obtain the part of speech and the attribute of each target segmented word; screening the target segmented words according to the part of speech and the attribute of each target segmented word;
S2, searching the regulation database to obtain a plurality of retrieval results, calculating the word-segmentation-based matching degree X_n of each retrieval result, and then screening out N retrieval results that meet the requirements;
2-1, retrieving a plurality of preliminary retrieval results according to the original query text and the target segmented words screened in step S1; each preliminary retrieval result comprises a document-content part and a document-title part; the document-content part is the specific content of the retrieval result; the document-title part is the title or subtitle of the paragraph to which the retrieval result belongs; inputting each preliminary retrieval result into the Chinese text word segmentation model and the Chinese text dependency syntax analysis model of step S1 to obtain the target segmented words in each preliminary retrieval result together with their parts of speech and attributes;
2-2, inputting the target segmented words of the query text screened in step S1 and the target segmented words extracted from the document-content part of each preliminary retrieval result into an unsupervised matching algorithm to obtain the basic matching degree A_n between the query text and each preliminary retrieval result;
inputting the target segmented words of the query text screened in step S1 and the target segmented words extracted from the document-title part of each preliminary retrieval result into a Jaccard similarity matching algorithm to obtain the additional matching degree B_n between the query text and each preliminary retrieval result;
2-3, calculating the word-segmentation-based matching degree between the query text and each preliminary retrieval result as X_n = c·A_n + (0.5-c)·B_n, wherein c is a first weight coefficient with a value range of 0 to 0.5; screening out a plurality of word-segmentation-based retrieval results according to the matching degree X_n;
S3, using a Bert-ESIM text similarity calculation model, calculating the complete-semantics-based matching degree Y_n between the query text and each word-segmentation-based retrieval result screened out in step S2; the Bert-ESIM text similarity calculation model comprises an improved ESIM network, in which the Softmax component is replaced with a cosine similarity calculator and the input encoder is replaced with a Bert Chinese text feature extractor;
S4, calculating the composite matching degree between each of the N retrieval results and the query text as Z_n = d·X_n + (0.5-d)·Y_n, wherein d is a second weight coefficient with a value range of 0 to 0.5; sorting the N retrieval results by composite matching degree Z_n in descending order and outputting them.
2. The deep learning-based rules and regulations retrieval method of claim 1, wherein: the attributes of the target segmented words comprise subject, predicate, object, attributive, adverbial, and complement; the parts of speech of the target segmented words comprise nouns, verbs, adjectives, adverbs, conjunctions, entity words, prepositions, quantifiers, personal names, place names, and time expressions.
3. The deep learning-based rules and regulations retrieval method of claim 1, wherein: in step S1, target segmented words that are a subject, predicate, object, entity, time, place, or quantifier are retained.
4. The deep learning-based rules and regulations retrieval method of claim 1, wherein: the Chinese text word segmentation model adopts a combined network of a multi-layer Bi-GRU network and a CRF network; the Chinese text word segmentation model is trained on Chinese word segmentation datasets comprising cwb2-data, the People's Daily dataset, SIGHAN Bakeoff 2005, and the MSRA (Microsoft Research Asia) dataset; the input of the Chinese text word segmentation model is a Chinese text, and the output is each target segmented word in the Chinese text together with its attribute and part of speech.
5. The deep learning-based rules and regulations retrieval method of claim 1, wherein: the Chinese text dependency syntax analysis model adopts a combined network of a double-layer Bi-LSTM network and an MLP network; the Chinese text dependency syntax analysis model is trained on Chinese dependency syntax analysis datasets comprising SemEval-2016, CoNLL, Penn Treebank, and the Baidu open-source dataset; the input of the Chinese text dependency syntax analysis model is the target segmented words, and the output is the part of speech and the attribute of each target segmented word in the query text.
6. The deep learning-based rules and regulations retrieval method of claim 1, wherein: in step 2-1, prepositions, function words, and pronouns among the target segmented words extracted from each preliminary retrieval result are filtered out.
7. The deep learning-based rules and regulations retrieval method of claim 1, wherein: the regulation database described in step S2 comprises rules and regulations data obtained by scanning physical rulebooks, and laws and regulations obtained by a web crawler; scanning the local physical rulebooks yields unstructured image data; the unstructured image data is converted into structured regulation data by an OCR character recognition model; the OCR character recognition model consists of a text detection model and a text recognition model, wherein the backbone network of the text detection model adopts MobileNet-small-50, and the text recognition model adopts a combined network of a double-layer Bi-LSTM network and a CTC network; the OCR character recognition model uses ICDAR2019-LSVT, ICDAR2017-RCTW-17, Chinese street view character recognition, Chinese document character recognition, and ICDAR2019-ArT as its training set and test set; the input of the OCR character recognition model is an image, and the output is the text content in the image and the coordinates of the text.
8. The deep learning-based rules and regulations retrieval method of claim 1, wherein: the Bert-ESIM text similarity calculation model uses Chinese text matching datasets comprising CCKS2018, Chinese SNLI/MultiNLI, LCQMC, OCNLI, and XNLI as its training set and test set.
9. The deep learning-based rules and regulations retrieval method of claim 1, wherein: the Bert-ESIM text similarity calculation model comprises a Transformer model, a Bert model, an ESIM model, and a cosine similarity calculator; the input of the Bert-ESIM text similarity calculation model is a text pair, and the output is the complete-semantics-based matching degree Y_n of the text pair;
the Bert-ESIM text similarity calculation model is obtained by the following steps:
retrieving the weight parameters of each layer of the Bert model;
initializing the weight parameters of each layer in the Bert model to obtain a Bert Chinese text feature extractor;
replacing the input encoder in the ESIM network with the Bert Chinese text feature extractor;
replacing the Softmax component in the ESIM network with a cosine similarity calculator to obtain a Bert-ESIM semi-pre-trained network;
and fine-tuning, training, and testing the Bert-ESIM semi-pre-trained network with the training set and the test set to obtain the Bert-ESIM text similarity calculation model.
10. A deep learning-based rules and regulations retrieval system, comprising a query text receiving module, a rules and regulations document uploading and processing module, a rules and regulations text splitting and warehousing module, a crawler module, an algorithm module, and a rules and regulations retrieval and display module, characterized in that: the query text receiving module is used for receiving a query text input by a user and performing basic processing on the query text; the basic processing comprises segmenting the query text into words and obtaining the part of speech and the attribute of each segmented word;
the rules and regulations document uploading and processing module is used for receiving and processing rules and regulations documents of different formats uploaded by the user;
the rules and regulations text splitting and warehousing module is used for splitting the structured regulation text into chapters and sections, consolidating the text information of each natural paragraph, and finally storing the standardized text in the database;
the crawler module is used for collecting laws and regulations published on the Internet;
the algorithm module is used for analyzing the query text, obtaining detailed information about the retrieved text, and converting unstructured data into structured text; the algorithm module comprises a Chinese text word segmentation algorithm, a Chinese text dependency syntax analysis algorithm, an OCR character recognition algorithm, a Bert-ESIM text similarity calculation algorithm, a BM25 algorithm, and a TF-IDF algorithm;
the rules and regulations retrieval and display module is used for integrating the query text receiving module and the algorithm module to obtain the required retrieval results, and for displaying the results, sorted in descending order, to the user as a Web page.
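The two-stage scoring of claim 1 (steps 2-2 through S4) can be illustrated with a short sketch. This is an assumption-laden illustration rather than the claimed implementation: the function names, the default weights `c=0.3` and `d=0.25`, the shortlist size `n`, and the sample scores are all hypothetical, and the input scores A_n and Y_n are assumed to be already scaled to a comparable range, which the claims do not specify. The semantic score is passed in as a callable standing in for the Bert-ESIM model.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of segmented words."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def word_score(a_n, b_n, c=0.3):
    """X_n = c*A_n + (0.5 - c)*B_n, with c restricted to [0, 0.5]."""
    if not 0.0 <= c <= 0.5:
        raise ValueError("c must lie in [0, 0.5]")
    return c * a_n + (0.5 - c) * b_n


def composite_score(x_n, y_n, d=0.25):
    """Z_n = d*X_n + (0.5 - d)*Y_n, with d restricted to [0, 0.5]."""
    if not 0.0 <= d <= 0.5:
        raise ValueError("d must lie in [0, 0.5]")
    return d * x_n + (0.5 - d) * y_n


def rank_results(candidates, semantic_score, n=2, c=0.3, d=0.25):
    """Rank candidates given as (doc_id, A_n, B_n) tuples.

    semantic_score: callable doc_id -> Y_n, a stand-in for Bert-ESIM.
    Returns (doc_id, Z_n) pairs sorted by Z_n in descending order.
    """
    # Stage 1: word-segmentation-based score X_n, keep the top n.
    stage1 = [(word_score(a, b, c), doc_id) for doc_id, a, b in candidates]
    stage1.sort(key=lambda pair: pair[0], reverse=True)
    shortlist = stage1[:n]

    # Stage 2: semantic score Y_n is computed only for the shortlist.
    stage2 = [
        (composite_score(x, semantic_score(doc_id), d), doc_id)
        for x, doc_id in shortlist
    ]
    stage2.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, z) for z, doc_id in stage2]


# Hypothetical title match for B_n and made-up A_n / Y_n values.
title_match = jaccard(["leave", "approval"], ["leave", "approval"])
candidates = [("a", 0.9, 0.2), ("b", 0.4, title_match), ("c", 0.1, 0.0)]
semantic = {"a": 0.9, "b": 0.1}
ranked = rank_results(candidates, semantic.get, n=2)
```

Computing Y_n only for the shortlisted results reflects the ordering of steps S2 and S3, where the more expensive semantic model runs after the word-based screening.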
CN202110686425.2A 2021-06-21 2021-06-21 Deep learning-based regulation system retrieval method and system Active CN113535936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110686425.2A CN113535936B (en) 2021-06-21 2021-06-21 Deep learning-based regulation system retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110686425.2A CN113535936B (en) 2021-06-21 2021-06-21 Deep learning-based regulation system retrieval method and system

Publications (2)

Publication Number Publication Date
CN113535936A true CN113535936A (en) 2021-10-22
CN113535936B CN113535936B (en) 2024-02-13

Family

ID=78125503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110686425.2A Active CN113535936B (en) 2021-06-21 2021-06-21 Deep learning-based regulation system retrieval method and system

Country Status (1)

Country Link
CN (1) CN113535936B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676306A (en) * 2022-03-28 2022-06-28 河南经贸职业学院 Computer information sieving mechanism based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020052871A1 (en) * 2000-11-02 2002-05-02 Simpleact Incorporated Chinese natural language query system and method
WO2015080561A1 (en) * 2013-11-27 2015-06-04 Mimos Berhad A method and system for automated relation discovery from texts
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109871468A (en) * 2019-02-01 2019-06-11 国网四川省电力公司广元供电公司 Non-structured document management and rules and regulations entry management integration system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
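刘玉林; 郭雅娟; 陈锦铭; 陈昊: "Development of a duplicate-checking system for power grid bidding documents based on natural language processing technology", Electric Power Information and Communication Technology (电力信息与通信技术), no. 05, 15 May 2018 (2018-05-15) *
张达夫: "Processing of long and difficult queries based on dependency relation matching", Computer Knowledge and Technology (电脑知识与技术), no. 19, 5 July 2012 (2012-07-05) *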

Also Published As

Publication number Publication date
CN113535936B (en) 2024-02-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant