CN115587800A - Notarization document error correction method and device, electronic device and storage medium - Google Patents


Info

Publication number
CN115587800A
Authority
CN
China
Prior art keywords
entity
notarization
target
document
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211355110.0A
Other languages
Chinese (zh)
Inventor
陈艳
许静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Faxin Gongzhengyun Xiamen Technology Co ltd
Original Assignee
Faxin Gongzhengyun Xiamen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Faxin Gongzhengyun Xiamen Technology Co ltd
Priority to CN202211355110.0A
Publication of CN115587800A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application provide a knowledge-graph-based notarization document error correction method and device, an electronic device, and a storage medium, and relate to the technical field of computers. The method comprises the following steps: acquiring unknown words of a target notarization document; performing fuzzy matching on the unknown words against each entity in a notarization domain knowledge graph to obtain target entities having an entity link relation with the unknown words; and replacing the unknown words in the target notarization document with the target entities. The embodiments of the application solve the problem of low efficiency in reviewing notarization documents in the related art.

Description

Notarization document error correction method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computers, and in particular to a knowledge-graph-based method and device for correcting errors in notarization documents, an electronic device, and a storage medium.
Background
A notarization document is the vehicle of notarial credibility and is endowed with certifying authority by the state. Its standardization, logical rigor, and content accuracy are therefore checked repeatedly within the notarization institution to ensure accuracy and uphold the authority of the document.
However, in the current business scenario, notarization documents still rely on a multi-level manual review mechanism set up inside the notarization institution for text verification and content review, and this verification and review work consumes a large amount of human resources. Moreover, manual verification and review depend heavily on the experience and knowledge of the reviewers, so the quality of the issued notarization documents is uneven.
As can be seen from the above, the low efficiency of reviewing notarization documents is a problem that urgently needs to be solved.
Disclosure of Invention
Embodiments of the present application provide a method and an apparatus for correcting a notarization document error based on a knowledge graph, an electronic device, and a storage medium, which can solve the problem of low efficiency in auditing the notarization document in the related art. The technical scheme is as follows:
According to one aspect of the embodiments of the application, a knowledge-graph-based notarization document error correction method comprises the following steps: acquiring unknown words of a target notarization document; performing fuzzy matching on the unknown words against each entity in a notarization domain knowledge graph to obtain target entities having an entity link relation with the unknown words; and replacing the unknown words in the target notarization document with the target entities.
According to an aspect of an embodiment of the present application, a knowledge-graph-based notarization document error correction device comprises: an acquisition module for acquiring the unknown words of the target notarization document; a fuzzy matching module for performing fuzzy matching on the unknown words against each entity in the notarization domain knowledge graph to obtain target entities having an entity link relation with the unknown words; and an entity replacing module for replacing the unknown words in the target notarization document with the target entities.
According to an aspect of an embodiment of the present application, an electronic device includes at least one processor, at least one memory, and at least one communication bus. A computer program is stored in the memory, and the processor reads the computer program from the memory through the communication bus. When executed by the processor, the computer program implements the notarization document error correction method described above.
According to an aspect of an embodiment of the present application, a storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the notarization document error correction method described above.
According to an aspect of an embodiment of the present application, a computer program product includes a computer program stored in a storage medium. A processor of a computer device reads the computer program from the storage medium and executes it, so that the computer device implements the notarization document error correction method described above.
The technical scheme provided by the application brings the following beneficial effects:
In the technical scheme, the unknown words in the target notarization document are first acquired; a target entity having an entity link relation with each unknown word is then found in the notarization domain knowledge graph by fuzzy matching; and the target entity replaces the unknown word in the target notarization document, thereby completing error correction of the target notarization document. This improves the efficiency of reviewing notarization documents.
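The three steps summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `correct_document`, the sample entities, and the use of `difflib.get_close_matches` as the fuzzy matcher are not taken from the application, which does not specify a particular matching routine at this point.

```python
import difflib


def correct_document(text, unknown_words, graph_entities, cutoff=0.6):
    """Replace each unknown word with its closest knowledge-graph entity.

    Hypothetical helper: difflib ranks candidates by SequenceMatcher ratio,
    which stands in for the fuzzy matching step described in the text.
    """
    corrected = text
    for word in unknown_words:
        matches = difflib.get_close_matches(word, graph_entities, n=1, cutoff=cutoff)
        if matches:
            # Replace every occurrence of the unknown word with the target entity.
            corrected = corrected.replace(word, matches[0])
    return corrected


# Illustrative entity names only; a real graph would hold notarization entities.
entities = ["applicant", "notarization matters", "acceptance notice"]
doc = "The aplicant submitted the notarizaton matters."
print(correct_document(doc, ["aplicant", "notarizaton matters"], entities))
```

Words with no sufficiently close entity are left unchanged, which keeps the sketch conservative.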
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic illustration of an implementation environment according to the present application;
FIG. 2 is a flowchart illustrating a method for error correction of a notarization document based on a knowledge graph, in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram for one embodiment of step 310 in a corresponding embodiment of FIG. 2;
FIG. 4a is a block diagram of a model structure of a named entity prediction model in the corresponding embodiment of FIG. 3;
FIG. 4b is a flow diagram of a training process for a named entity predictive model in the corresponding embodiment of FIG. 3;
FIG. 5 is a flow diagram for one embodiment of step 330 in a corresponding embodiment of FIG. 2;
FIG. 6 is a flow diagram of a notary domain knowledge graph creation process in a corresponding embodiment of FIG. 2;
FIG. 7 is a flow chart for one embodiment of step 330 in the corresponding embodiment of FIG. 2;
FIG. 8 is a flowchart of one embodiment of step 331 in the corresponding embodiment of FIG. 7;
FIGS. 9a and 9b are schematic diagrams of an implementation of a method for correcting error of a notary document based on a knowledge graph in an application scenario;
FIG. 10 is a block diagram of a knowledge-graph-based notarization document error correction device, according to an exemplary embodiment;
FIG. 11 is a hardware block diagram of a server shown in accordance with an exemplary embodiment;
FIG. 12 is a block diagram illustrating a structure of an electronic device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
As described above, because the review of notarization documents depends on the experience and knowledge of the reviewers at the notarization institution, the review efficiency is low and the quality of the issued notarization documents is uneven.
In recent years, the state has vigorously advocated the "Fengqiao Experience", that is, resolving grassroots disputes under the guidance of basic policy through a diversified dispute resolution mechanism. Notarization services, as an important part of the diversified dispute resolution mechanism, should keep pace with the times, actively innovate in service capability and service mode, and live up to the service aim of being "efficient and convenient for the people".
Although notarization services have evolved from paper files to e-government, many notarization services still depend on manual work, and their labor cost remains high, which makes it difficult to meet rapidly growing service demand.
At present, in the field of notarization services, a multi-level manual review mechanism set up inside the notarization institution is still used for text verification and content review of notarization documents, which consumes a large amount of human resources. Moreover, manual verification and review depend heavily on the experience and background knowledge of the reviewers, so the quality of the issued notarization documents is unstable, and the practice is difficult to share, popularize, and turn into stable productivity. The informatization and intelligent service capability of the notarization field therefore needs to be improved.
As can be seen from the above, the related art still suffers from low efficiency in reviewing notarization documents.
Therefore, the notarization document error correction method provided by the application can effectively improve the efficiency of reviewing notarization documents and, in turn, the quality of the issued notarization documents. The method is suited to a notarization document error correction device, which can be deployed in an electronic device with a von Neumann architecture, for example a desktop computer, a notebook computer, or a server.
To make the objects, technical solutions and advantages of the present application more clear, the following detailed description of the embodiments of the present application will be made with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment related to a method for correcting errors in a notary document based on a knowledge graph. It should be noted that this implementation environment is only one example adapted to the present invention, and should not be considered as providing any limitation to the scope of the present invention.
The implementation environment includes an acquisition terminal 110 and a server 130.
Specifically, the acquisition terminal 110 may be an electronic device capable of collecting at least one of pictures, text, and multimedia data, which is not limited herein.
The server 130 may be an electronic device such as a desktop computer, a notebook computer, a server, or the like, or may be a computer device cluster formed by multiple servers, or even a cloud computing center formed by multiple servers. The server 130 is configured to provide a background service, for example, the background service includes, but is not limited to, a notarization document error correction service, and the like.
The server 130 and the acquisition terminal 110 are connected in advance through wired or wireless network communication, and data transmission between the server 130 and the acquisition terminal 110 is realized through the network communication connection. The data transmitted includes, but is not limited to: target notary documents, and the like.
Through the interaction between the acquisition terminal 110 and the server 130, the acquisition terminal 110 sends the target notarization document to the server 130, and the server 130 processes the received target notarization document in combination with the knowledge graph, thereby completing error correction of the target notarization document.
Referring to fig. 2, an embodiment of the present application provides a method for correcting errors in a notary document based on a knowledge graph, where the method is applied to an electronic device, and the electronic device may be the server 130 in the implementation environment shown in fig. 1.
In the following method embodiments, for convenience of description, the main execution subject of each step of the method is taken as an electronic device for illustration, but the method is not particularly limited to this configuration.
As shown in fig. 2, the method may include the steps of:
Step 310, acquiring the unknown words of the target notarization document.
The target notarization document refers to the process files and result files generated throughout the notarization service, including but not limited to interview records, notification letters, acceptance notices, and notarial certificates. The notarization documents involved in a notarization service differ according to the type of legal relationship of the entity concerned, so the target notarization document obtained differs with the actual notarization service requirement.
The unknown word of the target notary document is a word which is not included in the notary entity dictionary but must be segmented. In one possible implementation, as shown in fig. 3, in relation to obtaining the unknown words of the target notarization document, step 310 may further include the steps of:
Step 311, acquiring the target notarization document and performing named entity recognition on it to obtain a first entity.
As previously mentioned, the target notarization document refers to the process files and result files generated throughout the notarization service. Named entity recognition means recognizing, from the target notarization document, the words that refer to a specific name, including but not limited to person names, place names, legal document names, and organization names. Named entity recognition can be performed on the target notarization document by dictionary-based methods, rule-based methods, unsupervised learning, supervised learning, deep learning, and the like, which is not limited herein.
Taking a "statement" as the target notarization document, performing named entity recognition on it may yield first entities such as, but not limited to, the title of a law of the People's Republic of China, "notarization matters", and "applicant".
In one possible implementation, named entity recognition may be performed by invoking a named entity prediction model generated by training a neural network model. The named entity prediction model comprises a word vectorization (embedding) layer, a bidirectional long short-term memory (BiLSTM) layer, and a conditional random field (CRF) layer. As shown in FIG. 4a, a hidden layer may further be included between the BiLSTM layer and the CRF layer, and the hidden layer may be a self-attention layer, which is not limited herein.
Specifically, with reference to FIG. 4a, the training steps of the named entity prediction model are shown in FIG. 4b.
Step 410, inputting each training text in the training set into the word vectorization layer for conversion to obtain the character vector corresponding to each training text.
The training texts in the training set are labeled historical notarization documents and notarization-related legal documents.
It is first noted that the historical notarization documents and notarization-related legal documents used as training texts include, but are not limited to, unstructured data obtained from paper notarization files by optical character recognition (OCR), and structured data obtained from the database of a notarization service platform.
The unstructured data obtained by OCR are then converted into structured data and combined with the structured data obtained from the notarization service platform database to construct the corpus used for training.
After the training corpus is obtained, each piece of structured data in the corpus is labeled according to labeling rules, forming labeled historical notarization documents and notarization-related legal documents and thereby the training set used in model training. Specifically, the labeling rule may be the BIO rule (B: beginning of a named entity; I: non-initial part of a named entity; O: not part of a named entity), the BIOES rule (B, I, and O as above; E: end of a named entity; S: single-word entity), or a labeling rule set according to notarization-related laws (e.g. the Civil Code), regulations, and judicial interpretations, which is not limited herein. Taking the BIO rule as an example, the six-character Chinese sentence meaning "the house belongs to Jia" is labeled character by character as "B I O O B I": the two characters of "house" receive B and I, the two characters of "belongs to" receive O and O, and the two characters of the name receive B and I.
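As a small illustration of the BIO rule, the sketch below assigns one tag per character given the entity spans. The function name `bio_tags` and the placeholder characters are assumptions for illustration; the spans (0, 2) and (4, 6) mirror the two-character entities in the example above.

```python
def bio_tags(chars, entity_spans):
    """Return one BIO tag per character.

    entity_spans holds (start, end) index pairs with end exclusive.
    Hypothetical helper, not part of the application.
    """
    tags = ["O"] * len(chars)
    for start, end in entity_spans:
        tags[start] = "B"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I"          # remaining characters of the entity
    return tags


# Six placeholder characters standing in for the six-character sentence.
chars = list("c1c2c3"[::1])[:6]
print(bio_tags(chars, [(0, 2), (4, 6)]))  # ['B', 'I', 'O', 'O', 'B', 'I']
```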
Then, each training text in the training set is one-hot encoded to obtain the corresponding one-hot vector. The one-hot vector of each training text is input into the word vectorization layer (i.e., the embedding layer), which converts it into the character vector corresponding to that training text; the character vector is then input into the bidirectional long short-term memory layer (i.e., the BiLSTM layer).
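The conversion from a one-hot vector to a character vector amounts to multiplying the one-hot vector by an embedding matrix, which selects one row of that matrix. The sketch below shows this in plain Python; the vocabulary, the embedding size of 4, and the random initial values are assumptions for illustration, since in the model the matrix would be learned during training.

```python
import random


def one_hot(index, size):
    """One-hot vector with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(one_hot_vec, embedding_matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vec[r] * embedding_matrix[r][c] for r in range(len(one_hot_vec)))
        for c in range(dim)
    ]


# Hypothetical three-character vocabulary and randomly initialized matrix.
vocab = {"a": 0, "b": 1, "c": 2}
rng = random.Random(0)
matrix = [[rng.uniform(-1, 1) for _ in range(4)] for _ in vocab]

vec = embed(one_hot(vocab["b"], len(vocab)), matrix)
print(vec == matrix[1])  # the product equals row 1 of the matrix
```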
Step 430, inputting the character vectors corresponding to the training texts into the bidirectional long short-term memory layer, and training the bidirectional long short-term memory layer to obtain the labeling scores of the training texts.
The BiLSTM layer obtains the labeling score of each training text based on the corresponding character vector and inputs the labeling score into the conditional random field layer (i.e., the CRF layer).
Step 450, inputting the labeling scores of the training texts into the conditional random field layer, which calculates the optimal solution of the labeling sequence of each training text based on the labeling scores.
Step 470, if the optimal solution of the labeling sequence of each training text is not obtained, updating the model parameters; otherwise, obtaining the named entity prediction model.
Specifically, the CRF layer decodes the labeling scores of the training texts and calculates the optimal solution of the labeling sequence of each training text according to the relation between adjacent part-of-speech labels.
If the optimal solution of the labeling sequence of each training text is obtained, the training of the bidirectional long short-term memory layer is finished, and the named entity prediction model with named entity recognition capability is obtained.
Otherwise, if the CRF layer cannot calculate the optimal solution of the labeling sequence of each training text, the model parameters are updated and the bidirectional long short-term memory layer continues to be trained with the updated parameters, that is, the process returns to step 430 until the CRF layer can calculate the optimal solution of the labeling sequence of each training text.
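The decoding performed by the CRF layer can be illustrated with Viterbi decoding, a standard way to find the highest-scoring label sequence in a linear-chain CRF. The application does not name the decoding algorithm, so the choice of Viterbi, the function name `viterbi`, and all score values below are assumptions for illustration only.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]: score of tag j at position t (e.g. from the BiLSTM layer).
    transitions[i][j]: score of moving from tag i to tag j (held by the CRF layer).
    """
    n = len(tags)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(range(n), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[i] for i in path]


tags = ["B", "I", "O"]
# Illustrative transition scores: only O -> I is penalized, since an "I"
# label cannot follow an "O" label under the BIO rule.
transitions = [[0, 0, 0], [0, 0, 0], [0, -10, 0]]
emissions = [[2, 0, 1], [0, 1.5, 1], [0, 0.2, 2]]
print(viterbi(emissions, transitions, tags))  # ['B', 'I', 'O']
```

The transition scores are what let the CRF layer use the relation between adjacent labels: with the penalty above, a sequence whose per-position scores favor "O" followed by "I" is decoded as "B" followed by "I" instead.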
Furthermore, hyper-parameter optimization can be performed when training the named entity prediction model, so that the network does not rely too heavily on neuron weight updates alone to adjust its parameters, which alleviates overfitting.
In a possible implementation, Dropout is adopted when training the BiLSTM-CRF model: a portion of the neurons is randomly dropped in each training pass, and the dropped neurons do not take part in propagation. The optimal parameters obtained in each training pass are collected and stored, and after training is completed, the BiLSTM-CRF model with the optimal parameters is packaged as the named entity prediction model with named entity recognition capability.
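A minimal sketch of Dropout on a list of activations follows. It uses the common "inverted" variant, which rescales the surviving activations during training; the application does not say which variant is used, so that choice, the function name `dropout`, and the drop probability of 0.5 are assumptions.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout (assumed variant).

    During training, each activation is zeroed with probability drop_prob
    and the survivors are divided by the keep probability. Outside
    training, the activations pass through unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
out = dropout([1.0] * 1000, 0.5, rng)
print(out.count(0.0), "of 1000 activations were dropped in this pass")
```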
After the training process is completed, the named entity prediction model has the named entity recognition capability suitable for the target notarization document, so that the named entity recognition can be performed on the target notarization document to obtain the entities in the target notarization document.
Step 313, calculating a second similarity between the first entity and each second entity in the notarization entity dictionary.
The second entity is an entity stored in the notarization entity dictionary. As described above, the dictionary may be constructed in advance by performing named entity recognition on each training text with the named entity prediction model obtained in step 470.
The second similarity between the first entity and the second entity can be obtained by calculating a measure such as the Jaccard similarity coefficient, the cosine similarity, or the edit distance between the two entities.
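Two of the measures named above can be sketched as follows. Treating each entity as a set (Jaccard) or a bag (cosine) of characters is an assumption made for illustration; the application does not state which features the measures are computed over.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard coefficient over the character sets of two strings."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity over character-count vectors of two strings."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


print(jaccard("abc", "abd"))  # 0.5: two shared characters out of four distinct
print(cosine("ab", "cd"))     # 0.0: no characters in common
```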
Step 315, taking the first entity whose second similarity satisfies the set condition as the unknown word.
It is understood that the smaller the second similarity between the first entity and the second entity, the less similar the two are, that is, the lower the probability that the first entity is included in the notarization entity dictionary. The set condition can therefore be configured so that, when the second similarity of the two entities satisfies it, the first entity is regarded as not recorded in the notarization entity dictionary and is therefore an unknown word.
The set condition may be adapted to the similarity measure calculated in step 313. For example, if the second similarity is an edit distance, the set condition is an edit distance threshold: when the edit distance between the first entity and the second entity exceeds the threshold, the first entity is an unknown word.
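The edit-distance variant of steps 313 and 315 can be sketched as below. Comparing each first entity against its nearest dictionary entry (the minimum distance over all second entities) is an assumption, as are the function names and the sample threshold; the application only states that the distance to the second entity must exceed the threshold.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def find_unknown_words(first_entities, dictionary, threshold):
    """Keep the entities whose nearest dictionary entry is farther than threshold."""
    return [
        entity
        for entity in first_entities
        if min((edit_distance(entity, d) for d in dictionary), default=float("inf"))
        > threshold
    ]


# Hypothetical dictionary; with threshold 0, any entity that is not an
# exact dictionary entry is treated as an unknown word.
dictionary = ["applicant", "notary"]
print(find_unknown_words(["applicant", "aplicant"], dictionary, 0))  # ['aplicant']
```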
Step 330, performing fuzzy matching on the unknown words against each entity in the notarization domain knowledge graph to obtain target entities having an entity link relation with the unknown words.
The target entity refers to an entity in the notarization domain knowledge graph that has the same or a similar meaning to the unknown word.
It is first noted that the notarization domain knowledge graph describes the concepts of the notarization domain and their interrelationships. The notarization domain knowledge graph therefore needs to be created before fuzzy matching is performed on the unknown words.
As shown in FIG. 6, creating the notarization domain knowledge graph may include the following steps:
Step 610, acquiring the historical notarization documents and notarization-related legal documents, and performing named entity recognition on them to obtain a plurality of training entities.
As described above, the named entity recognition can be performed on the historical notarization documents and the legal documents related to the notarization by methods based on rules, unsupervised learning, supervised learning, deep learning and the like, and the obtained recognition results are a plurality of training entities.
In a possible implementation manner, in this embodiment, a named entity prediction model is called to identify named entities in a historical notarization document and a notarization-related legal document, so as to obtain a plurality of training entities.
Step 630, extracting the association relationship between the training entities.
The association relationships may be extracted by a rule-based relationship extraction method, predefined relationship types, a deep learning method, and the like, which is not limited herein. In one possible implementation, the association relationships among the training entities are extracted by a rule-based relationship extraction method. First, using the notarization entity dictionary and the notarization legal dictionary, word segmentation and part-of-speech tagging are performed on the historical notarization documents and notarization-related legal documents with the doccano part-of-speech tagging tool, yielding the tagged documents.
It is noted that the training entities can be obtained in step 610, or by using the notarization entity dictionary to segment the historical notarization documents and notarization-related legal documents. The notarization entity dictionary is constructed in advance by performing named entity recognition on the historical notarization documents and notarization-related legal documents with the named entity prediction model obtained in step 470. Similarly, the notarization legal dictionary can be constructed in advance by performing named entity recognition on notarization-related legal documents (such as the Civil Code of the People's Republic of China) with the named entity prediction model; the notarization legal dictionary is then used to segment the notarization-related legal documents, yielding the training entities whose association relationships are to be extracted.
Second, the training entities are part-of-speech tagged with an NLP (Natural Language Processing) text tagging tool (e.g., the doccano part-of-speech tagging tool). The training entities are also tagged according to tagging rules set by notarization-related laws and regulations, judicial interpretations, and the like. Then, the technical specification of the national notarization comprehensive management information system is used to configure the relationship extraction rules, as shown in Table 1.
TABLE 1 Relationship extraction rules
[Table 1 appears as an image (BDA0003919756570000091) in the original publication.]
Finally, the association relationships among the part-of-speech-tagged training entities are extracted by pattern matching based on the relationship extraction rules. Pattern matching is a basic string operation in data structures: given a substring, at least one occurrence of that substring is sought within a given character string. For example, as shown in Table 1, given the substring "belongs", if "belongs" occurs in a character string composed of several part-of-speech-tagged training entities from the historical notarization documents and the notarization-related legal documents, the association relationship among those training entities is considered a dependency relationship.
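The pattern-matching step can be illustrated with a minimal Python sketch. Because the contents of Table 1 are not available here, the rule table `RELATION_RULES`, the function name `extract_relations`, and the example sentence are hypothetical; only the idea that a "belongs" trigger indicates a dependency relationship comes from the description.

```python
# Hypothetical rule table: trigger substring -> relationship label.
# Only the "belongs" trigger is taken from the description; a real system
# would load the complete rule set of Table 1.
RELATION_RULES = {"belongs to": "dependency"}


def extract_relations(sentence, entities):
    """Return (head, relation, tail) triples found by substring matching.

    A rule fires when its trigger occurs in the sentence; entities that
    start before the trigger become heads, entities after it become tails.
    """
    triples = []
    for trigger, relation in RELATION_RULES.items():
        pos = sentence.find(trigger)
        if pos == -1:
            continue  # trigger substring not present, rule does not fire
        heads = [e for e in entities if 0 <= sentence.find(e) < pos]
        tails = [e for e in entities
                 if sentence.find(e, pos + len(trigger)) != -1]
        for head in heads:
            for tail in tails:
                if head != tail:
                    triples.append((head, relation, tail))
    return triples


triples = extract_relations("The house belongs to Li.", ["house", "Li"])
```

Plain substring search is the simplest form of pattern matching; a production rule set would more likely combine triggers with the part-of-speech tags mentioned above.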
Step 650, storing each training entity in each node, and constructing a path between corresponding nodes based on the association relationship between each training entity.
First, it should be explained that the notarization field knowledge graph includes nodes and paths. The nodes store the training entities and their entity attributes, such as format attributes and text descriptions. The paths connect nodes that have association relationships and store the association relationships between those nodes (i.e., between the training entities), such as "belongs to", "is", and the like.
When an association relationship exists between two nodes, a path can be constructed based on the association relationship to connect the two nodes. For example, the association relationship between the training entities "Li" (a surname) and "house" is that the house belongs to Li, so a path is constructed between the two, giving "Li <-belongs to- house".
Step 670, constructing the notarization field knowledge graph according to each node and each path.
After the association relationships among the nodes are extracted, paths are constructed for the nodes based on those association relationships and the nodes are connected, yielding the notarization field knowledge graph. In one possible implementation, the training entities and association relationships of the notarization field knowledge graph are stored in a Neo4j graph database, which is not limited herein.
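Steps 650 and 670 can be sketched with a small in-memory structure. This is only an illustration under stated assumptions: the class and function names (`Node`, `KnowledgeGraph`, `build_graph`) are hypothetical, and the dictionary-based storage is a stand-in for a graph database rather than the Neo4j API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node stores one training entity and its entity attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


class KnowledgeGraph:
    def __init__(self):
        self.nodes = {}  # entity name -> Node
        self.paths = []  # (source, relation, target) triples

    def add_entity(self, name, **attributes):
        node = self.nodes.setdefault(name, Node(name))
        node.attributes.update(attributes)
        return node

    def add_path(self, source, relation, target):
        # A path connects two nodes and stores their association relationship.
        self.add_entity(source)
        self.add_entity(target)
        triple = (source, relation, target)
        if triple not in self.paths:
            self.paths.append(triple)

    def neighbors(self, name):
        return [(rel, tgt) for src, rel, tgt in self.paths if src == name]


def build_graph(entities, triples):
    """entities: name -> attribute dict; triples: (source, relation, target)."""
    graph = KnowledgeGraph()
    for name, attrs in entities.items():
        graph.add_entity(name, **attrs)
    for source, relation, target in triples:
        graph.add_path(source, relation, target)
    return graph


graph = build_graph(
    {"house": {"description": "real estate"}, "Li": {}},
    [("house", "belongs to", "Li")],
)
```

Storing the text description as a node attribute keeps it available for the later similarity computation, which queries descriptions per entity.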
After the notarization field knowledge graph is constructed, fuzzy matching of the unknown words in the target notarization document can be performed.
Fuzzy matching means searching the notarization field knowledge graph for entities that have the same or a similar meaning as the unknown word. It can be understood that when no entity with the same meaning as the unknown word exists in the notarization field knowledge graph, the search continues for entities with a similar meaning, and the target entity is finally determined from the entities found.
It can be understood that if entities with the same or similar meanings cannot be found in the knowledge graph through fuzzy matching, the fuzzy matching fails.
In one possible implementation, as shown in fig. 5, step 330 further includes the following steps:
Step 510, if the fuzzy matching fails, creating a new node in the notarization field knowledge graph.
Step 530, link the unknown word to the new node.
When the fuzzy matching fails, no target entity corresponding to the unknown word exists in the notarization field knowledge graph. A new node can therefore be created in the notarization field knowledge graph and the unknown word linked to it, so that the entity attributes and association relationships of that node can subsequently be updated in the knowledge graph, realizing self-updating of the knowledge graph.
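The fallback in steps 510 and 530 can be sketched as follows. The graph is represented here as a plain dictionary, and `exact_match` is a deliberately trivial, hypothetical stand-in for the fuzzy matcher; neither is prescribed by the description.

```python
def exact_match(graph, word):
    """Trivial stand-in matcher: succeeds only on an exact name match."""
    return word if word in graph else None


def link_unknown_word(graph, unknown_word, match=exact_match):
    """Return (linked_entity, created_new_node).

    graph maps entity names to attribute dicts. When matching fails, a new
    node with empty attributes is created and the unknown word is linked to
    it, so its attributes and relationships can be filled in later.
    """
    target = match(graph, unknown_word)
    if target is not None:
        return target, False
    graph[unknown_word] = {"description": None, "relations": []}
    return unknown_word, True


graph = {"notary office": {"description": "institution", "relations": []}}
first = link_unknown_word(graph, "e-notary kiosk")   # no match: node created
second = link_unknown_word(graph, "e-notary kiosk")  # now found in the graph
```

Because the new node is part of the graph afterwards, a later occurrence of the same word matches directly, which is one way to read the self-updating behavior described above.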
Step 350, replacing the unknown words of the target notarization document with the target entities.
After the target entity is obtained, the corresponding unknown word in the target notarization document is replaced with the target entity, completing the error correction of the target notarization document.
With respect to error correction, the target notarization document may contain different entities that share the same name, or different names that refer to the same entity. The former is the case where the entity names are the same but the meanings differ; for example, the name "apple" can correspond to "apple (fruit)", "Apple (company)", "Apple mobile phone", and so on. The latter is the case where the meaning is the same but the names differ; for example, names such as "THU" and "Tsinghua" can both correspond to the entity "Tsinghua University". Replacing the unknown words in the target notarization document with target entities therefore essentially performs entity disambiguation for both phenomena.
In one possible implementation, to make it easier for the notary to verify the replaced parts of the target notarization document, the notary is prompted by marking the replaced parts with shading, highlighting, comments, and the like. The notary's corrections of erroneous system replacements are recorded and collected as a training set for the next round of system updates.
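A minimal sketch of step 350 together with the bookkeeping needed for highlighting is given below. The function name `replace_unknown_words` and the record layout are assumptions made for illustration; the description does not specify how replaced spans are tracked.

```python
import re


def replace_unknown_words(text, replacements):
    """Replace unknown words and record where each replacement landed.

    replacements maps unknown word -> target entity. Returns the corrected
    text and a list of records whose start/end offsets refer to the
    corrected text, so a user interface can shade or highlight them.
    """
    if not replacements:
        return text, []
    # Longest words first, so a short word never pre-empts a longer one.
    words = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in words))

    pieces, records, cursor, length = [], [], 0, 0
    for m in pattern.finditer(text):
        unchanged = text[cursor:m.start()]
        pieces.append(unchanged)
        length += len(unchanged)
        target = replacements[m.group(0)]
        records.append({"original": m.group(0), "replacement": target,
                        "start": length, "end": length + len(target)})
        pieces.append(target)
        length += len(target)
        cursor = m.end()
    pieces.append(text[cursor:])
    return "".join(pieces), records


corrected, records = replace_unknown_words(
    "The applicant studied at THU.", {"THU": "Tsinghua University"})
```

Keeping the original word in each record also gives the notary's later corrections something to be compared against when they are collected as training data.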
Through the above process, the unknown words in the target notarization document are first obtained; a target entity having an entity link relationship with each unknown word is then searched for in the notarization field knowledge graph by fuzzy matching; and the unknown word is replaced with the target entity in the target notarization document. Errors in the target notarization document are thus found and corrected intelligently, which improves the quality of the issued notarization document and addresses the low efficiency of notarization document review.
Referring to fig. 7, a possible implementation manner is provided in the embodiment of the present application, and step 330 may further include the following steps:
Step 331, determining the similarity between the unknown word and each entity in the notarization field knowledge graph.
As mentioned above, if no entity with the same meaning as the unknown word can be found in the knowledge graph by fuzzy matching, entities with a similar meaning are searched for. There may be several such entities, only one, or none at all, which is not limited herein.
In this embodiment, the target entity is selected from the entities in the notarization field knowledge graph based on the similarity between the unknown word and each entity. This similarity can be measured directly between the unknown word and the entity itself, or between the entity text segment corresponding to the unknown word and the text description corresponding to each entity, which is not limited herein.
In one possible implementation, as shown in fig. 8, step 331 may include the following steps:
Step 3311, extracting the entity text segment corresponding to the unknown word in the target notarization document, and converting the entity text segment into an entity text vector.
The entity text segment refers to the context in the target notarization document that contains the unknown word.
After the entity text segment corresponding to the unknown word is extracted, the corresponding entity text vector can be obtained through conversion processing such as word vectorization.
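Step 3311 can be illustrated with a deliberately simple sketch. The window size, the whitespace tokenization, and the use of term counts over a fixed vocabulary are all assumptions made for illustration; the description only requires "conversion processing such as word vectorization", and Chinese text would first need word segmentation.

```python
def extract_segment(tokens, index, window=2):
    """Return the tokens around position `index`, including the word itself."""
    start = max(0, index - window)
    return tokens[start:index + window + 1]


def to_count_vector(tokens, vocabulary):
    """Map a token list to a vector of term counts over a fixed vocabulary.

    This count vector is a stand-in for a trained word-vector model.
    """
    return [tokens.count(term) for term in vocabulary]


tokens = "the house located in Xiamen belongs to the applicant".split()
segment = extract_segment(tokens, tokens.index("Xiamen"), window=2)
vector = to_count_vector(segment, ["house", "in", "belongs"])
```

The same `to_count_vector` helper could be applied to an entity's text description, so that both vectors share one vocabulary and can be compared dimension by dimension.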
Step 3313, based on the notarization field knowledge graph, query the text description corresponding to each entity, and perform word vector transformation on each entity and the corresponding text description to obtain the entity description vector corresponding to each entity.
The text description corresponding to each entity is stored in the notarization field knowledge graph as an entity attribute of that entity, so it can be obtained by querying the notarization field knowledge graph.
After the text description corresponding to each entity is obtained by query, the entity description vector corresponding to each entity can be obtained through conversion processing such as word vectorization.
Step 3315, calculate the first similarity of the entity text vector and each entity description vector, and obtain the similarity of the unknown word and each entity, respectively.
The first similarity indicates how similar the unknown word is to an entity in the notarization field knowledge graph. The first similarity may be a Jaccard similarity coefficient, a cosine similarity, an edit distance, and the like, which is not limited herein.
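For reference, two of the measures named above can be sketched in a few lines of Python. Applying them to character sequences is an assumption made here for illustration; note also that edit distance is a distance, so smaller values mean greater similarity.

```python
def jaccard(a, b):
    """Jaccard coefficient of the element sets of two sequences."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty sets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ch_a != ch_b))) # substitution
        prev = cur
    return prev[-1]
```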
Taking cosine similarity as the first similarity as an example, let the entity description vector be $A = (a_1, a_2, \ldots, a_n)$ and the entity text vector be $B = (b_1, b_2, \ldots, b_n)$. The cosine similarity can then be calculated by formula (1), where $\mathrm{similarity}_1$ is the cosine similarity between the entity description vector and the entity text vector.
$$\mathrm{similarity}_1 = \cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}} \tag{1}$$
Therefore, the similarity between the unknown word and the corresponding entity can be determined from the cosine similarity between the entity description vector and the entity text vector. It should be understood that the higher the cosine similarity, the more similar the unknown word is to that entity.
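The standard cosine similarity between two equal-length vectors can be computed as in the sketch below. The function name and the choice to return 0.0 for a zero vector are assumptions; the description does not say how that degenerate case is handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, a long text description and a short context segment can still score highly when they use the same terms in similar proportions.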
Step 333, based on the determined similarity, a target entity is screened from the entities.
It can be understood that the higher the first similarity between the entity text vector and an entity description vector, the more similar the unknown word is to the corresponding entity. Therefore, the first similarities calculated for the entities can be sorted, the entity description vector with the highest first similarity selected as the target entity vector, and the entity corresponding to the target entity vector taken as the target entity. Alternatively, a similarity threshold may be set, and the entity description vector whose first similarity both exceeds the threshold and is the highest is selected as the target entity vector, with the corresponding entity taken as the target entity, which is not limited herein.
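Both selection strategies of step 333 fit in one small function. The name `select_target` and the convention of returning `None` when no entity qualifies are assumptions for illustration.

```python
def select_target(similarities, threshold=None):
    """Pick the entity with the highest first similarity.

    similarities maps entity name -> similarity score. If a threshold is
    given, the best entity is returned only when its score exceeds it;
    otherwise None is returned, which corresponds to a failed fuzzy match.
    """
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    if threshold is not None and similarities[best] <= threshold:
        return None
    return best


scores = {"notary office": 0.91, "notary": 0.74, "office": 0.40}
chosen = select_target(scores, threshold=0.8)
```

Using a threshold trades recall for precision: without it some entity is always chosen, whereas with it a weak best match is rejected and can be handled by the new-node fallback instead.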
With this embodiment, the target entity is determined from the similarity between the unknown word and each entity of the notarization field knowledge graph, and the unknown word in the target notarization document is replaced with the target entity. This improves the error correction accuracy for the target notarization document, improves the quality of the issued notarization document, and further improves the efficiency of notarization document review.
Fig. 9a and 9b are schematic diagrams illustrating a specific implementation of a notarization document error correction method in an application scenario. This application scenario applies to the implementation environment shown in fig. 1.
In this application scenario, as shown in fig. 9a, the notarization document error correction process includes three stages: a notarization document parsing stage, a knowledge matching stage, and a document replacement and display stage.
Referring now to FIG. 9b, the following detailed description of the various stages of the notary document error correction process is provided:
First, the notarization document parsing stage specifically comprises:
Notarization document input 701: in step 801, the target notarization document is obtained. The target notarization document is a notarization document that needs to be reviewed before it is issued.
Unknown word extraction 703: in step 803, the unknown words of the target notarization document are extracted.
Entity relationship link 705: based on the pre-constructed notarization field knowledge graph, the named entities in the historical notarization documents and the notarization-related legal documents, together with the relationships among those named entities, are provided.
Second, the knowledge matching stage specifically comprises:
Entity relationship link 707: the named entities in the historical notarization documents and the notarization-related legal documents, together with the relationships among them, are passed from the notarization field knowledge graph of the parsing stage to the knowledge matching stage.
Fuzzy matching 709: in step 805, fuzzy matching is performed on the unknown word based on the notarization field knowledge graph to obtain the target entity.
Entity disambiguation 711: after the target entity is obtained, entity disambiguation can be performed for different entities sharing the same name, or different names referring to the same entity, in the target notarization document.
Finally, the document replacement and display stage specifically comprises:
Unknown word replacement 713: in step 807, the unknown words of the target notarization document are replaced with the target entities.
Interface display 715: in step 809, the replaced target notarization document is displayed to the notary on a corresponding display page, so that the notary can conveniently check the replaced parts, which provides expert supervision. The notary can be prompted by shading, highlighting, commenting, and the like on the replaced parts.
In this application scenario, the target entity is obtained by fuzzy matching the unknown words of the target notarization document against the notarization field knowledge graph, and the unknown words in the target notarization document are replaced with the target entity, completing the error correction of the target notarization document.
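The three stages can be strung together in a compact end-to-end sketch. Several simplifications are assumed: candidate entities are passed in rather than produced by named entity recognition, an unknown word is simply a candidate absent from the graph, Python's `difflib.get_close_matches` (string similarity) is swapped in for the knowledge-graph fuzzy matching, and double square brackets stand in for on-screen highlighting.

```python
import difflib


def correct_document(text, candidate_entities, graph_entities, cutoff=0.8):
    """Run parsing, knowledge matching, and replacement on one document."""
    # Stage 1, parsing: candidates missing from the graph are unknown words.
    unknown_words = [w for w in candidate_entities if w not in graph_entities]

    # Stage 2, knowledge matching: closest graph entity above the cutoff.
    mapping = {}
    for word in unknown_words:
        hits = difflib.get_close_matches(word, graph_entities, n=1,
                                         cutoff=cutoff)
        if hits:
            mapping[word] = hits[0]

    # Stage 3, replacement and display: mark each replaced part so the
    # notary can review it.
    for word, target in mapping.items():
        text = text.replace(word, f"[[{target}]]")
    return text, mapping


shown, mapping = correct_document(
    "The notarial certficate was issued.",
    candidate_entities=["certficate", "zzzz"],
    graph_entities=["certificate", "notary office"],
)
```

Unknown words without a sufficiently close entity (here the made-up "zzzz") are left untouched, which is where the new-node fallback described earlier would apply.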
The following are embodiments of the apparatus of the present application that may be used to perform the notary document error correction method of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the method embodiments of the notary document error correction method related to the present application.
Referring to fig. 10, in the embodiment of the present application, a notary document error correction apparatus 900 based on knowledge-graph is provided, which includes but is not limited to: an acquisition module 910, a fuzzy matching module 930, and an entity replacement module 950.
The acquisition module 910 is configured to obtain the unknown words of the target notarization document.
The fuzzy matching module 930 is configured to perform fuzzy matching on the unknown words based on each entity in the notarization field knowledge graph to obtain target entities having entity link relationships with the unknown words.
The entity replacement module 950 is configured to replace the unknown words of the target notarization document with the target entities.
It should be noted that, when the notarization document error correction apparatus provided in the above embodiment performs notarization document error correction, the division into the above functional modules is only an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the notarization document error correction apparatus may be divided into different functional modules to complete all or part of the functions described above.
In addition, the notarization document error correction apparatus provided by the above embodiment and the notarization document error correction method belong to the same concept, wherein the specific manner in which each module executes operations has been described in detail in the method embodiment, and is not described herein again.
FIG. 11 illustrates a structural schematic of a server in accordance with an exemplary embodiment. The server is suitable for use in the server 130 of the implementation environment shown in fig. 1.
It should be noted that the server is only an example adapted to the application and should not be considered as providing any limitation to the scope of the application. The server should not be interpreted as having to rely on or have to have one or more components of the exemplary server 2000 illustrated in fig. 11.
The hardware structure of the server 2000 may vary greatly with configuration or performance. As shown in fig. 11, the server 2000 includes: a power supply 210, an interface 230, at least one memory 250, and at least one Central Processing Unit (CPU) 270.
Specifically, the power supply 210 is used to provide operating voltages for the various hardware devices on the server 2000.
The interface 230 includes at least one wired or wireless network interface 231 for interacting with external devices, for example, for the interaction between the terminal 100 and the server 200 in the implementation environment shown in fig. 1.
Of course, in other examples of the present application, the interface 230 may further include at least one serial-to-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, and the like, as shown in fig. 11, which is not limited thereto.
The memory 250 serves as a carrier for resource storage and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored thereon include an operating system 251, an application 253, data 255, and the like, and the storage may be transient or permanent.
The operating system 251 manages and controls the hardware devices and the application 253 on the server 2000, so that the central processing unit 270 can operate on and process the mass data 255 in the memory 250. The operating system may be Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
The application 253 is a computer program that performs at least one specific task on the operating system 251, and may include at least one module (not shown in fig. 11), each of which may include a computer program for the server 2000. For example, the notary document error correction device may be considered as an application 253 deployed on the server 2000.
Data 255 may be photographs, pictures, and the like stored on a disk, or may be the notarization field knowledge graph, target notarization documents, and the like stored in the memory 250.
The central processor 270 may include one or more processors and is configured to communicate with the memory 250 through at least one communication bus to read the computer programs stored in the memory 250, so as to operate on and process the mass data 255 in the memory 250. For example, the notarization document error correction method is accomplished by the central processor 270 reading a series of computer programs stored in the memory 250.
Furthermore, the present application can also be implemented by hardware circuits or hardware circuits in combination with software, and therefore, the implementation of the present application is not limited to any specific hardware circuits, software, or a combination of the two.
Referring to fig. 12, in an embodiment of the present application, an electronic device 4000 is provided, where the electronic device 4000 may include: desktop computers, notebook computers, servers, and the like.
In fig. 12, the electronic device 4000 includes at least one processor 4001, at least one communication bus 4002, and at least one memory 4003.
Processor 4001 is coupled to memory 4003, such as by communication bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. It should be noted that the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs computing functions, e.g., a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
Communication bus 4002 may include a path that carries information between the aforementioned components. The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 12, but this is not intended to represent only one bus or type of bus.
The memory 4003 may be a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
A computer program is stored in the memory 4003, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002.
The computer program realizes the notarization document error correction method in the above embodiments when executed by the processor 4001.
In addition, in the embodiments of the present application, a storage medium is provided, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the notarization text error correction method in the above embodiments.
A computer program product is provided in an embodiment of the present application, the computer program product comprising a computer program stored in a storage medium. The processor of the computer device reads the computer program from the storage medium, and the processor executes the computer program, so that the computer device executes the notary document error correction method in each of the above embodiments.
Compared with the prior art, the present application first obtains the unknown words in the target notarization document, searches the notarization field knowledge graph by fuzzy matching for the target entity having an entity link relationship with each unknown word, and replaces the unknown word with the target entity in the target notarization document. Errors in the target notarization document are thereby found and corrected intelligently, which improves the quality of the issued notarization document and addresses the low efficiency of notarization document review.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. The steps are not strictly limited to the order shown and may be performed in other orders unless otherwise indicated herein. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A notarization document error correction method based on knowledge graph is characterized by comprising the following steps:
acquiring unknown words of the target notarization document;
based on each entity in the notarization field knowledge graph, carrying out fuzzy matching on the unknown words to obtain target entities having an entity link relationship with the unknown words;
replacing the unknown words of the target notary document with the target entity.
2. The method of claim 1, wherein the fuzzy matching of the unknown word based on each entity in the notary domain knowledge graph to obtain a target entity having an entity link relationship with the unknown word comprises:
determining similarity of the unknown words and each entity in the notarization field knowledge graph;
based on the determined similarity, the target entity is screened from each of the entities.
3. The method of claim 2, wherein determining, for each of the entities in the notary domain knowledge graph, a similarity of the unknown word to each of the entities comprises:
extracting an entity text segment corresponding to the unknown word in the target notarization document, and converting the entity text segment into an entity text vector;
inquiring text description corresponding to each entity based on the notarization field knowledge graph, and performing word vector conversion on each entity and the corresponding text description to obtain an entity description vector corresponding to each entity;
and calculating the first similarity of the entity text vector and each entity description vector to respectively obtain the similarity of the unknown words and each entity.
4. The method of claim 1, wherein after fuzzy matching of the unknown word based on entities in the notary domain knowledge graph to obtain a target entity having an entity relationship with the unknown word, the method further comprises:
if the fuzzy matching fails, a new node is created in the notarization field knowledge graph;
and linking the unknown words to the new nodes to perform fuzzy matching based on the updated notarization field knowledge graph.
5. The method of claim 1, wherein the obtaining of the unknown word of the target notarization document comprises:
acquiring the target notarization document, and carrying out named entity identification on the target notarization document to obtain a first entity;
calculating a second similarity of a second entity to the first entity based on the second entity in the notary entity dictionary;
and taking the first entity with the second similarity meeting a set condition as the unknown word.
6. The method of claim 1, wherein the method further comprises:
acquiring a historical notarization document and a notarization-related legal document, and carrying out named entity recognition on the historical notarization document and the notarization-related legal document to obtain a plurality of training entities;
extracting the association relationships among the training entities;
respectively storing each training entity in each node, and constructing a path between corresponding nodes based on the association relation between each training entity;
and constructing the notarization field knowledge graph according to each node and each path.
7. The method of claim 5 or 6, wherein the named entity recognition is implemented by invoking a named entity prediction model; the named entity prediction model comprises a word vectorization layer, a bidirectional long-short term memory layer and a conditional random field layer;
the training process of the named entity prediction model comprises the following steps:
inputting each training text in a training set into the word vectorization layer for conversion processing to obtain a character vector corresponding to each training text; the training texts in the training set are marked historical notarization documents and notarization related legal documents;
inputting the character vectors corresponding to the training texts into the bidirectional long-short term memory layer, and training the bidirectional long-short term memory layer to obtain the label scores of the training texts;
inputting the labeling scores of the training texts into the conditional random field layer, wherein the conditional random field layer calculates the optimal solution of the labeling sequence of each training text based on the labeling scores of the training texts;
if the optimal solution of the labeling sequence of each training text is not obtained, updating model parameters; otherwise, obtaining the named entity prediction model.
8. A device for correcting errors of notarization documents based on knowledge graph, comprising:
the acquisition module is used for acquiring the unknown words of the target notarization document;
the fuzzy matching module is used for carrying out fuzzy matching on the unknown words based on each entity in the notarization field knowledge graph to obtain target entities which have entity link relations with the unknown words;
and the entity replacing module is used for replacing the unknown words of the target notarization document with the target entity.
9. An electronic device, comprising: at least one processor, at least one memory, and at least one communication bus, wherein,
the memory has a computer program stored thereon, and the processor reads the computer program in the memory through the communication bus;
the computer program, when executed by the processor, implements the notary document error correction method of any of claims 1 to 7.
10. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the notary document error correction method of any of claims 1 to 7.
CN202211355110.0A 2022-11-01 2022-11-01 Notarization document error correction method and device, electronic device and storage medium Pending CN115587800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211355110.0A CN115587800A (en) 2022-11-01 2022-11-01 Notarization document error correction method and device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211355110.0A CN115587800A (en) 2022-11-01 2022-11-01 Notarization document error correction method and device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115587800A (en) 2023-01-10

Family

ID=84781122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211355110.0A Pending CN115587800A (en) 2022-11-01 2022-11-01 Notarization document error correction method and device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115587800A (en)

Similar Documents

Publication Publication Date Title
CN110929125B (en) Search recall method, device, equipment and storage medium thereof
CN108287911B (en) Relation extraction method based on constrained remote supervision
CN110597870A (en) Enterprise relation mining method
WO2019060010A1 (en) Content pattern based automatic document classification
CN115687647A (en) Notarization document generation method and device, electronic equipment and storage medium
CN116089873A (en) Model training method, data classification and classification method, device, equipment and medium
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN110826315B (en) Method for identifying timeliness of short text by using neural network system
CN112417887A (en) Sensitive word and sentence recognition model processing method and related equipment thereof
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN111339396B (en) Method, device and computer storage medium for extracting webpage content
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN112598039B (en) Method for obtaining positive samples in NLP (natural language processing) classification field and related equipment
US11163761B2 (en) Vector embedding models for relational tables with null or equivalent values
CN116402166A (en) Training method and device of prediction model, electronic equipment and storage medium
CN115934852A (en) Tax registration address space-time clustering method, device, server and storage medium
CN115269816A (en) Core personnel mining method and device based on information processing method and storage medium
CN115587800A (en) Notarization document error correction method and device, electronic device and storage medium
CN114493853A (en) Credit rating evaluation method, credit rating evaluation device, electronic device and storage medium
Gong Analysis of internet public opinion popularity trend based on a deep neural network
CN112948561A (en) Method and device for automatically expanding question-answer knowledge base
Wen et al. Blockchain-based reviewer selection
CN112529743A (en) Contract element extraction method, contract element extraction device, electronic equipment and medium
CN116775889B (en) Threat information automatic extraction method, system, equipment and storage medium based on natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination