CN114595675A - Method and device for tracking difference content between documents and electronic equipment - Google Patents

Info

Publication number
CN114595675A
Authority
CN
China
Prior art keywords
document
elements
documents
determining
content
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210233372.3A
Other languages
Chinese (zh)
Inventor
安飞飞
李昱
张圳
李斌
谷利峰
王全礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of data processing technologies, and in particular to a method and an apparatus for tracking difference content between documents, and an electronic device. The method comprises: sorting at least two documents by document release time to obtain a document set, wherein each document is a document containing preset content; determining the elements in each document in the document set through a neural network model to obtain an element set, wherein the neural network model performs feature recognition and fitting in at least two dimensions on the characters in a document and determines the elements corresponding to the fitting features obtained through fitting; and obtaining an element relation graph based on the correspondence between the elements in the element set and the documents, wherein the element relation graph indicates the existence of any type of element in different documents. The method addresses the prior-art problem that related documents are maintained inefficiently when a large number of documents are updated.

Description

Method and device for tracking difference content between documents and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for tracking difference content between documents, and an electronic device.
Background
Every year, a large number of documents are issued in many industries, passing from central to local governments and then to the headquarters and branches of enterprises and public institutions; examples include instructions, special work arrangements, and business notifications. As new documents of this regulatory kind are issued, they are often associated with one another. For example, the release of a new regulation often requires changes to guidelines, special-topic documents, and similar documents within an enterprise or public institution. Maintaining (updating) the related documents after such changes requires a large amount of manpower and material resources; accordingly, the efficiency of training on the regulations is reduced and unnecessary training cost is incurred. The current manual maintenance approach therefore cannot meet the related business requirements. A more efficient way of maintaining these documents is needed, so that when a large number of document updates and differences occur, the maintenance efficiency of the related documents is effectively improved and the maintenance cost of the related regulations, including the training cost, is reduced.
Disclosure of Invention
The application provides a method and a device for tracking difference contents between documents and electronic equipment, which are used for solving the problem that in the prior art, when a large number of documents are updated, the maintenance efficiency of related documents is low.
In a first aspect, the present application provides a method for tracking content of differences between documents, the method including:
sequencing at least two documents according to the document release time to obtain a document set; the document is a document containing preset content;
determining elements in each document in the document set through a neural network model to obtain an element set; the neural network model is used for carrying out feature recognition and fitting of at least two dimensions on characters in a document, and determining elements corresponding to fitting features based on the fitting features obtained through fitting;
obtaining an element relation graph based on the corresponding relation between the elements in the element set and the documents; wherein the element relation graph indicates the existence of any kind of elements in different documents.
In this method, the neural network model extracts elements from documents sorted in chronological order, and the element relation graph is built from the correspondence between the elements and the documents. The changes of any type of element across the related documents can then be read clearly and intuitively from the element relation graph, so that difference content between documents is tracked efficiently. Moreover, changes in the business line where the difference content is located can be determined quickly, which effectively improves the maintenance efficiency of the related documents and reduces the maintenance cost.
One possible implementation manner, before determining, by a neural network model, an element in each document in the document set and obtaining an element set, includes:
in the neural network model, identifying character features of any dimension in a sample document;
fitting the character features of different dimensions based on the identified character features of any dimension to obtain fitting features, and determining the characters corresponding to the fitting features and the elements corresponding to the characters, until the accuracy with which the determined elements match the preset elements corresponding to the characters is greater than a set threshold;
classifying the elements in the sample document based on the element categories in an element map, and determining the element category to which each element belongs, until the accuracy with which the determined element categories match the preset element categories to which the elements belong is greater than a classification threshold.
One possible implementation manner, in the neural network model, the identifying the character features of any dimension in the document includes:
in the neural network model, converting the language segments of a first length in each sample document in a sample document set into corresponding document vectors; wherein the document vector is extracted based on character features of any dimension of the characters in the language segment;
identifying elements in the document vector based on a preset rule, and determining character features of any dimension; the preset rule comprises meanings represented by different combination modes among different elements and corresponding relations between the meanings and the character features.
One possible implementation manner, before determining, by a neural network model, an element in each document in the document set and obtaining an element set, includes:
adding a document identifier aiming at the language segment with the second length in each document; the language segment consists of at least one character, and the document identifier uniquely identifies a document corresponding to the language segment;
then, obtaining an element relation graph based on the corresponding relation between the elements in the element set and the document, including:
extracting the document identifier included in each element in each class of elements in the element set;
and establishing the corresponding relation between the elements in the element set and the document based on the corresponding relation between the document identifier and the document to obtain an element relation graph.
One possible implementation manner, obtaining an element relation graph based on correspondence between elements in the element set and documents, includes:
determining a difference feature between the at least two documents in the element relationship graph based on the existence; wherein the difference feature is comprised of at least one element;
labeling the difference characteristics in a first document of the at least two documents to obtain key information corresponding to each document; the key information comprises whether the element is a newly added element, the importance of the difference feature, the relevance of the difference feature and corresponding regulation content and semantic understanding output of the difference feature.
In a possible implementation manner, after obtaining the document set, the method further includes:
determining subsequences in the two adjacent documents in the document set respectively; wherein the sub-sequence indicates sentences in which characters are arranged in a set order in the two adjacent documents;
determining the longest subsequence in the two adjacent documents through dynamic programming;
comparing any two longest subsequences in the two adjacent documents, and determining two target longest subsequences meeting the similarity requirement; wherein the longest subsequence indicates a longest subsequence that is consistent with the set subsequence order;
determining the content with difference in the two target longest subsequences, and marking the content with difference as the difference content;
and adjusting the existence condition of any type of elements in the element relation graph based on the difference content.
In a second aspect, the present application provides an apparatus for tracking content of differences between documents, the apparatus comprising:
a sorting unit, configured to sort at least two documents by document release time to obtain a document set; the document is a document containing preset content;
a model unit, configured to determine elements in each document in the document set through a neural network model to obtain an element set; the neural network model is used for carrying out feature recognition and fitting of at least two dimensions on characters in a document, and determining elements corresponding to fitting features based on the fitting features obtained through fitting;
a generation unit, configured to obtain an element relation graph based on the correspondence between the elements in the element set and the documents; wherein the element relation graph indicates the existence of any type of element in different documents.
In a possible embodiment, the apparatus further includes a training unit, specifically configured to: identify, in the neural network model, character features of any dimension in a sample document; fit the character features of different dimensions based on the identified character features of any dimension to obtain fitting features, and determine the characters corresponding to the fitting features and the elements corresponding to the characters, until the accuracy with which the determined elements match the preset elements corresponding to the characters is greater than a set threshold; and classify the elements in the sample document based on the element categories in an element map and determine the element category to which each element belongs, until the accuracy with which the determined element categories match the preset element categories to which the elements belong is greater than a classification threshold.
In a possible implementation manner, the training unit is specifically configured to: convert, in the neural network model, the language segments of a first length in each sample document in a sample document set into corresponding document vectors, wherein the document vector is extracted based on character features of any dimension of the characters in the language segment; and identify elements in the document vector based on a preset rule and determine character features of any dimension, wherein the preset rule comprises the meanings represented by different combinations of different elements and the correspondence between those meanings and the character features.
In a possible implementation manner, the apparatus further includes an identification unit, specifically configured to add a document identifier to each language segment of a second length in each document; the language segment consists of at least one character, and the document identifier uniquely identifies the document corresponding to the language segment;
the generating unit is specifically configured to extract, in the element set, the document identifier included in each element in each class of elements; and establishing the corresponding relation between the elements in the element set and the document based on the corresponding relation between the document identifier and the document to obtain an element relation graph.
In a possible embodiment, the generating unit is further configured to determine a difference feature between the at least two documents in the element relationship graph based on the existence condition; wherein the difference feature is comprised of at least one element; labeling the difference characteristics in a first document of the at least two documents to obtain key information corresponding to each document; the key information comprises whether the element is a newly added element, the importance of the difference feature, the relevance of the difference feature and corresponding regulation content and semantic understanding output of the difference feature.
In a possible embodiment, the apparatus further includes an optimization unit, specifically configured to determine sub-sequences in the two adjacent documents in the document set respectively; wherein the sub-sequence indicates sentences in which characters are arranged in a set order in the two adjacent documents; determining the longest subsequence in the two adjacent documents through dynamic programming; comparing any two longest subsequences in the two adjacent documents, and determining two target longest subsequences meeting the similarity requirement; wherein the longest subsequence indicates a longest subsequence that is consistent with the set subsequence order; determining the content with difference in the two target longest subsequences, and marking the content with difference as the difference content; and adjusting the existence condition of any type of elements in the element relation graph based on the difference content.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing a computer program;
a processor configured to execute the computer program stored in the memory to implement the method according to the first aspect and any of the possible embodiments.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect and any one of the possible embodiments.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method according to the first aspect and any one of the possible embodiments.
Drawings
FIG. 1 is a flowchart of a method for tracking content differences between documents according to the present application;
FIG. 2 is a schematic representation of a portion of an element map provided herein;
FIG. 3 is a schematic diagram of inter-document difference content tracking via a neural network model provided herein;
FIG. 4 is a schematic structural diagram of an apparatus for tracking content of differences between documents according to the present application;
FIG. 5 is a schematic structural diagram of an electronic device for tracking content of differences between documents according to the present application.
Detailed Description
To address the prior-art problem that related documents are maintained inefficiently when a large number of documents are updated, the embodiments of the present application provide a method for tracking difference content between documents, which comprises: sorting all the documents; inputting the sorted documents into a neural network model, which identifies and outputs the elements in the documents; and then establishing the correspondence between the elements and the documents to form an element relation graph. Content that has been deleted from, added to, or modified in a document can be determined clearly and intuitively from the element relation graph, so that document maintenance efficiency is effectively improved and the maintenance cost of the related regulations, including the training cost, is reduced.
It should be noted that, in the technical solution of the present application, the acquisition, storage, use, processing, etc. of data all conform to the relevant regulations of the national laws and regulations.
In order to better understand the technical solutions of the present application, the following detailed descriptions of the technical solutions of the present application are provided with the accompanying drawings and the specific embodiments, and it should be understood that the specific features of the embodiments and the examples of the present application are detailed descriptions of the technical solutions of the present application, and are not limitations of the technical solutions of the present application, and the technical features of the embodiments and the examples of the present application may be combined with each other without conflict.
Referring to FIG. 1, an embodiment of the present application provides a method for tracking difference content between documents, so as to improve document maintenance efficiency. The method specifically includes the following steps:
Step 101: Sort at least two documents by document release time to obtain a document set.
The document is a document containing preset content.
Specifically, the release of a new regulation often has an all-round influence: it affects not only the content of the previous version of that regulation, but also content in related business documents. Particularly in the financial field, a regulatory adjustment issued by a superior regulatory agency may lead to regulatory changes across multiple business lines. In the embodiment of the application, all documents are sorted according to a certain rule, here the document release time, to obtain a document set. The related content of the documents in the document set can thereby be placed in order on a timeline.
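The sorting in step 101 can be illustrated with a minimal Python sketch. The `Document` record, its field names, and the sample identifiers and dates are hypothetical and are not taken from the application; the sketch only shows ordering by release time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    doc_id: str      # hypothetical unique identifier of the document
    released: date   # document release time
    text: str        # document body (placeholder here)


def build_document_set(documents):
    """Return the documents ordered by release time, oldest first."""
    if len(documents) < 2:
        raise ValueError("at least two documents are required")
    # sorted() is stable: documents released on the same day keep input order.
    return sorted(documents, key=lambda d: d.released)


docs = [
    Document("B_v1", date(2021, 6, 1), "..."),
    Document("A_v1", date(2020, 1, 15), "..."),
    Document("A_v2", date(2021, 3, 10), "..."),
]
document_set = build_document_set(docs)
ordered_ids = [d.doc_id for d in document_set]
```

With the sample data above, `ordered_ids` lists the documents from the earliest to the latest release.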
It should be noted that the scheme provided by the embodiment of the present application can be applied to scenarios in which regulation documents ("system documents") need to be maintained. Here, a system refers to the general rules or codes of conduct that govern individual actions. A system document refers to a document that has passed quality-system certification and relates to content such as enterprise technology, finance, management, safety, personnel, and archives; such documents have file names, file numbers, company official seals, and the like. Generally, system documents overlap with one another in business, specifications, and related work content.
Step 102: Determine the elements in each document in the document set through a neural network model to obtain an element set.
The neural network model is used for carrying out feature recognition and fitting of at least two dimensions on characters in a document, and determining elements corresponding to fitting features based on the fitting features obtained through fitting.
Specifically, before using the neural network model, the neural network model needs to be trained to output correct content. The following is a detailed description of the training of the neural network model:
Generally, the neural network model may include an input layer, a hidden layer, and an output layer. The input layer receives samples of two-dimensional visual patterns; that is, it reads the characters of the sample documents in a sample document set into the neural network model. After reading, the data enter the hidden layer, which identifies character features of any dimension of the characters. Because documents differ in length, the neural network model first cuts each sample document in the sample document set into language segments of unit length, namely language segments of the first length, and converts all language segments of the first length into corresponding document vectors. A document vector is extracted based on character features of any dimension of the characters in a first-length language segment. Character features of any dimension can be understood as character features extracted from different aspects of a character; for example, a character feature may be the structure of each character in the first-length language segment, or the part of speech of each character in the first-length language segment. Then, the elements in the document vector are identified based on preset rules, and the character features of any dimension are determined. The preset rules may include the meanings represented by different combinations of different elements, and the correspondence between those meanings and the character features.
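Cutting documents of different lengths into segments of a fixed first length can be sketched as follows. This is only an illustration of the segmentation step; the function name and the choice to keep a shorter final segment are assumptions, and the conversion of segments into document vectors is not shown.

```python
def split_into_segments(text, segment_length):
    """Cut text into consecutive segments of at most segment_length characters."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    # The last segment may be shorter when the text length is not a multiple
    # of segment_length (an assumption; padding would be another option).
    return [
        text[start:start + segment_length]
        for start in range(0, len(text), segment_length)
    ]


segments = split_into_segments("abcdefghij", 4)
```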
Further, after multidimensional character feature recognition has been performed on each language segment of the first length in the document, the character features of different dimensions can be fitted to obtain fitting features, and the characters corresponding to the fitting features and the elements corresponding to those characters are determined. If the elements determined by the neural network model are the preset elements corresponding to the respective characters in the sample document with an accuracy higher than a set threshold, the feature fitting function of the neural network model is deemed to meet the use requirements. Specifically:
The feature vectors of each document may be determined by first fitting the document vectors that contain character features of different dimensions of the same character. There may be a plurality of feature vectors, each indicating the character features of a different first-length language segment in a sample document. Then, the first element corresponding to the vector elements of each feature vector can be determined according to the mapping relationship between character features and elements. It should be understood that this correspondence is not limited to one-to-one: several vector elements, contiguous or not, may combine to correspond to one first element. After the first elements are obtained, each is compared with its corresponding preset element to determine whether the two are the same, which yields the accuracy of the feature fitting function of the neural network model. Finally, the accuracy is compared with a set threshold. When the accuracy is less than or equal to the set threshold, the recognition and fitting function is deemed not to meet the use requirements, and the corresponding parameters in the hidden layer are adjusted further until the accuracy is greater than the set threshold.
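The accuracy check described above can be sketched in Python. The function names and the sample labels are hypothetical; the sketch covers only the comparison of determined elements with preset elements and the strict "greater than" test against the threshold, not the training of the model itself.

```python
def element_accuracy(predicted, expected):
    """Fraction of predicted elements equal to the corresponding preset elements."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def meets_requirement(predicted, expected, threshold):
    """True only when the accuracy is strictly greater than the set threshold."""
    return element_accuracy(predicted, expected) > threshold


predicted = ["loan", "exchange", "plan", "audit"]    # hypothetical model output
expected = ["loan", "exchange", "policy", "audit"]   # hypothetical preset elements
accuracy = element_accuracy(predicted, expected)
```

Because the text requires the accuracy to exceed the threshold, an accuracy exactly equal to the threshold still counts as not meeting the requirement in this sketch.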
When the accuracy of the feature fitting function of the neural network model is greater than the set threshold, training of the recognition and fitting function ends and training of the classification function begins. The classification function may be performed by a classifier. During execution, the elements in the sample document obtained by the recognition and fitting function are classified based on the element categories in the element map, and the element category to which each element belongs is determined, until the accuracy with which the determined element categories match the preset element categories is greater than the classification threshold. The element map and its element categories are described below:
The element map described in the embodiments of the present application may be a structured knowledge graph that includes element categories. The knowledge graph clearly describes the concepts of all elements in the system documents and their interrelationships. An interrelationship here may be an inclusion relationship, that is, the relationship between an element and an element category. The element categories may be specific business types, job roles, systems, and so on. For example, if a document includes the elements pound to renminbi (RMB) and dollar to RMB, both elements can be categorized as foreign exchange. FIG. 2 is a schematic diagram of part of an element map. As shown in FIG. 2, the element map includes relationships between elements, between element categories, and between element categories and elements. The element categories may be foreign currency exchange business, system regulation, and work. An element may be a business, a technical specification, a work specification, a plan, a business specification, and so on.
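The inclusion relationship between elements and element categories can be illustrated with a small lookup structure. This is a sketch under assumptions: the dictionary layout, the function name, and the assignment of the sample elements to categories are hypothetical, and a real element map would be a richer knowledge graph than a flat dictionary.

```python
# Hypothetical fragment of an element map: category -> elements it contains.
element_map = {
    "foreign currency exchange business": ["pound exchange", "dollar exchange"],
    "system regulation": ["technical specification", "business specification"],
    "work": ["work specification", "plan"],
}


def build_category_index(element_map):
    """Invert the map so that each element points to its category."""
    index = {}
    for category, elements in element_map.items():
        for element in elements:
            index[element] = category
    return index


category_of = build_category_index(element_map)
pound_category = category_of["pound exchange"]
unknown_category = category_of.get("unlisted element")  # None when not in the map
```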
After the elements and the element types to which the elements belong are determined, the element types can be output through an output layer of the neural network model, and the output result can be the element types in all sample documents in the sample document set or the element types and the elements included in the element types.
Further, after the training of the neural network model is completed, the extraction of the element set can be performed on the document described in step 101 through the neural network model. Specifically, the document set may be input into the neural network model, and the element included in each document in the document set and the element category corresponding to any type of element are determined in the neural network model and output, so that the element set is obtained. Then, the element set may include at least one element category in which at least one element may be included.
It should be noted that the neural network model described in the embodiments of the present application mainly refers to a convolutional neural network model. The method described in the embodiment of the present application is also applicable to other deep learning models, and details are not repeated here.
Step 103: Obtain an element relation graph based on the correspondence between the elements in the element set and the documents.
The element relation graph indicates the existence of any type of element in different documents.
Specifically, first, the document identifier included in each element in each class of elements may be extracted from the element set. And then, establishing the corresponding relation between the elements in the element set and the document based on the corresponding relation between the document identifier and the document to obtain an element relation graph.
It is noted that the document identifier is added before the collection of documents is entered into the neural network. That is, a document identifier is added for a second-length corpus in each document. The language segment is composed of at least one character, and the document identifier uniquely identifies the document corresponding to the language segment.
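One way to picture the element relation graph is as a presence table keyed by element, with one entry per document. The sketch below assumes that each extracted element arrives as a `(category, element, document identifier)` tuple; that layout, the function name, and the sample identifiers are hypothetical.

```python
def build_relation_graph(tagged_elements, document_order):
    """Map each (category, element) pair to its presence in every document."""
    known = set(document_order)
    graph = {}
    for category, element, doc_id in tagged_elements:
        if doc_id not in known:
            raise KeyError(f"unknown document identifier: {doc_id}")
        # Start every element as absent from all documents, then mark hits.
        row = graph.setdefault(
            (category, element), {d: False for d in document_order}
        )
        row[doc_id] = True
    return graph


document_order = ["A_v1", "A_v2"]  # documents already sorted by release time
tagged_elements = [
    ("foreign exchange", "pound exchange", "A_v1"),
    ("foreign exchange", "dollar exchange", "A_v1"),
    ("foreign exchange", "dollar exchange", "A_v2"),
]
graph = build_relation_graph(tagged_elements, document_order)
```

In this sample, "pound exchange" is present only in the earlier document, while "dollar exchange" is present in both.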
Further, in step 104, the difference between any two documents in the document set can be determined based on the element relation graph. Specifically, first, a difference feature between at least two documents is determined based on the existence of any type of element in the element relation graph. The difference feature may be composed of at least one element, and the elements may correspond to the same element category or to different element categories. Then, the difference feature is labeled in a first document of the at least two documents, so that key information corresponding to each document can be obtained. The key information may include whether the element is a newly added element, the importance of the difference feature, the relevance between the difference feature and the corresponding regulation content, and the semantic understanding output of the difference feature.
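Reading difference features off a presence table can be sketched as below. The graph literal, the function name, and the "added"/"removed" labels are hypothetical; the sketch covers only whether an element is newly added or no longer present, not the importance, relevance, or semantic output mentioned in the text.

```python
def difference_features(graph, old_doc, new_doc):
    """List elements present in only one of the two documents."""
    added, removed = [], []
    for key, presence in graph.items():
        in_old = presence.get(old_doc, False)
        in_new = presence.get(new_doc, False)
        if in_new and not in_old:
            added.append(key)      # newly added element
        elif in_old and not in_new:
            removed.append(key)    # element no longer present
    return {"added": sorted(added), "removed": sorted(removed)}


# Hypothetical presence table: element -> {document identifier: present?}
graph = {
    "pound exchange": {"A_v1": True, "A_v2": False},
    "dollar exchange": {"A_v1": True, "A_v2": True},
    "euro exchange": {"A_v1": False, "A_v2": True},
}
diff = difference_features(graph, "A_v1", "A_v2")
```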
Further, in order to improve the accuracy of the existence of any type of elements in the element relationship diagram, the existence of the elements in the element relationship diagram may be checked based on the document set determined in step 101. Specifically, the method comprises the following steps:
The difference content may be determined using an algorithm based on the longest subsequence or an algorithm based on the edit distance.
The algorithm based on the edit distance determines the difference content between two character strings by counting the number of edits required to convert one string into the other.
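Counting the edits needed to turn one string into another is commonly done with the Levenshtein distance. The sketch below uses that standard algorithm as an assumption; the application does not name a specific edit-distance variant.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


distance = edit_distance("kitten", "sitting")
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.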
And the algorithm based on the longest subsequence may be: first, subsequences in two adjacent documents are determined in a document set respectively. And the subsequence indicates sentences in which the characters are arranged according to a set sequence in the two adjacent documents. Then, the longest subsequence in the two adjacent documents is determined through dynamic programming. Then, in the two adjacent documents, comparing any two longest subsequences, and determining two target longest subsequences meeting the similarity requirement. Wherein the longest subsequence indicates the longest subsequence that is consistent with the set subsequence order. Finally, the content with difference in the two target longest subsequences can be determined, and the content with difference is marked as the difference content. The difference content is then compared with the presence of an element in the element relationship graph. That is, based on the difference content, the existence of any type of element in the element relation diagram is adjusted.
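The dynamic-programming step can be illustrated with the textbook longest-common-subsequence formulation. This is an assumption: the application does not give the recurrence, and its procedure of selecting target subsequences that meet a similarity requirement is not reproduced here. The function names and sample strings are hypothetical.

```python
def lcs_table(a: str, b: str) -> list[list[int]]:
    """table[i][j] holds the LCS length of the suffixes a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_by_lcs(a: str, b: str) -> tuple[str, str, str]:
    """Return (common subsequence, characters only in a, characters only in b)."""
    table = lcs_table(a, b)
    i = j = 0
    common, only_a, only_b = [], [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            only_a.append(a[i])  # skipping a[i] keeps the LCS length
            i += 1
        else:
            only_b.append(b[j])  # skipping b[j] keeps the LCS length
            j += 1
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return "".join(common), "".join(only_a), "".join(only_b)


# Hypothetical sentences from two adjacent document versions.
common, removed, added = diff_by_lcs("limit is 5000", "limit is 8000")
```

The characters left outside the common subsequence (`removed` and `added`) play the role of the difference content in this simplified setting.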
To determine the difference content as described above, either the longest-subsequence-based algorithm or the edit-distance-based algorithm may be combined, on its own, with NLP (Natural Language Processing) technology. Because the NLP technology is applied only to the difference content, the high cost and low efficiency of manual proofreading are avoided, the difference content between at least two documents in the document set is determined efficiently, the related system training flow is simplified, and system training efficiency is improved.
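The edit-distance approach described above, which counts the edits needed to convert one string into another, can be sketched with the classic Levenshtein distance. Choosing Levenshtein (insertions, deletions, and substitutions each costing 1) is an assumption; the application does not say which edit operations or costs are used.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(target) + 1))  # distance from "" to each target prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to ""
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete s_char
                    current[j - 1] + 1,                   # insert t_char
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical clause taken from two versions of the same document.
changes = edit_distance("limit is 5000", "limit is 8000")
```

A distance of zero means the two strings are identical; larger values indicate more edited content.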
Referring to fig. 3, fig. 3 is a schematic diagram of a neural network model used to track difference content between documents according to an embodiment of the present application.
First, all documents are sorted by publishing time to obtain a document set. As shown in part (a) of fig. 3, the document order in the document set is: document one_v1, document one_v2, document two_v1, document two_v2, and document three_v1. Document one, document two, and document three represent different types of institutional documents whose content is associated, and the v1 and v2 following the document names indicate the version numbers of the documents. For example, document one_v2 is an updated version of document one_v1.
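The sorting step can be sketched as below. The `Document` type, the `build_document_set` name, and the publishing dates are hypothetical values chosen only so that the result reproduces the order given for part (a) of fig. 3.

```python
from datetime import date
from typing import NamedTuple


class Document(NamedTuple):
    name: str
    published: date
    text: str


def build_document_set(documents: list[Document]) -> list[Document]:
    """Order documents by publishing time, earliest first."""
    if len(documents) < 2:
        raise ValueError("at least two documents are required")
    return sorted(documents, key=lambda doc: doc.published)


# Hypothetical publishing dates.
unordered = [
    Document("two_v2", date(2021, 9, 1), "..."),
    Document("one_v1", date(2020, 1, 15), "..."),
    Document("three_v1", date(2022, 2, 1), "..."),
    Document("one_v2", date(2020, 8, 3), "..."),
    Document("two_v1", date(2021, 3, 20), "..."),
]
document_set = build_document_set(unordered)
ordered_names = [doc.name for doc in document_set]
```

Python's `sorted` is stable, so documents that share a publishing date keep their input order.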
Then, after the document set is determined, it may be input into the neural network model. In the neural network model, the input layer reads two-dimensional visual pattern samples; that is, the input layer vectorizes all characters in the document set according to the model translation rule to obtain a plurality of document vectors. All document vectors then enter the hidden layers, which begin feature recognition for each element in the document vectors of the document set. To improve the accuracy of character recognition, a neural network model is generally provided with at least two hidden layers, which extract features of different dimensions (angles) from elements representing the same character. In the embodiment of the present application, the number of hidden layers is set to 4; as shown in part (b) of fig. 3, h1 to h4 each represent one hidden layer that recognizes character features in a different dimension. After features of different dimensions have been extracted from the document characters in the document set, the character features can be fitted in the output layer, and the elements corresponding to the resulting fitted features, together with the element categories to which those elements belong, can be determined, as shown in part (c) of fig. 3. In part (c), each style of square represents one class of elements obtained by fitting the character features produced by h1 to h4, and a class of elements may include elements of the same category but with different content. It should be noted that each element in the element categories obtained in part (c) carries a document identifier; that is, as shown in part (d), each element category includes identifiers of the document names in part (a).
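To make the layer structure concrete, the following is a toy forward pass with an input vectorization step, four hidden layers (standing in for h1 to h4), and an output layer that scores element categories. Everything specific here is an assumption: the bucketed character-count vectorization, the layer widths, the tanh and softmax functions, the random untrained weights, and the choice of three categories are illustrative only and are not the model described in the application.

```python
import math
import random


def vectorize(text: str, size: int = 8) -> list[float]:
    """Toy vectorization: normalized counts of characters hashed into buckets."""
    vector = [0.0] * size
    for ch in text:
        vector[ord(ch) % size] += 1.0
    total = sum(vector) or 1.0
    return [value / total for value in vector]


def make_layer(rng: random.Random, n_in: int, n_out: int) -> list[list[float]]:
    """Random weight matrix with n_out rows of n_in weights (untrained)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, inputs, activation):
    """Apply one fully connected layer followed by an activation function."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def softmax(values: list[float]) -> list[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(text, hidden_layers, output_layer) -> list[float]:
    activations = vectorize(text)            # input layer
    for weights in hidden_layers:            # hidden layers h1..h4
        activations = dense(weights, activations, math.tanh)
    logits = dense(output_layer, activations, lambda v: v)
    return softmax(logits)                   # scores per element category


rng = random.Random(0)
sizes = [8, 6, 6, 6, 6]
hidden_layers = [make_layer(rng, sizes[k], sizes[k + 1]) for k in range(4)]
output_layer = make_layer(rng, 6, 3)  # 3 hypothetical element categories
probabilities = forward("Loan approval requires two signatures.", hidden_layers, output_layer)
```

With untrained weights the scores are meaningless; the sketch only shows how data would flow from character vectors through four hidden layers to per-category outputs.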
Further, after the elements in all documents in the document set and the element categories to which they belong are determined, the elements and element categories can be output by the neural network model. The output also includes the document identifier carried by each class of elements, as shown in part (d). The element relation graph can thus be established from the document identifiers carried by each class of elements. The element relation graph includes each class of elements involved in the content of the document set, the different element contents within each class, and the presence of each class of elements in each document. Part (e) of fig. 3 shows a partial element relation graph, in which 1 indicates that the corresponding document contains elements of the corresponding class and 0 indicates that it does not.
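The 1/0 presence table of part (e) can be sketched as a nested dictionary built from elements that carry document identifiers. The function name `build_relation_graph`, the `(category, document identifier)` input format, and the sample values are assumptions for illustration; the element contents that the graph also records are omitted for brevity.

```python
def build_relation_graph(
    tagged_elements: list[tuple[str, str]], document_ids: list[str]
) -> dict[str, dict[str, int]]:
    """Map each element category to a {document identifier: 0 or 1} row."""
    graph: dict[str, dict[str, int]] = {}
    for category, doc_id in tagged_elements:
        if doc_id not in document_ids:
            raise KeyError(f"unknown document identifier: {doc_id}")
        # Start every new category with 0 (absent) for all documents.
        row = graph.setdefault(category, {doc: 0 for doc in document_ids})
        row[doc_id] = 1  # the category appears in this document
    return graph


# Hypothetical model output: (element category, document identifier) pairs.
document_ids = ["doc1_v1", "doc1_v2", "doc2_v1"]
tagged_elements = [
    ("approval_limit", "doc1_v1"),
    ("approval_limit", "doc1_v2"),
    ("signatory", "doc1_v2"),
    ("signatory", "doc2_v1"),
]
graph = build_relation_graph(tagged_elements, document_ids)
```

Reading a row left to right shows in which documents, ordered by publishing time, an element class is present.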
From the element relation graph obtained by this method, the service line represented by each class of elements, and the changes to that service line across all (institutional) documents, can be obtained intuitively and efficiently. This achieves the purpose of efficiently tracking the difference content between documents and allows the changes to the service line in which the difference content is located to be determined quickly and effectively.
Based on the same inventive concept, an embodiment of the present application provides an apparatus for tracking difference content between documents. The apparatus corresponds to the method for tracking difference content between documents shown in fig. 1; for its specific implementation, reference may be made to the description of the foregoing method embodiment, and repeated descriptions are omitted. Referring to fig. 4, the apparatus includes:
The sorting unit 401 is configured to sort at least two documents according to document publishing time to obtain a document set.
Each document is a document containing preset content.
The model unit 402 is configured to determine the elements in each document in the document set through a neural network model to obtain an element set. The neural network model is used to perform feature recognition and fitting in at least two dimensions on the characters in a document, and to determine the elements corresponding to the fitted features obtained through fitting.
The apparatus for tracking difference content between documents further includes a training unit. The training unit is specifically configured to: identify, in the neural network model, character features of any dimension in a sample document; fit the character features of different dimensions based on the identified character features to obtain fitted features, and determine the characters corresponding to the fitted features and the elements corresponding to the characters, until the accuracy with which the determined elements match the preset elements corresponding to the characters is greater than a set threshold; and classify the elements in the sample document based on the element categories in an element map and determine the element category to which each element belongs, until the accuracy with which the determined element categories match the preset element categories to which the elements belong is greater than a classification threshold.
The training unit is specifically configured to: convert, in the neural network model, a language segment of a first length in each sample document in a sample document set into a corresponding document vector, where the document vector is extracted based on character features of any dimension of the characters in the language segment; and identify the elements in the document vector based on a preset rule to determine character features of any dimension, where the preset rule includes the meanings represented by different combinations of different elements and the correspondence between those meanings and the character features.
The generation unit 403 is configured to obtain an element relation graph based on the correspondence between the elements in the element set and the documents, where the element relation graph indicates the presence of any class of elements in different documents.
The apparatus for tracking difference content between documents further includes an identification unit, specifically configured to add a document identifier to each language segment of a second length in each document, where a language segment consists of at least one character and the document identifier uniquely identifies the document to which the language segment belongs.
The generation unit 403 is specifically configured to extract, from the element set, the document identifier carried by each element in each class of elements, and to establish the correspondence between the elements in the element set and the documents based on the correspondence between document identifiers and documents, thereby obtaining the element relation graph.
The generation unit 403 is further configured to: determine, based on the presence recorded in the element relation graph, a difference feature between the at least two documents, where the difference feature consists of at least one element; and label the difference feature in a first document of the at least two documents to obtain key information corresponding to each document, where the key information includes whether an element is newly added, the importance of the difference feature, the relevance between the difference feature and the corresponding regulation content, and the semantic-understanding output for the difference feature.
The apparatus for tracking difference content between documents further includes an optimization unit, specifically configured to: determine subsequences in each of two adjacent documents in the document set, where a subsequence indicates a sentence whose characters are arranged in a set order in the two adjacent documents; determine the longest subsequences in the two adjacent documents through dynamic programming, where a longest subsequence is the longest subsequence consistent with the set subsequence order; compare any two longest subsequences in the two adjacent documents and determine two target longest subsequences that meet the similarity requirement; determine the content that differs between the two target longest subsequences and mark it as the difference content; and adjust the presence recorded for any class of elements in the element relation graph based on the difference content.
Based on the same inventive concept as the method for tracking difference content between documents, an embodiment of the present application further provides an electronic device capable of implementing the functions of that method. Referring to fig. 5, the electronic device includes:
at least one processor 501 and a memory 502 connected to the at least one processor 501. The embodiment of the present application does not limit the specific connection medium between the processor 501 and the memory 502; fig. 5 takes the case where the processor 501 and the memory 502 are connected through a bus 500 as an example. The bus 500 is shown in fig. 5 by a thick line; the connection manner between other components is merely illustrative and is not limiting. The bus 500 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration it is shown as a single thick line in fig. 5, but this does not mean that there is only one bus or one type of bus. Alternatively, the processor 501 may also be referred to as a controller; the name is not limited.
In the embodiment of the present application, the memory 502 stores instructions executable by the at least one processor 501, and the at least one processor 501 can execute the method for tracking the content of the differences between the documents discussed above by executing the instructions stored in the memory 502. The processor 501 may implement the functions of the various modules in the apparatus shown in fig. 4.
The processor 501 is a control center of the apparatus, and may connect various parts of the entire control device by using various interfaces and lines, and perform various functions and process data of the apparatus by operating or executing instructions stored in the memory 502 and calling data stored in the memory 502, thereby performing overall monitoring of the apparatus.
In one possible design, processor 501 may include one or more processing units and processor 501 may integrate an application processor that handles primarily operating systems, user interfaces, application programs, and the like, and a modem processor that handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501. In some embodiments, processor 501 and memory 502 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The processor 501 may be a general-purpose processor such as a Central Processing Unit (CPU), or a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method for tracking difference content between documents disclosed in the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 502, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 502 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 502 may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 502 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
By programming the processor 501, the code corresponding to the method for tracking difference content between documents described in the foregoing embodiments can be solidified into the chip, so that the chip, when running, can execute the steps of the method for tracking difference content between documents shown in fig. 1. How to program the processor 501 is well known to those skilled in the art and is not described in detail here.
Based on the same inventive concept, the present application further provides a storage medium storing computer instructions, which when executed on a computer, cause the computer to perform the method for tracking the content of the differences between the documents, discussed above.
In some possible embodiments, the aspects of the method for tracking inter-document difference content provided by the present application may also be implemented in the form of a program product, which includes program code for causing the control apparatus to perform the steps in the method for tracking inter-document difference content according to various exemplary embodiments of the present application described above in this specification, when the program product is run on a device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the method for tracking the difference content between the documents provided in the embodiment of the invention can adopt a portable compact disc read only memory (CD-ROM) and comprises program codes, and can run on a computing device. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided so as to be embodied by a plurality of units.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (15)

1. A method for tracking differential content between documents, the method comprising:
sequencing at least two documents according to the document release time to obtain a document set; the document is a document containing preset content;
determining elements in each document in the document set through a neural network model to obtain an element set; the neural network model is used for carrying out feature recognition and fitting of at least two dimensions on characters in a document, and determining elements corresponding to fitting features based on the fitting features obtained through fitting;
obtaining an element relation graph based on the corresponding relation between the elements in the element set and the documents; wherein the element relation graph indicates the existence of any kind of elements in different documents.
2. The method of claim 1, wherein before determining the elements in each document in the document set through a neural network model to obtain the element set, the method further comprises:
in the neural network model, identifying character features of any dimension in a sample document;
fitting the character features of different dimensions based on the identified character features of any dimension to obtain fitted features, and determining the characters corresponding to the fitted features and the elements corresponding to the characters, until the accuracy with which the determined elements match the preset elements corresponding to the characters is greater than a set threshold;
classifying the elements in the sample document based on element categories in an element map, and determining the element category to which each element belongs, until the accuracy with which the determined element categories match the preset element categories to which the elements belong is greater than a classification threshold.
3. The method of claim 2, wherein said identifying, in the neural network model, character features of any dimension in a sample document comprises:
in the neural network model, converting a language segment of a first length in each sample document in a sample document set into a corresponding document vector; wherein the document vector is extracted based on character features of any dimension of the characters in the language segment;
identifying elements in the document vector based on a preset rule, and determining character features of any dimension; the preset rule comprises meanings expressed by different combination modes among different elements and corresponding relations between the meanings and the character features.
4. The method of any one of claims 1 to 3, wherein before determining the elements in each document in the document set through a neural network model to obtain the element set, the method further comprises:
adding a document identifier to each language segment of a second length in each document; wherein a language segment consists of at least one character, and the document identifier uniquely identifies the document to which the language segment belongs;
and wherein obtaining an element relation graph based on the correspondence between the elements in the element set and the documents comprises:
extracting the document identifier included in each element in each class of elements in the element set;
and establishing the corresponding relation between the elements in the element set and the document based on the corresponding relation between the document identifier and the document to obtain an element relation graph.
5. The method of claim 4, wherein obtaining an element relation graph based on correspondence between elements in the element set and documents comprises:
determining a difference feature between the at least two documents based on the presence recorded in the element relation graph; wherein the difference feature consists of at least one element;
labeling the difference characteristics in a first document of the at least two documents to obtain key information corresponding to each document; the key information comprises whether the element is a newly added element, the importance of the difference feature, the relevance of the difference feature and corresponding regulation content and semantic understanding output of the difference feature.
6. The method of claim 5, wherein after obtaining the document set, the method further comprises:
determining subsequences in each of two adjacent documents in the document set; wherein a subsequence indicates a sentence whose characters are arranged in a set order in the two adjacent documents;
determining the longest subsequence in the two adjacent documents through dynamic programming;
comparing any two longest subsequences in the two adjacent documents, and determining two target longest subsequences meeting the similarity requirement; wherein the longest subsequence indicates a longest subsequence that is consistent with the set subsequence order;
determining the content with difference in the two target longest subsequences, and marking the content with difference as the difference content;
and adjusting the existence condition of any type of elements in the element relation graph based on the difference content.
7. An apparatus for tracking content of differences between documents, the apparatus comprising:
a sorting unit, configured to sort at least two documents according to document release time to obtain a document set; wherein each document is a document containing preset content;
a model unit, configured to determine elements in each document in the document set through a neural network model to obtain an element set; wherein the neural network model is used to perform feature recognition and fitting in at least two dimensions on the characters in a document, and to determine the elements corresponding to the fitted features obtained through fitting;
a generation unit, configured to obtain an element relation graph based on the correspondence between the elements in the element set and the documents; wherein the element relation graph indicates the presence of any class of elements in different documents.
8. The apparatus according to claim 7, wherein the apparatus further comprises a training unit, specifically configured to: identify, in the neural network model, character features of any dimension in a sample document; fit the character features of different dimensions based on the identified character features of any dimension to obtain fitted features, and determine the characters corresponding to the fitted features and the elements corresponding to the characters, until the accuracy with which the determined elements match the preset elements corresponding to the characters is greater than a set threshold; and classify the elements in the sample document based on element categories in an element map and determine the element category to which each element belongs, until the accuracy with which the determined element categories match the preset element categories to which the elements belong is greater than a classification threshold.
9. The apparatus according to claim 8, wherein the training unit is specifically configured to: convert, in the neural network model, a language segment of a first length in each sample document in a sample document set into a corresponding document vector, wherein the document vector is extracted based on character features of any dimension of the characters in the language segment; and identify the elements in the document vector based on a preset rule to determine character features of any dimension, wherein the preset rule comprises the meanings represented by different combinations of different elements and the correspondence between those meanings and the character features.
10. The apparatus according to any one of claims 7 to 9, further comprising an identification unit, specifically configured to add a document identifier to each language segment of a second length in each document; wherein a language segment consists of at least one character, and the document identifier uniquely identifies the document to which the language segment belongs;
the generating unit is specifically configured to extract, in the element set, the document identifier included in each element in each class of elements; and establishing the corresponding relation between the elements in the element set and the document based on the corresponding relation between the document identifier and the document to obtain an element relation graph.
11. The apparatus according to claim 10, wherein the generating unit is further configured to: determine, in the element relationship graph, a difference feature between the at least two documents based on the existence condition, wherein the difference feature consists of at least one element; and label the difference feature in a first document of the at least two documents to obtain key information corresponding to each document, wherein the key information comprises whether the element is a newly added element, the importance of the difference feature, the relevance of the difference feature to the corresponding regulation content, and the semantic understanding output of the difference feature.
12. The apparatus according to claim 11, further comprising an optimization unit specifically configured to: determine, in the document set, the subsequences in each of two adjacent documents, wherein a subsequence indicates a sentence in which characters are arranged in a set order in the two adjacent documents; determine the longest subsequences in the two adjacent documents through dynamic programming; compare any two longest subsequences in the two adjacent documents and determine two target longest subsequences that meet a similarity requirement, wherein a longest subsequence is the longest subsequence consistent with the set subsequence order; determine the content that differs between the two target longest subsequences and mark it as the difference content; and adjust the existence condition of any class of elements in the element relationship graph based on the difference content.
13. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the method of any one of claims 1 to 6 when executing the computer program stored on the memory.
14. A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the method according to any one of claims 1 to 6.
15. A computer program product, which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 6.
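As a non-limiting illustration of the dynamic-programming step recited in claim 12, the following minimal sketch compares two sentences taken from adjacent documents, recovers their longest common subsequence, and reports the characters that differ. The function names (`lcs_table`, `diff_by_lcs`, `similarity`), the character-level comparison, and the ratio used as a similarity measure are assumptions made for illustration only; the claims do not specify them.

```python
from typing import List, Tuple


def lcs_table(a: str, b: str) -> List[List[int]]:
    """Fill the dynamic-programming table of LCS lengths for all prefixes of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def diff_by_lcs(a: str, b: str) -> Tuple[str, str, str]:
    """Return (common subsequence, characters only in a, characters only in b)."""
    table = lcs_table(a, b)
    i, j = len(a), len(b)
    common: List[str] = []
    only_a: List[str] = []
    only_b: List[str] = []
    # Walk back from the bottom-right corner; results are collected in reverse.
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            common.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            only_a.append(a[i - 1])
            i -= 1
        else:
            only_b.append(b[j - 1])
            j -= 1
    # Whatever prefix is left over belongs to one sentence only.
    only_a.extend(reversed(a[:i]))
    only_b.extend(reversed(b[:j]))
    return (
        "".join(reversed(common)),
        "".join(reversed(only_a)),
        "".join(reversed(only_b)),
    )


def similarity(a: str, b: str) -> float:
    """Hypothetical similarity measure: LCS length over the longer sentence length."""
    if not a and not b:
        return 1.0
    return len(diff_by_lcs(a, b)[0]) / max(len(a), len(b))


if __name__ == "__main__":
    old = "limit is 5000 yuan"
    new = "limit is 8000 yuan"
    print(diff_by_lcs(old, new))  # ('limit is 000 yuan', '5', '8')
    print(similarity(old, new))
```

In this sketch, a pair of sentences whose `similarity` exceeds a chosen threshold would be treated as the two target longest subsequences, and the second and third return values of `diff_by_lcs` would be marked as the difference content.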

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210233372.3A CN114595675A (en) 2022-03-10 2022-03-10 Method and device for tracking difference content between documents and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210233372.3A CN114595675A (en) 2022-03-10 2022-03-10 Method and device for tracking difference content between documents and electronic equipment

Publications (1)

Publication Number Publication Date
CN114595675A (en) 2022-06-07

Family

ID=81816677

Family Applications (1)

Application Number Priority Date Filing Date Title
CN202210233372.3A Pending CN114595675A (en) 2022-03-10 2022-03-10 Method and device for tracking difference content between documents and electronic equipment

Country Status (1)

Country Link
CN (1) CN114595675A (en)

Similar Documents

Publication Title
US11348352B2 (en) Contract lifecycle management
CN110163478B (en) Risk examination method and device for contract clauses
CN114168716B (en) Deep learning-based automatic engineering cost extraction and analysis method and device
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
CN112100401B (en) Knowledge graph construction method, device, equipment and storage medium for science and technology services
CN111309910A (en) Text information mining method and device
CN112199512B (en) Scientific and technological service-oriented case map construction method, device, equipment and storage medium
CN115547466B (en) Medical institution registration and review system and method based on big data
KR20220133894A (en) Systems and methods for analysis and determination of relationships from various data sources
CN114547315A (en) Case classification prediction method and device, computer equipment and storage medium
CN114647741A (en) Process automatic decision and reasoning method, device, computer equipment and storage medium
CN111651994B (en) Information extraction method and device, electronic equipment and storage medium
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
Jagdish et al. Identification of end-user economical relationship graph using lightweight blockchain-based BERT model
CN113220885B (en) Text processing method and system
CN113902569A (en) Method for identifying the proportion of green assets in digital assets and related products
CN112417996A (en) Information processing method and device for industrial drawing, electronic equipment and storage medium
CN116186257A (en) Method and system for classifying short texts based on mixed features
CN115309995A (en) Scientific and technological resource pushing method and device based on demand text
CN112487154B (en) Intelligent search method based on natural language
CN115482075A (en) Financial data anomaly analysis method and device, electronic equipment and storage medium
CN114595675A (en) Method and device for tracking difference content between documents and electronic equipment
CN110442716B (en) Intelligent text data processing method and device, computing equipment and storage medium
CN112434889A (en) Expert industry analysis method, device, equipment and storage medium
CN117807482B (en) Method, device, equipment and storage medium for classifying customs clearance notes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination