CN113032575B - Document blood relationship mining method and device based on topic model - Google Patents

Document blood relationship mining method and device based on topic model

Info

Publication number
CN113032575B
Authority
CN
China
Prior art keywords
document
target
candidate
documents
lda
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110588632.4A
Other languages
Chinese (zh)
Other versions
CN113032575A (en
Inventor
孙孟奇
尤旸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Minglue Zhaohui Technology Co Ltd filed Critical Beijing Minglue Zhaohui Technology Co Ltd
Priority to CN202110588632.4A priority Critical patent/CN113032575B/en
Publication of CN113032575A publication Critical patent/CN113032575A/en
Application granted granted Critical
Publication of CN113032575B publication Critical patent/CN113032575B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Abstract

The application provides a document blood relationship mining method and device based on a topic model. The method comprises the following steps: generating a topic model based on the document contents in a document set; screening out candidate documents of a target document from the document set based on the topic model, and adding the target document and the candidate documents to a first document blood relationship clustering set, wherein the similarity between the document contents of the target document and each candidate document is greater than a first preset threshold; deleting target candidate documents from the first document blood relationship clustering set to obtain a second document blood relationship clustering set, wherein the edit distance between the document title of the target document and the document title of each target candidate document is not smaller than a second preset threshold; and determining each document in the second document blood relationship clustering set as a blood relationship document. The method and device simplify the per-document calculation, reduce the amount of computation, improve processing efficiency, are better suited to large-scale document sets, and can improve the accuracy of document blood relationship mining.

Description

Document blood relationship mining method and device based on topic model
Technical Field
The application relates to the technical field of deep learning and natural language processing, and in particular to a document blood relationship mining method and device based on a topic model.
Background
With the development of society and technology, we have entered the information age; almost every enterprise holds a large amount of document data, and many produce new documents every day. For any enterprise, the accumulated document data is very valuable. Many of these documents are related by version iteration, for example different versions of a product description document, and such version iteration relations may also be called blood relationships between documents. The blood relationship between documents represents an implicit connection among them; mining it greatly helps the daily management and retrieval of document data and makes the data much more convenient for staff to use.
However, the number of documents is large, the documents that share a given blood relationship make up only a small part of them, several different blood relationships appear in the same document set, and in practice very little manually labeled data is available. All of this makes mining document blood relationships difficult.
Apart from managing document version information manually, existing document blood relationship mining schemes generally fall into the following two types.
Scheme one: calculate the edit distance between the character strings of the document contents in the document set, and analyze the similarities and differences between document contents by comparing the edit distances.
Scheme one works well on short, simple texts, but document blood relationships are more complex and it often cannot make an accurate judgment; for example, two documents with a blood relationship may differ considerably in length and content. In addition, when the amount of document data is large, comparing document contents pairwise is inefficient and time-consuming.
Scheme two: measure the similarity of document contents through their simhash values, calculating the Hamming distance between simhash values to obtain the similarities and differences among document contents.
Scheme two judges the similarity of document contents from their simhash values, that is, only from the character-string structure of the text. Text also carries semantic information, and judging text similarity without it introduces larger errors and has a considerable effect on the result.
Disclosure of Invention
In view of this, an object of the present application is to provide a document blood relationship mining method and device based on a topic model that simplify the per-document calculation, reduce the amount of computation, improve processing efficiency, are better suited to large-scale document sets, and improve the accuracy of document blood relationship mining.
In a first aspect, an embodiment of the present application provides a document blood relationship mining method based on a topic model, including:
performing model training on the document contents in the document set to generate a topic model;
for a target document in the document set, screening out candidate documents of the target document from the document set based on the topic model, and adding the target document and the candidate documents to a first document blood relationship clustering set; the similarity between the document content of the target document and the document content of each candidate document is greater than a first preset threshold;
deleting target candidate documents from the first document blood relationship clustering set to obtain a second document blood relationship clustering set; the edit distance between the document title of the target document and the document title of each target candidate document is not smaller than a second preset threshold;
and determining each document in the second document blood relationship clustering set as a blood relationship document.
In a possible implementation, the performing model training on the document contents in the document set to generate a topic model includes:
acquiring the document contents of all documents in a document set stored in an ES index;
and performing model training on the document contents of all the documents in the document set through an LDA algorithm, according to a preset number of topics and a preset number of training iterations, to generate an LDA topic model.
In one possible implementation, for a target document in the document set, screening out candidate documents of the target document from the document set based on the topic model, and adding the target document and the candidate documents to a first document blood relationship clustering set, includes:
for a target document in the document set, extracting keywords of the target document according to the document content of the target document;
retrieving the document set based on the keywords of the target document to obtain a candidate document list;
respectively calculating LDA vectors corresponding to the target document and the documents in the candidate document list based on an LDA topic model;
adding the target document to an initial first document blood relationship clustering set, and if the similarity between a first LDA vector and a second LDA vector is greater than a first preset threshold, adding the candidate document corresponding to the second LDA vector to the first document blood relationship clustering set; the first LDA vector is the LDA vector corresponding to the document content of the target document, and the second LDA vector is the LDA vector corresponding to the document content of a candidate document in the candidate document list.
In a possible implementation manner, the calculating, based on the LDA topic model, LDA vectors corresponding to the target document and the documents in the candidate document list respectively includes:
and respectively calculating LDA vectors corresponding to the document contents of the target document and the documents in the candidate document list based on an LDA topic model.
In a possible implementation manner, the calculating, based on the LDA topic model, LDA vectors corresponding to the target document and the documents in the candidate document list respectively includes:
respectively generating a document summary from the document content of the target document and of each document in the candidate document list;
and respectively calculating LDA vectors corresponding to the document summaries of the target document and the documents in the candidate document list based on an LDA topic model.
In one possible embodiment, the similarity is calculated by the Hellinger distance or the JS divergence.
In one possible embodiment, the determining each document in the second document blood relationship clustering set as a blood relationship document includes:
marking all the documents in the second document blood relationship clustering set with the same blood relationship label, so as to determine each document in the second document blood relationship clustering set as a document of the same blood relationship.
In a second aspect, an embodiment of the present application further provides an apparatus for document blood relationship mining based on a topic model, including:
a generating module, configured to perform model training on the document contents in the document set to generate a topic model;
a screening module, configured to, for a target document in the document set, screen out candidate documents of the target document from the document set based on the topic model, and add the target document and the candidate documents to a first document blood relationship clustering set; the similarity between the document content of the target document and the document content of each candidate document is greater than a first preset threshold;
a deleting module, configured to delete target candidate documents from the first document blood relationship clustering set to obtain a second document blood relationship clustering set; the edit distance between the document title of the target document and the document title of each target candidate document is not smaller than a second preset threshold;
and a determining module, configured to determine each document in the second document blood relationship clustering set as a blood relationship document.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, this application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.
In the document blood relationship mining method based on a topic model provided by the embodiments of the present application, first, in a data preparation stage, model training is performed on the document contents in the document set to generate a topic model, in preparation for the subsequent similarity calculation. Second, candidate documents of the target document are screened a first time by document content: the similarity between the document content of the target document and the document contents of other documents in the document set is calculated based on the topic model, and any document whose similarity is greater than the first preset threshold is added to the first document blood relationship clustering set as a candidate document. Because semantic information is used to judge the similarity of document contents, accuracy improves markedly. Third, the candidate documents are screened a second time by document title: the edit distance between the document title of the target document and the document title of each candidate document is calculated, and any candidate document whose edit distance is not smaller than the second preset threshold is deleted from the first document blood relationship clustering set as a target candidate document, yielding the second document blood relationship clustering set. This double screening by document content and document title avoids the special case in which a document's content does not match its title, and gives a more accurate result. Finally, each document in the doubly screened second document blood relationship clustering set is determined to be a blood relationship document.
The method and the device for processing the documents can simplify the calculation process of each document, reduce the calculation amount, improve the processing efficiency, are more suitable for processing large-scale documents, and can improve the accuracy of mining the blood relationship of the documents.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flowchart of a document blood relationship mining method based on a topic model according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a method for mining the document blood relationships of an entire document set according to the present application;
FIG. 3 is a schematic structural diagram of a device for document blood relationship mining based on a topic model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Apart from managing document version information manually, existing document blood relationship mining schemes generally fall into the following two types. Scheme one: calculate the edit distance between the character strings of the document contents in the document set, and analyze the similarities and differences between document contents by comparing the edit distances. Scheme one works well on short, simple texts, but document blood relationships are more complex and it often cannot make an accurate judgment; for example, two documents with a blood relationship may differ considerably in length and content. In addition, when the amount of document data is large, comparing document contents pairwise is inefficient and time-consuming. Scheme two: measure the similarity of document contents through their simhash values, calculating the Hamming distance between simhash values to obtain the similarities and differences among document contents. Scheme two judges similarity only from the character-string structure of the text. Text also carries semantic information, and judging text similarity without it introduces larger errors and has a considerable effect on the result. On this basis, the embodiments of the present application provide a document blood relationship mining method and device based on a topic model, described below by way of embodiments.
For the convenience of understanding the present embodiment, a method for document blood relationship mining based on a topic model disclosed in the embodiments of the present application will be described in detail first.
Referring to fig. 1, fig. 1 is a flowchart of a document blood relationship mining method based on a topic model according to an embodiment of the present application. As shown in fig. 1, the method may include the steps of:
S101, performing model training on the document contents in the document set to generate a topic model;
S102, for a target document in the document set, screening out candidate documents of the target document from the document set based on the topic model, and adding the target document and the candidate documents to a first document blood relationship clustering set; the similarity between the document content of the target document and the document content of each candidate document is greater than a first preset threshold;
S103, deleting target candidate documents from the first document blood relationship clustering set to obtain a second document blood relationship clustering set; the edit distance between the document title of the target document and the document title of each target candidate document is not smaller than a second preset threshold;
S104, determining each document in the second document blood relationship clustering set as a blood relationship document.
In step S101, in the data preparation phase, model training is performed on the document contents in the document set to generate a topic model, so as to prepare for subsequent similarity calculation.
Specifically, step S101 may include the following sub-steps:
S1011, acquiring the document contents of all documents in a document set stored in the ES index;
S1012, performing model training on the document contents of all the documents in the document set through an LDA algorithm, according to a preset number of topics and a preset number of training iterations, to generate an LDA topic model;
S1013, saving the generated LDA topic model as an LDA model file.
In step S1011, by default all documents in the document set are stored in the ES index; the document contents of all of them are exported in preparation for the subsequent model training.
In step S1012, the number of topics may be set manually before model training; in this embodiment the preset number of topics is 300, although this embodiment is not limited thereto. An LDA (Latent Dirichlet Allocation) topic model is used to infer the topic distribution of documents and gives the topics of each document in the document set in the form of a probability distribution, so that once the topic distributions have been extracted by analyzing the documents, the documents can be clustered according to those distributions.
In step S1013, the LDA topic model generated in step S1012 is saved as an LDA model file, which is convenient to load later.
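As a rough illustration of step S1012, the following is a minimal pure-Python sketch of LDA training by collapsed Gibbs sampling. The text only says that an LDA algorithm is run with a preset number of topics and iterations, so the choice of Gibbs sampling, the function name `train_lda`, and the hyperparameters `alpha` and `beta` are all assumptions made for illustration; a production system would more likely rely on an established LDA library.

```python
import random

def train_lda(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over already-tokenized documents.

    Returns (doc_topic, vocab), where doc_topic[d] is the topic distribution
    of document d, i.e. its LDA vector with num_topics dimensions.
    alpha and beta are assumed smoothing hyperparameters.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    n_dk = [[0] * num_topics for _ in docs]            # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                             # tokens assigned per topic
    z = []                                             # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][wid] -= 1
                n_k[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wid] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wid] += 1
                n_k[k] += 1

    # Smoothed document-topic proportions; each row sums to 1.
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return doc_topic, vocab
```

With the embodiment's setting, `num_topics` would be 300, so each document would be represented by a 300-dimensional probability vector.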
In step S102, candidate documents of the target document are screened a first time by document content. In this embodiment, the similarity between the document content of the target document and the document contents of other documents in the document set is calculated based on the topic model; if the similarity is greater than the first preset threshold, the corresponding document is added to the first document blood relationship clustering set as a candidate document.
Specifically, step S102 may include the following sub-steps:
S1021, for a target document in the document set, extracting keywords of the target document according to the document content of the target document;
S1022, retrieving the document set based on the keywords of the target document to obtain a candidate document list;
S1023, respectively calculating LDA vectors corresponding to the target document and the documents in the candidate document list based on an LDA topic model;
S1024, adding the target document to an initial first document blood relationship clustering set, and if the similarity between a first LDA vector and a second LDA vector is greater than a first preset threshold, adding the candidate document corresponding to the second LDA vector to the first document blood relationship clustering set; the first LDA vector is the LDA vector corresponding to the document content of the target document, and the second LDA vector is the LDA vector corresponding to the document content of a candidate document in the candidate document list.
In step S1021, keywords of the target document are extracted according to the document content of the target document. In the process of extracting the keywords, stop words are removed from the document content of the target document, then n keywords corresponding to the target document are extracted according to the word frequency, and n is generally 5 in practical application.
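The keyword extraction described for step S1021 can be sketched as follows. This is only an illustration: it assumes the document content has already been split into tokens (Chinese text would first need a word segmenter), and the `STOP_WORDS` set and the function name `extract_keywords` are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real deployment would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def extract_keywords(tokens, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent tokens after removing stop words."""
    counts = Counter(
        t.lower() for t in tokens if t.lower() not in stop_words
    )
    # most_common orders by frequency; ties keep first-seen order.
    return [word for word, _ in counts.most_common(n)]
```

The default `n=5` follows the value the text gives for practical use.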
In step S1022, the document set is searched with ES using the keywords of the target document, and a list of candidate documents close to the target document is recalled. The candidate document list may include the title of each candidate document, a hash value corresponding to its content, its groupId field value, and the like.
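Assuming that "ES" refers to Elasticsearch, the recall in step S1022 could be expressed as a `match` query over the document content. The sketch below only builds the request body; the field names `content`, `title`, and `content_hash` and the function name are hypothetical (only `groupId` appears in the text), and the `size` limit is an assumed parameter.

```python
def build_candidate_query(keywords, size=50):
    """Build an Elasticsearch search body that recalls documents by keyword.

    The match query ORs the keywords by default, so documents sharing
    more keywords with the target document score higher.
    """
    return {
        "size": size,  # assumed cap on the candidate list length
        "query": {"match": {"content": " ".join(keywords)}},
        # Return only the fields the later screening steps need.
        "_source": ["title", "content_hash", "groupId"],
    }
```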
In step S1023, the LDA vector is a probability distribution, unlike the semantic representation vector. The LDA vector can represent semantic information of document content to a certain extent through probability distribution of different subjects. In the present embodiment, the dimension of the LDA vector is 300 dimensions, but the present embodiment is not limited thereto.
In a preferred embodiment, LDA vectors corresponding to the document contents of the target document and the documents in the candidate document list are respectively calculated based on the LDA topic model.
To reduce the amount of computation, in another preferred embodiment a document summary is generated from the document content of the target document and of each document in the candidate document list, and the LDA vectors corresponding to the document summaries of the target document and the documents in the candidate document list are then calculated based on the LDA topic model.
In step S1024, an initial first document blood relationship clustering set is created and the target document is added to it. The target document is then compared one by one with each candidate document in the candidate document list: the similarity between the first LDA vector, corresponding to the document content of the target document, and the second LDA vector, corresponding to the document content of the candidate document, is calculated, and if it is greater than the first preset threshold, the candidate document corresponding to the second LDA vector is added to the first document blood relationship clustering set. Because semantic information is used to judge the similarity of document contents, accuracy improves markedly.
The similarity between the first LDA vector and the second LDA vector may be calculated from the Hellinger distance or the JS divergence between the two vectors.
The Hellinger distance is used to measure the similarity between two probability distributions. In practice, a threshold for the similarity of document contents (i.e., the first preset threshold) needs to be set manually; a Hellinger distance greater than 0.75 is usually taken to indicate that the contents of the two documents are similar, and otherwise they are not. Specifically, the Hellinger distance may be calculated by the following expression:
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
wherein the probability distributions are $P = (p_1, \ldots, p_k)$ and $Q = (q_1, \ldots, q_k)$.
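A minimal sketch of the Hellinger distance between two LDA vectors, using its standard definition, is given here. Under that definition the distance is 0 for identical distributions and 1 for distributions with no overlap, so this sketch assumes that the similarity compared against the first preset threshold is obtained as 1 minus the distance; that conversion, and the function names, are assumptions rather than something the text states.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same dimension")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

def hellinger_similarity(p, q):
    """Assumed conversion to a similarity: larger means more alike."""
    return 1.0 - hellinger_distance(p, q)
```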
The JS divergence is used to measure the similarity of two probability distributions, and generally, the JS divergence is symmetrical and its value is between 0 and 1. Specifically, the JS divergence can be calculated by the following expression:
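$$JS(P \,\|\, Q) = \frac{1}{2} KL(P \,\|\, M) + \frac{1}{2} KL(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q)$$
where $P$ and $Q$ are the two probability distributions and $KL(P \,\|\, M) = \sum_{i} p_i \log_2 \frac{p_i}{m_i}$.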
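The JS divergence can be sketched in the same way, again from its standard definition as the average KL divergence of each distribution to their midpoint. Base-2 logarithms are assumed here because they give the 0-to-1 range mentioned above; the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and within [0, 1] for log base 2."""
    # The midpoint is positive wherever p or q is, so no division by zero.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

As with the Hellinger distance, a smaller divergence means the two topic distributions are closer.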
In step S103, the candidate documents of the target document are screened a second time by document title. In this embodiment, the document title of the target document and the document titles of the candidate documents are first preprocessed to remove misleading stop words and to remove version indicators such as "V" and numbers, leaving only the main part of each title. The edit distance between the document title of the target document and the document title of each candidate document is then calculated; the titles are not word-segmented for this calculation, and two titles are judged similar, and therefore consistent with a document blood relationship, if the edit distance is less than or equal to 2. Candidate documents in the first document blood relationship clustering set whose edit distance is not smaller than the second preset threshold are deleted from the set as target candidate documents, and candidate documents whose edit distance is smaller than the second preset threshold are retained, yielding the second document blood relationship clustering set.
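The title screening of step S103 can be sketched as a normalization pass followed by a character-level Levenshtein distance. Everything concrete here is an assumption for illustration: the `TITLE_STOP_WORDS` list, the regular expressions used to strip version markers, the function names, and the default threshold of 3, which is inferred from the statement that a distance of at most 2 counts as similar.

```python
import re

# Hypothetical stop words that say nothing about which document a title names.
TITLE_STOP_WORDS = ("final", "draft", "copy")

def normalize_title(title):
    """Strip version markers, digits, and stop words, keeping the main part."""
    t = title.lower()
    t = re.sub(r"(?<![a-z])v(?=\d)", "", t)   # a "v" that introduces a version
    t = re.sub(r"\d+", "", t)                 # version numbers, dates, etc.
    for word in TITLE_STOP_WORDS:             # crude substring removal, since
        t = t.replace(word, "")               # titles are not word-segmented
    return re.sub(r"[\s_\-().]+", "", t)      # separators left behind

def edit_distance(a, b):
    """Levenshtein distance computed character by character."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def titles_match(title_a, title_b, threshold=3):
    """True when the normalized titles are closer than the second threshold."""
    distance = edit_distance(normalize_title(title_a), normalize_title(title_b))
    return distance < threshold
```

A candidate for which `titles_match` returns False would be the "target candidate document" that is deleted from the first clustering set.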
In step S104, blood relationship documents are documents related by historical version iteration that satisfy the following characteristics: their contents are highly similar, and their titles reflect the iteration of historical versions. Each document in the doubly screened second document blood relationship clustering set is determined to be a blood relationship document. In this embodiment, all the documents in the second document blood relationship clustering set are marked with the same blood relationship label, so that each of them is determined to be a document of the same blood relationship.
Steps S101 to S104 describe how the document blood relationship is mined for a single target document in the document set. The double screening by document content and document title avoids the special case in which a document's content does not match its title, and gives a more accurate result.
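Putting steps S102 to S104 together for one target document, the double screening can be sketched as below. The similarity and title-distance functions are passed in by the caller, so this sketch does not fix either measure; the dictionary keys `id`, `vector`, and `title`, the function name, and the use of the target's id as the shared label are all hypothetical.

```python
def mine_blood_relationship(target, candidates, similarity, title_distance,
                            sim_threshold, title_threshold):
    """Double screening for one target document.

    target and candidates are dicts with hypothetical keys "id", "vector"
    (the LDA vector), and "title". Returns a mapping of document id to label.
    """
    # S102: keep candidates whose content similarity exceeds the first threshold.
    first_set = [target] + [
        c for c in candidates
        if similarity(target["vector"], c["vector"]) > sim_threshold
    ]
    # S103: delete candidates whose title distance is not smaller than
    # the second threshold; the target itself always stays.
    second_set = [
        d for d in first_set
        if d is target
        or title_distance(target["title"], d["title"]) < title_threshold
    ]
    # S104: every surviving document receives the same label
    # (assumed here to be the target's id).
    label = target["id"]
    return {d["id"]: label for d in second_set}
```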
In order to more clearly understand the present invention, the method for mining the document blood relationship of the whole document set is specifically described by the following specific embodiments.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of a method for mining a document blood relationship of a whole document set according to the present application. As shown in fig. 2, the method may include the steps of:
S201, acquiring the document contents of all documents in a document set stored in an ES index;
S202, performing model training on the document contents of all the documents in the document set through an LDA algorithm, according to a preset number of topics and a preset number of training iterations, to generate an LDA topic model;
S203, judging whether the document set contains unprocessed documents; if so, going to step S204, otherwise going to step S212;
S204, obtaining a batch of unprocessed documents as the unprocessed documents of the current batch;
S205, for any target document among the unprocessed documents of the current batch, extracting keywords of the target document according to the document content of the target document;
S206, retrieving the unprocessed documents of the current batch based on the keywords of the target document to obtain a candidate document list;
S207, calculating, based on the LDA topic model, the LDA vectors corresponding to the document contents of the target document and of each document in the candidate document list;
S208, adding the target document to an initial first document blood relationship clustering set, and, for each document in the candidate document list, if the similarity calculated from the Hellinger distance between a first LDA vector corresponding to the document content of the target document and a second LDA vector corresponding to the document content of that document is greater than a first preset threshold, adding the document corresponding to the second LDA vector to the first document blood relationship clustering set as a candidate document;
S209, for each candidate document in the first document blood relationship clustering set, if the edit distance between the document title of the target document and the document title of the candidate document is not smaller than a second preset threshold, deleting that candidate document, as a target candidate document, from the first document blood relationship clustering set to obtain a second document blood relationship clustering set;
S210, marking all the documents in the second document blood relationship clustering set with the same blood relationship label, and updating the marked blood relationship labels into the ES index;
S211, judging whether the unprocessed documents of the current batch still contain a target document without a blood relationship label; if so, going to step S205, otherwise going to step S203;
S212, pausing for a predetermined time period, and then going to step S203.
In step S203, when it is determined that the document set contains no unprocessed documents, processing is paused for a predetermined time period T. In this embodiment, T is 20 minutes, which prevents the process from continuously occupying resources such as CPU and memory. New documents may be added to the document set during the pause, so when the period T ends the process returns to step S203 and the determination is made again.
In step S204, a newly uploaded document has no blood relationship label field, i.e., no groupId field, in the ES index. In this embodiment, therefore, only the documents without that field in the ES index are obtained as the documents to be processed. When the number of unprocessed documents in the document set exceeds the set batch size limit, only one batch of documents is taken as the unprocessed documents of the current batch; when the number is below the limit, all the documents to be processed are obtained.
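The polling and batching behaviour of steps S203, S204 and S212 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory list of dictionaries stands in for the ES index, and the names `take_batch`, `run`, `load_docs`, `process_batch`, `batch_limit` and `max_rounds` are hypothetical. Only the 20-minute pause and the rule "a document without a groupId field is unprocessed" come from the description.

```python
import time
from typing import Callable, Dict, List, Optional

Doc = Dict[str, object]


def take_batch(docs: List[Doc], batch_limit: int) -> List[Doc]:
    """S204: keep documents that lack a groupId field, capped at one batch."""
    pending = [d for d in docs if "groupId" not in d]
    return pending[:batch_limit]


def run(load_docs: Callable[[], List[Doc]],
        process_batch: Callable[[List[Doc]], None],
        batch_limit: int = 100,              # hypothetical batch size limit
        pause_seconds: float = 20 * 60,      # T = 20 minutes in the embodiment
        max_rounds: Optional[int] = None,    # hypothetical stop condition for demos
        sleep: Callable[[float], None] = time.sleep) -> None:
    """S203/S212: process batches while work exists, otherwise pause and re-check."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        batch = take_batch(load_docs(), batch_limit)
        if batch:
            process_batch(batch)             # stands in for steps S205-S211
        else:
            sleep(pause_seconds)             # S212, then back to S203
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without actually waiting.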
In step S206, ES is used to search the unprocessed documents of the current batch according to the keywords of the target document, yielding a list of candidate documents close to the target document. Each entry in the candidate list comprises the title of the candidate document, the hash value of the content of the candidate document, and the groupId field value of the candidate document. If a document in the candidate list has no groupId field value, i.e., it is a newly added document, its groupId field value is marked as "new_doc".
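One possible shape for an entry of the candidate list is sketched below. The field names `title` and `contentHash`, the helper `candidate_record`, and the choice of SHA-256 are assumptions made for illustration; the description only states that each entry carries a title, a content hash and a groupId value, with "new_doc" used when no groupId exists.

```python
import hashlib
from typing import Dict, Optional

NEW_DOC = "new_doc"  # placeholder value named in the description


def candidate_record(title: str, content: str,
                     group_id: Optional[str] = None) -> Dict[str, str]:
    """Build one candidate-list entry: title, content hash, groupId."""
    return {
        "title": title,
        # SHA-256 is an assumption; the description does not name a hash function.
        "contentHash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        # Documents without a groupId are treated as newly added documents.
        "groupId": group_id if group_id else NEW_DOC,
    }
```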
In step S210, the blood relationship labels of the documents in the second document blood relationship clustering set are counted, and the most frequent label is assigned to every document in the set whose label is marked "new_doc"; those documents are then deleted from the to-be-processed document list and are not processed again. If the blood relationship labels of all the documents in the second document blood relationship clustering set are "new_doc", any one document is taken from the set, its id value is used as the blood relationship label of the set, and that label is marked on every document in the set.
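The label assignment of step S210 can be sketched as a majority vote. The sketch reads the step as "give the most frequent existing label to the documents still marked new_doc", which is an interpretation of the translated text; the function name `assign_label` and the dictionary layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List

NEW_DOC = "new_doc"


def assign_label(cluster: List[Dict[str, str]]) -> str:
    """Pick a blood relationship label for a cluster and apply it to new documents.

    Each item is expected to carry an 'id' and a 'groupId' key.
    """
    existing = Counter(d["groupId"] for d in cluster if d["groupId"] != NEW_DOC)
    if existing:
        # Most frequent label already present in the cluster.
        label = existing.most_common(1)[0][0]
    else:
        # All documents are new: any document's id may serve as the label;
        # taking the first one is an arbitrary choice.
        label = cluster[0]["id"]
    for d in cluster:
        if d["groupId"] == NEW_DOC:
            d["groupId"] = label
    return label
```

Documents that already carry a different label are left untouched in this sketch, since the description does not say how conflicting existing labels should be reconciled.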
The marked blood relationship labels are then updated into the ES index. Before the blood relationship labels (i.e., the values of the groupId field of the document data) are updated, ES is searched by the groupId field value of the target document to find all documents with the same groupId field value as the target document. The same-blood-relationship document lists of the target document and of those documents are then updated. Both new and old documents are involved: for a new document, the groupId field value and the same-blood-relationship document list are both updated in ES; for an old document, only its same-blood-relationship document list is updated.
Document blood relationship mining thus proceeds incrementally: blood relationship labels are assigned to new documents that do not yet have one, and, on the basis of these labels, a list of same-blood-relationship documents is maintained for each document to facilitate query and retrieval.
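The update described above can be illustrated with an in-memory dictionary standing in for the ES index. This is only a sketch under assumptions: no Elasticsearch API is used, and the names `update_index` and the `related` field (holding the same-blood-relationship document list) are hypothetical.

```python
from typing import Dict, Iterable, List


def update_index(index: Dict[str, dict], new_ids: Iterable[str],
                 group_id: str) -> List[str]:
    """Write the label to new documents and refresh every member's related list."""
    # New documents: set the groupId field.
    for doc_id in new_ids:
        index[doc_id]["groupId"] = group_id
    # Find all documents sharing this groupId (new and old alike).
    members = sorted(i for i, d in index.items() if d.get("groupId") == group_id)
    # Every member's list is refreshed; old documents keep their groupId unchanged.
    for doc_id in members:
        index[doc_id]["related"] = [m for m in members if m != doc_id]
    return members
```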
In the topic-model-based document blood relationship mining method described above, first, in a data preparation stage, model training is performed on the document contents in the document set to generate a topic model in preparation for the subsequent similarity calculation. Second, the candidate documents of the target document are screened a first time by document content: the similarity between the document content of the target document and that of each other document in the document set is calculated based on the topic model, and a document whose similarity is greater than a first preset threshold is added to a first document blood relationship clustering set as a candidate document. Because semantic information is used to judge the similarity of document contents, accuracy is markedly improved. Third, the candidate documents are screened a second time by document title: the edit distance between the document title of the target document and that of each candidate document is calculated, and a candidate document whose edit distance is not smaller than a second preset threshold is deleted, as a target candidate document, from the first document blood relationship clustering set to obtain a second document blood relationship clustering set. This double screening by document content and document title avoids the special case in which the content does not match the title and yields a more accurate result. Finally, each document in the doubly screened second document blood relationship clustering set is determined to be a same-blood-relationship document.
The method simplifies the computation required for each document, reduces the amount of calculation, and improves processing efficiency; it is therefore better suited to large-scale document processing and can improve the accuracy of document blood relationship mining.
Based on the same technical concept, embodiments of the present application further provide a topic-model-based document blood relationship mining device, an electronic device, a computer storage medium, and the like, for which reference may be made to the following embodiments.
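The second screening by title can be sketched with a standard Levenshtein edit distance computed on raw characters, i.e., without word segmentation, as the claims require. The function names and the example threshold are hypothetical; a candidate is kept only when its distance to the target title is smaller than the threshold.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed character by character."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def screen_by_title(target_title: str, candidate_titles: List[str],
                    second_threshold: int) -> List[str]:
    """Drop candidates whose title distance is not smaller than the threshold."""
    return [t for t in candidate_titles
            if edit_distance(target_title, t) < second_threshold]
```

For instance, with a hypothetical threshold of 3, "report_v2.docx" survives as a candidate for "report_v1.docx" because the titles differ by one character, while an unrelated title is removed even if its content happened to pass the first screening.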
Referring to fig. 3, fig. 3 is a schematic structural diagram of a document blood relationship mining device based on a topic model according to an embodiment of the present application. As shown in fig. 3, the apparatus may include:
the generating module 10 is configured to perform model training on document contents in a document set to generate a topic model;
a screening module 20, configured to, for a target document in the document set, screen candidate documents of the target document from the document set based on the topic model, and add the target document and the candidate documents to a first document blood relationship clustering set; the similarity between the document content of the target document and the document content of each candidate document is greater than a first preset threshold;
a deleting module 30, configured to delete a target candidate document from the first document blood relationship clustering set to obtain a second document blood relationship clustering set; the edit distance between the document title of the target document and the document title of the target candidate document is not smaller than a second preset threshold;
and the determining module 40 is configured to determine each document in the second document blood relationship clustering set as a blood relationship document.
In a possible implementation, the generation module 10 comprises:
the acquisition unit is used for acquiring the document contents of all documents in a document set stored in the ES index;
and the generating unit is used for performing model training on the document contents of all the documents in the document set through an LDA algorithm, according to a preset number of topics and a preset number of training iterations, to generate an LDA topic model.
In one possible embodiment, the screening module 20 comprises:
the extracting unit is used for extracting keywords of the target documents in the document set according to the document contents of the target documents;
the retrieval unit is used for retrieving the document set based on the keywords of the target document to obtain a candidate document list;
the calculation unit is used for respectively calculating LDA vectors corresponding to the target document and the documents in the candidate document list based on an LDA topic model;
the screening unit is used for adding the target document into an initial first document blood relationship clustering set, and if the similarity between a first LDA vector and a second LDA vector is greater than a first preset threshold value, adding a candidate document corresponding to the second LDA vector into the first document blood relationship clustering set; the first LDA vector is an LDA vector corresponding to the document content of the target document, and the second LDA vector is an LDA vector corresponding to the document content of the candidate document in the candidate document list.
In a possible implementation, the computing unit is specifically configured to: and respectively calculating LDA vectors corresponding to the document contents of the target document and the documents in the candidate document list based on an LDA topic model.
In a possible implementation, the computing unit is specifically configured to:
respectively carrying out document abstract generation on document contents of the target document and each document in the candidate document list;
and respectively calculating LDA vectors corresponding to the document abstracts of the target document and the documents in the candidate document list based on an LDA topic model.
In one possible embodiment, the similarity is calculated by the Hellinger distance or the JS divergence.
In a possible implementation, the determining module 40 is specifically configured to: mark all the documents in the second document blood relationship clustering set with the same blood relationship label, thereby determining each document in the second document blood relationship clustering set as a same-blood-relationship document.
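Both measures compare two topic distributions, such as the LDA vectors of two documents. The sketch below uses their standard definitions; turning a distance into a similarity as 1 minus the distance is an assumption made here for illustration, since the description does not state the conversion. The function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two probability vectors, in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence with base-2 logarithm, in [0, 1]."""
    def kl(x: Sequence[float], m: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, m) if a > 0)

    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def similarity(p: Sequence[float], q: Sequence[float],
               measure: Callable[[Sequence[float], Sequence[float]], float] = hellinger
               ) -> float:
    """Assumed conversion: similarity = 1 - distance, compared against the first threshold."""
    return 1.0 - measure(p, q)
```

Both functions return 0 for identical distributions and 1 for distributions with disjoint support, so the same threshold scale can be used with either measure in this sketch.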
An embodiment of the present application discloses an electronic device, as shown in fig. 4, including: a processor 401, a memory 402, and a bus 403, the memory 402 storing machine-readable instructions executable by the processor 401, the processor 401 and the memory 402 communicating via the bus 403 when the electronic device is operating. The machine readable instructions, when executed by the processor 401, perform the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, which is not described herein again.
The computer program product of the method for mining the blood relationship of the document based on the topic model provided in the embodiment of the present application includes a computer readable storage medium storing a nonvolatile program code executable by a processor, where instructions included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, and details are not repeated herein.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific embodiments of the present application, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application and shall be covered by its protection scope. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A method for mining a document blood relationship based on a topic model is characterized by comprising the following steps:
performing model training on the document contents in the document set to generate a topic model;
for target documents in the document set, retrieving the document set based on keywords of the target documents to obtain a candidate document list, screening candidate documents of the target documents from the candidate document list based on the topic model, and adding the target documents and the candidate documents into a first document blood relationship clustering set; the similarity between the document content of the target document and the document content of the candidate document is larger than a first preset threshold value;
deleting the target candidate document from the first document blood relationship clustering set to obtain a second document blood relationship clustering set; the editing distance between the document title of the target document and the document title of the target candidate document is not smaller than a second preset threshold value; when the editing distance is calculated, the document title is not subjected to word segmentation, so that the document title can reflect iteration of a historical version;
determining each document in the second document consanguinity relation clustering set as a consanguinity relation document;
the screening out a candidate document of the target document from the document set based on the topic model aiming at the target document in the document set, and adding the target document and the candidate document into a first document blood relationship clustering set, including:
aiming at a target document in the document set, extracting a keyword of the target document according to the document content of the target document;
retrieving the document set based on the keywords of the target document to obtain a candidate document list;
respectively calculating LDA vectors corresponding to the target document and the documents in the candidate document list based on an LDA topic model;
adding the target document into an initial first document blood relationship clustering set, and if the similarity between a first LDA vector and a second LDA vector is greater than a first preset threshold value, adding a candidate document corresponding to the second LDA vector into the first document blood relationship clustering set; the first LDA vector is an LDA vector corresponding to the document content of the target document, and the second LDA vector is an LDA vector corresponding to the document content of the candidate document in the candidate document list.
2. The method of claim 1, wherein performing model training on the document contents in the document set to generate a topic model comprises:
acquiring document contents of all documents in a document set stored in an ES index;
and performing model training on the document contents of all the documents in the document set through an LDA algorithm, according to a preset number of topics and a preset number of training iterations, to generate an LDA topic model.
3. The method of claim 1, wherein said calculating LDA vectors corresponding to the target document and the documents in the candidate document list based on the LDA topic model comprises:
and respectively calculating LDA vectors corresponding to the document contents of the target document and the documents in the candidate document list based on an LDA topic model.
4. The method of claim 1, wherein said calculating LDA vectors corresponding to the target document and the documents in the candidate document list based on the LDA topic model comprises:
respectively carrying out document abstract generation on document contents of the target document and each document in the candidate document list;
and respectively calculating LDA vectors corresponding to the document abstracts of the target document and the documents in the candidate document list based on an LDA topic model.
5. The method of claim 1, wherein the similarity is calculated by Hellinger distance or JS divergence.
6. The method of claim 1, wherein determining each document in the second document blood relationship clustering set as a same-blood-relationship document comprises:
marking all the documents in the second document blood relationship clustering set with the same blood relationship label, thereby determining each document in the second document blood relationship clustering set as a same-blood-relationship document.
7. An apparatus for document blood relationship mining based on topic model, comprising:
the generating module is used for performing model training on the document contents in the document set to generate a topic model;
the screening module is used for retrieving the document set based on the keywords of the target documents aiming at the target documents in the document set to obtain a candidate document list, screening the candidate documents of the target documents from the candidate document list based on the topic model, and adding the target documents and the candidate documents into a first document blood relationship clustering set; the similarity between the document content of the target document and the document content of the candidate document is larger than a first preset threshold value;
the deleting module is used for deleting the target candidate document from the first document blood relationship clustering set to obtain a second document blood relationship clustering set; the editing distance between the document title of the target document and the document title of the target candidate document is not smaller than a second preset threshold value; when the editing distance is calculated, the document title is not subjected to word segmentation, so that the document title can reflect iteration of a historical version;
the determining module is used for determining each document in the second document blood relationship clustering set as a blood relationship document;
the screening out a candidate document of the target document from the document set based on the topic model aiming at the target document in the document set, and adding the target document and the candidate document into a first document blood relationship clustering set, including:
aiming at a target document in the document set, extracting a keyword of the target document according to the document content of the target document;
retrieving the document set based on the keywords of the target document to obtain a candidate document list;
respectively calculating LDA vectors corresponding to the target document and the documents in the candidate document list based on an LDA topic model;
adding the target document into an initial first document blood relationship clustering set, and if the similarity between a first LDA vector and a second LDA vector is greater than a first preset threshold value, adding a candidate document corresponding to the second LDA vector into the first document blood relationship clustering set; the first LDA vector is an LDA vector corresponding to the document content of the target document, and the second LDA vector is an LDA vector corresponding to the document content of the candidate document in the candidate document list.
8. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 6.
CN202110588632.4A 2021-05-28 2021-05-28 Document blood relationship mining method and device based on topic model Active CN113032575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110588632.4A CN113032575B (en) 2021-05-28 2021-05-28 Document blood relationship mining method and device based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110588632.4A CN113032575B (en) 2021-05-28 2021-05-28 Document blood relationship mining method and device based on topic model

Publications (2)

Publication Number Publication Date
CN113032575A CN113032575A (en) 2021-06-25
CN113032575B true CN113032575B (en) 2022-05-17

Family

ID=76456158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110588632.4A Active CN113032575B (en) 2021-05-28 2021-05-28 Document blood relationship mining method and device based on topic model

Country Status (1)

Country Link
CN (1) CN113032575B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553825B (en) * 2021-07-23 2023-03-21 安徽商信政通信息技术股份有限公司 Method and system for analyzing context relationship of electronic official document

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631769A (en) * 2012-08-23 2014-03-12 北京百度网讯科技有限公司 Method and device for judging consistency between file content and title
CN104298776A (en) * 2014-11-04 2015-01-21 苏州大学 LDA model-based search engine result optimization system
CN107844493A (en) * 2016-09-19 2018-03-27 上海泓智信息科技有限公司 A kind of file association method and system
CN108829819A (en) * 2018-06-12 2018-11-16 上海智臻智能网络科技股份有限公司 Personalized text recommended method and system, server, readable storage medium storing program for executing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731828B (en) * 2013-12-24 2017-12-05 华为技术有限公司 A kind of cross-cutting Documents Similarity computational methods and device

Also Published As

Publication number Publication date
CN113032575A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
US11341419B2 (en) Method of and system for generating a prediction model and determining an accuracy of a prediction model
US11853334B2 (en) Systems and methods for generating and using aggregated search indices and non-aggregated value storage
US9836541B2 (en) System and method of managing capacity of search index partitions
US11580119B2 (en) System and method for automatic persona generation using small text components
US10824686B2 (en) System and method for searching based on text blocks and associated search operators
US9298757B1 (en) Determining similarity of linguistic objects
CN112395875A (en) Keyword extraction method, device, terminal and storage medium
CN107169011B (en) Webpage originality identification method and device based on artificial intelligence and storage medium
CN113626443B (en) Index data processing method, device, computer equipment and storage medium
CN113032575B (en) Document blood relationship mining method and device based on topic model
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN114090769A (en) Entity mining method, entity mining device, computer equipment and storage medium
CN114969349B (en) Text processing method and device, electronic equipment and medium
CN111651675A (en) UCL-based user interest topic mining method and device
CN113449063B (en) Method and device for constructing document structure information retrieval library
CN108475265A (en) Obtain the method and apparatus of unregistered word
CN115952800A (en) Named entity recognition method and device, computer equipment and readable storage medium
CN114528378A (en) Text classification method and device, electronic equipment and storage medium
CN113868481A (en) Component acquisition method and device, electronic equipment and storage medium
US11373230B1 (en) Probabilistic determination of compatible content
CN112926297A (en) Method, apparatus, device and storage medium for processing information
Lee et al. Automatic stop word generation for mining software artifact using topic model with pointwise mutual information
CN112632981A (en) New word discovery method and device
US11836176B2 (en) System and method for automatic profile segmentation using small text variations
CN113807429B (en) Enterprise classification method, enterprise classification device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant