CN117951256B - Document duplicate checking method based on hierarchical feature vector search

Document duplicate checking method based on hierarchical feature vector search

Info

Publication number
CN117951256B
CN117951256B (application CN202410338736.3A)
Authority
CN
China
Prior art keywords
document
word
vector
semantic
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410338736.3A
Other languages
Chinese (zh)
Other versions
CN117951256A (en)
Inventor
张煇
李龙
赵建峰
马小娴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changhe Information Co ltd
Beijing Changhe Digital Intelligence Technology Co ltd
Original Assignee
Changhe Information Co ltd
Beijing Changhe Digital Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changhe Information Co ltd, Beijing Changhe Digital Intelligence Technology Co ltd filed Critical Changhe Information Co ltd
Priority to CN202410338736.3A priority Critical patent/CN117951256B/en
Publication of CN117951256A publication Critical patent/CN117951256A/en
Application granted granted Critical
Publication of CN117951256B publication Critical patent/CN117951256B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a document duplicate checking method based on hierarchical feature vector search, relating to the technical field of document duplicate checking. The method comprises the following steps: acquiring a document set; extracting word vectors from the acquired document set by a word frequency method as primary feature vectors; for each document, expanding the word vectors to phrase and sentence semantic features with a character-level Enhanced Word2Vec model, based on the primary feature vector, as secondary feature vectors; for each document, obtaining its topic distribution with an LDA topic model as a tertiary feature vector; building inverted indexes of the primary, secondary and tertiary feature vectors as an index library; and inputting a document to be checked, extracting its primary, secondary and tertiary feature vectors in turn, and computing the similarity between the document to be checked and each document in the index library with a vector space model. Aiming at the problem of low document duplicate checking precision in the prior art, the application improves checking precision.

Description

Document duplicate checking method based on hierarchical feature vector search
Technical Field
The application relates to the technical field of document duplication checking, in particular to a document duplication checking method based on hierarchical feature vector searching.
Background
In today's age of digital information explosion, the generation and propagation of massive numbers of documents has become the norm, and document duplicate checking has become a pressing problem. Document duplicate checking is to find documents with high mutual similarity from a large document collection, and it is of great significance in academia, publishing, news media and other fields. However, conventional document duplicate checking methods have many shortcomings, such as low checking precision, low efficiency, and susceptibility to changes in document structure and content. A more efficient and accurate document duplicate checking method is therefore needed to solve these problems.
Traditional document duplicate checking methods rely mainly on simple word-frequency-based or character-based matching algorithms and cannot fully account for semantic similarity between documents, so the accuracy of their results is low. In addition, as documents grow longer and their content becomes more complex, the efficiency of conventional approaches is also challenged. There is therefore a need to develop a more efficient and accurate document duplicate checking method to meet the growing demands of document processing.
In the related art, for example, Chinese patent document CN113722427A provides a paper duplicate checking method based on a feature vector space: a space vector model of the documents to be checked is constructed; the document to be checked is uploaded to a server, its basic information is extracted, its space vector model is constructed, and similarity is measured: the similarity between two papers is measured by the positional relationship of their corresponding feature vectors in space. A background thread computes cosine similarity scores between the imported document and the document library. When the similarity is greater than the threshold, the result is reported as failing the check; when the similarity is less than or equal to the threshold, the result is reported as passing. However, this scheme only uses the basic information of the document and a space vector model as features; its feature expression capability is limited and cannot fully represent the semantic content of the document, so its duplicate checking precision needs further improvement.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem of low document duplicate checking precision in the prior art, the application provides a document duplicate checking method based on hierarchical feature vector search, which improves checking precision through duplicate checking with hierarchical feature vectors.
2. Technical solution
The aim of the application is achieved by the following technical scheme.
The embodiment of the specification provides a document duplicate checking method based on hierarchical feature vector search, comprising the following steps: acquiring a document set; extracting word vectors from the acquired document set by a word frequency method as primary feature vectors; for each document, expanding the word vectors to phrase and sentence semantic features with a character-level Enhanced Word2Vec model, based on the primary feature vector, as secondary feature vectors; for each document, obtaining its topic distribution with an LDA topic model as a tertiary feature vector; building inverted indexes of the primary, secondary and tertiary feature vectors as an index library; inputting a document to be checked, extracting its primary, secondary and tertiary feature vectors in turn, and computing the similarity between the document to be checked and each document in the index library with a vector space model; and outputting the retrieval results in order of similarity from high to low.
Specifically, when the document set is obtained, publicly available internet documents including news, academic papers and blogs are crawled and their text content is stored. The text is cleaned, including preprocessing such as encoding conversion and stop-word filtering. The preprocessed document content is stored as a document set D. For each document d in the document set D, the frequency TF1 of each word in document d is counted as the local word frequency, and the frequency TF2 of each word over the whole document set D is counted as the global word frequency. Laplace smoothing is applied to TF1 and back-off smoothing to TF2. Term frequency-inverse document frequency (TF-IDF) features are then constructed: for each document d, a word frequency vector is built from the TF-IDF of its words as the primary feature vector. All documents in the document set D are processed in this way to obtain all primary feature vectors.
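A minimal sketch of this step in Python (function and variable names such as build_primary_vectors are illustrative, not taken from the patent); it counts the local term frequency per document, the document frequency over the whole set, and combines them into a TF-IDF vector per document:

```python
import math
from collections import Counter

def build_primary_vectors(docs):
    """docs: list of pre-tokenized documents (each a list of words).
    Returns one {word: tfidf} dict per document as the primary feature vector."""
    n_docs = len(docs)
    # document frequency: how many documents each word appears in
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf1 = Counter(doc)                                   # local word frequency TF1
        total = sum(tf1.values())
        vec = {}
        for w, n in tf1.items():
            tf = n / total                                   # normalized term frequency
            idf = math.log((1 + n_docs) / (1 + df[w])) + 1.0 # smoothed inverse document frequency
            vec[w] = tf * idf
        vectors.append(vec)
    return vectors

# usage: primary = build_primary_vectors([["document", "duplicate"], ["feature", "vector", "duplicate"]])
```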
More specifically, several word frequency methods may be adopted to extract the primary feature vector. The TF-IDF method, the most typical word frequency method, computes the product of a word's term frequency TF and its inverse document frequency IDF as the word's weight. The BM25 method is similar to TF-IDF but computes the weight with a more complex formula that accounts for the nonlinear contribution of term frequency. Context-based word frequency statistics count not only a word's frequency in the document but also the frequency features of its preceding and following context words. Position-based word frequency statistics divide the document into several intervals and count word frequencies per interval to reflect the distribution of words across positions in the document. Sliding-window word frequency statistics slide a fixed-length window over the document and count word frequencies within each window. N-gram word frequency statistics count the frequencies of phrases rather than single words. Weight-based word frequency statistics assign words different weights according to part of speech, stop-word status and the like before counting.
For the character-level Enhanced Word2Vec model: the Word2Vec model learns word vectors with a neural network. Character-level Word2Vec learns not only word vectors but also vector representations of characters, thereby modeling word-internal structure; on this basis, vector representations of out-of-vocabulary words can be learned. In the application, a character-level Word2Vec model is used to learn vector representations of the words in the document set, and the Enhanced Word2Vec model is used to learn semantic feature representations of phrases and sentences; that is, a context learning algorithm exploits the relationships between word vectors to obtain vector representations of phrases and sentences, extending single words to higher-level semantic units. The context learning algorithm is applied to expand, from the first-level word vectors, vector representations of phrases and sentences as higher-level semantic features.
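The patent's character-level Enhanced Word2Vec is not a published library; as a hedged stand-in, the sketch below uses gensim's FastText, which likewise learns character n-gram (subword) vectors and can compose vectors for unseen words. Parameter values are assumptions for illustration only:

```python
from gensim.models import FastText

# sentences: tokenized documents from the document set (illustrative toy data)
sentences = [["document", "duplicate", "checking"], ["hierarchical", "feature", "vector"]]

# Subword (character n-gram) vectors let the model compose representations for
# out-of-vocabulary words, mirroring the character-level idea described above.
model = FastText(sentences, vector_size=100, window=5, min_count=1, min_n=2, max_n=4)

word_vec = model.wv["document"]    # word-level vector
oov_vec = model.wv["documents"]    # composed from character n-grams even if the word was unseen
```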
LDA (Latent Dirichlet Allocation) is an unsupervised probabilistic topic model. It assumes that a document is a mixture of topics and that the topic structure of a document collection can be discovered by learning. In the application, an LDA model is applied to learn topics over the document set and find the topic distribution within it. The topic distribution is one result of LDA learning and represents the probability with which a document involves each topic: if the topic distribution of document d is {0.7, 0.2, 0.1}, then 70% of the document discusses the first topic, 20% the second and 10% the third. The topic distribution vector of each document is extracted as its tertiary feature vector, representing the topics the document involves and their probabilities.
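A short sketch of obtaining a per-document topic distribution with gensim's LdaModel (the toy corpus and topic count are assumptions, not values from the patent):

```python
from gensim import corpora
from gensim.models import LdaModel

tokenized_docs = [["topic", "model", "document"], ["vector", "search", "document"]]
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10)

# Topic distribution of one document, used here as its tertiary feature vector.
doc_topics = lda.get_document_topics(corpus[0], minimum_probability=0.0)
# e.g. [(0, 0.7), (1, 0.2), (2, 0.1)] -> "70% topic 0, 20% topic 1, 10% topic 2"
```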
Specifically, the first-level feature vector index creates an inverted list for each word, storing all documents in which the word appears. For each document, the non-zero words and weights of its primary TF-IDF feature vector are written into the inverted lists of the corresponding words, and a word-document inverted index is built in an inverted indexing system (e.g., Elasticsearch). For the secondary feature vectors, the sentence/phrase vectors of each document are written into an indexing system, and a vector-document index is built in a vector indexing system (e.g., Milvus). For the tertiary feature vectors, the topic distribution vector of each document is written into the indexing system and a vector-document index is constructed. The three indexes store feature vectors of different granularities and record the documents they correspond to, so as to support subsequent similarity queries. Index building may be implemented on open-source systems such as Elasticsearch and Milvus.
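An in-memory sketch of the first-level inverted index (word -> posting list of documents and weights); in production the same postings would live in Elasticsearch and the dense second/third-level vectors in a vector index such as Milvus. The function name is illustrative:

```python
from collections import defaultdict

def build_inverted_index(primary_vectors):
    """primary_vectors: {doc_id: {word: tfidf_weight}} for the first-level features.
    Returns an inverted list per word: word -> {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, vec in primary_vectors.items():
        for word, weight in vec.items():
            if weight > 0:                      # only non-zero terms are posted
                index[word][doc_id] = weight
    return index

# usage: index = build_inverted_index({"d1": {"document": 0.4}, "d2": {"vector": 0.3}})
```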
Specifically, features of the document to be checked are extracted: the input document is preprocessed, including word segmentation. The first-level word frequency vector is extracted by counting word frequencies and constructing the TF-IDF word vector; word vectors of phrases and sentences are obtained from the context word vector model; and the document topic distribution is predicted with the pre-trained LDA model. This yields the document's primary word frequency vector, secondary context vector and tertiary topic vector. Similarity is then calculated: the cosine similarity between the first-level word vector of the document to be checked and the first-level vector of each document in the index library, the similarity of the secondary phrase/sentence vectors, and the cosine similarity of the tertiary topic vectors. The three levels of similarity are aggregated into a comprehensive similarity score between the document to be checked and each indexed document. The indexed documents are sorted by similarity score and a similar-document list is output from high to low.
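A minimal sketch of the vector-space similarity and the three-level fusion. For simplicity it treats every level as a sparse {key: weight} dict, and the fusion weights are assumptions (the patent does not specify how the three similarities are combined):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {key: weight} dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_score(query_feats, doc_feats, weights=(0.3, 0.4, 0.3)):
    """Fuse the similarities of the three feature levels into one score.
    query_feats / doc_feats: (level1, level2, level3) sparse vectors.
    weights: assumed fusion weights, not specified in the patent."""
    return sum(w * cosine(q, d) for w, q, d in zip(weights, query_feats, doc_feats))
```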
Specifically, for similarity calculation, the document to be checked is input and, following the above flow, the similarities of its primary, secondary and tertiary feature vectors with each document in the index library are computed. The three levels of similarity are aggregated into a comprehensive similarity score for each indexed document. The indexed documents are then sorted from high to low by comprehensive score, a similarity threshold is set, and dissimilar documents below the threshold are filtered out. The sorted and filtered results are output as the retrieval result: a list of similar documents ordered from most to least similar, together with the similarity score of each document. Functions such as document content viewing and comparison are provided so that users can judge the degree of repetition. Outputting the sorted and filtered similar-document list with its scores presents the degree of document similarity intuitively and guides the handling of duplicated content.
Further, obtaining the primary feature vector further includes: for each document, counting the frequency TF1 of each word in that document as the local word frequency, and the frequency TF2 of each word in the document set as the global word frequency; applying Laplace smoothing to TF1 and back-off smoothing to TF2; and weighting and combining the smoothed TF1 and TF2 to obtain the primary feature vector.
Specifically, TF1 (the frequency of a word within a single document) is calculated for each document d in the document set: the number of occurrences of word w in document d is counted and denoted n, and n is normalized to obtain the frequency of w in d; this is repeated for every word in d. TF2 (the frequency of a word over the whole document set) is calculated by counting the occurrences of word w in the entire set D, denoted m, and normalizing m to obtain the frequency of w in D; this is repeated for every word in D. Counting TF1 and TF2 separately yields bidirectional word frequency features of the vocabulary, which are subsequently used to construct the term frequency-inverse document frequency (TF-IDF) primary word vector.
Specifically, TF1 is smoothed: for each word w in document d, a small smoothing value s (e.g., 1) is added to its count. The smoothing formula is TF1'(w, d) = (n(w, d) + s) / (N_d + s · |V_d|), where n(w, d) is the count of w in d, N_d is the total number of word occurrences in d and V_d is the vocabulary of d; s is added for each word and the denominator is increased by s · |V_d| accordingly. This reduces the effect of zero word frequency on TF-IDF. TF2 is smoothed by back-off: for word w, if TF2(w) is 0, TF2(w) is set to a small value ε. This compensates the IDF value of words that do not appear in the vocabulary set; ε may take different values, such as the reciprocal of the vocabulary size of the entire document set. By processing TF1 with Laplace smoothing and TF2 with back-off smoothing, smoother and more continuous word frequency features are formed, avoiding the misleading influence of zero word frequency on the features.
Specifically, TF1 and TF2 are smoothed with Laplace smoothing and back-off smoothing respectively, as described above. For the weighted combination, a combination function is defined: TF(w) = α · TF1'(w) + β · TF2'(w), where α and β are weight parameters and α is generally set larger, e.g., α = 0.7, β = 0.3; TF1'(w) and TF2'(w) are the smoothed TF values. The primary feature vector is then constructed: for each document d in the document set D, the TF-IDF of each word w in d is computed as a feature value, and the TF-IDF feature values of all words form the primary feature vector of d. Smoothing and weighted fusion of TF1 and TF2 produce a more discriminative word frequency feature vector as input to the next level of feature extraction.
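A sketch of the smoothing and weighted combination just described; the parameter values follow the examples in the text (s = 1, α = 0.7, β = 0.3), while the back-off value ε and the function name are assumptions:

```python
def smooth_and_combine(tf1, tf2, doc_len, vocab_size, s=1.0, eps=1e-6,
                       alpha=0.7, beta=0.3):
    """tf1: raw counts of each word in one document; tf2: normalized global
    frequency of each word over the whole document set.
    Returns the combined word frequency used for the primary feature vector."""
    combined = {}
    for w, n in tf1.items():
        # Laplace smoothing of the local frequency
        tf1_s = (n + s) / (doc_len + s * vocab_size)
        # back-off smoothing: a zero global frequency falls back to a small value
        tf2_s = tf2.get(w, 0.0) or eps
        combined[w] = alpha * tf1_s + beta * tf2_s
    return combined

# usage: smooth_and_combine({"word": 3}, {"word": 0.01}, doc_len=20, vocab_size=10)
```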
Further, obtaining the secondary feature vector further includes: obtaining the word vector of each word with a Word2Vec model according to the extracted primary feature vector; constructing a semantic and part-of-speech association tree between words from the word vectors, the association tree representing lexical semantic relations; calculating the semantic weight of each word with a structured scoring algorithm on the association tree; concatenating each word vector with its semantic weight to obtain a semantic word vector; performing context learning on the semantic word vectors with the Enhanced Word2Vec model to obtain vector representations of phrases and sentences; and outputting a secondary feature vector containing the vector representations of words, phrases and sentences.
Specifically, input: the primary TF-IDF term frequency feature vector for each document D in the document set D. All documents are connected into a large text corpus, a Word2Vec model is trained on the corpus, and the training target is to learn Word vector representation of each vocabulary by adopting CBOW or Skip-Gram methods and the like. For each Word w in the first-level feature vector, searching a Word vector corresponding to the Word w from a trained Word2Vec modelWord vectors/>, of all wordsAnd combining to form a word vector matrix. And learning the vector representation of the phrase and the sentence based on the word vector matrix by using a context word vector model, and finally outputting the secondary feature vector of the document. The Word2Vec is used for pre-training Word vectors, so that a foundation is laid for expanding Word and sentence semantic vectors.
The association tree is a graph structure composed of nodes and edges connecting them: nodes represent words, edges represent association relations between words, and the tree structure represents hierarchical associations among words. In the application it is constructed by computing the semantic similarity of words from their word vectors and connecting semantically related words. It represents the semantic and part-of-speech associations of the vocabulary. Based on the association tree, higher semantic units, such as associations between phrases and sentences, can be found, which helps feature learning; the tree provides context information for words and enriches the semantic representation capability of word vectors. In summary, by representing association relations between words, the association tree supports subsequent semantic feature extraction and similarity calculation.
Specifically, the input is the word vector of each word in the vocabulary, obtained from a model such as Word2Vec. To construct the association tree, the cosine similarity between every pair of word vectors is computed as a measure of semantic relatedness; for each pair of words, if the similarity is greater than a threshold, an edge is established between them. Repeating this process builds a complete graph over the vocabulary, which is then partitioned according to part of speech and similarity to obtain the semantic and part-of-speech association tree. Words connected in the tree have stronger semantic or part-of-speech associations, words at shorter distances on the tree are semantically closer, and the tree represents the hierarchical semantic and part-of-speech structure of the vocabulary. Constructing the association tree expresses the semantic relations among words and lays the foundation for subsequent feature learning.
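A sketch of the edge-building step: word pairs whose cosine similarity exceeds a threshold are connected, giving the graph that is later partitioned into the association tree. The threshold value and function name are assumptions for illustration:

```python
import numpy as np

def build_association_edges(words, vectors, threshold=0.6):
    """words: vocabulary list; vectors: {word: np.ndarray} word vectors.
    Returns (word_i, word_j, similarity) edges for pairs above the threshold."""
    edges = []
    for i, wi in enumerate(words):
        for wj in words[i + 1:]:
            vi, vj = vectors[wi], vectors[wj]
            sim = float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))
            if sim > threshold:
                edges.append((wi, wj, sim))
    return edges
```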
The structured scoring algorithm is an algorithm that learns node (word) weights from a graph or tree structure; it computes each node's score by recursive iteration, combining the node's local and global association information. In the application, the input is the word association tree constructed from the word vectors; the process recursively computes a score for each word node on the association tree, reflecting its importance in the overall association structure; the output is the semantic weight score of each word. These semantic weights are used in feature extraction and similarity calculation to enhance semantic discrimination. Computing word semantic weights on the association tree enriches the semantic expression of words.
Specifically, concatenating the word vector and the semantic weight to obtain the semantic word vector includes the following steps: for each word in the text, its word vector v_w is obtained through Word2Vec model learning; its semantic weight s_w is computed with the structured scoring algorithm on the semantic association tree constructed among the words; and the word vector and the semantic weight are concatenated into an expanded vector v'_w = [v_w; s_w]. v'_w is the semantic word vector of the word and carries both the distributed semantic representation of the word vector and the structural semantic information of the semantic weight. The semantic word vectors of all words are combined into a semantic word vector matrix of the corpus, which is provided as input to downstream models for text matching, classification and the like. Fusing word vectors and semantic weights by concatenation yields semantically rich word vector representations and provides stronger semantic features for subsequent text analysis tasks.
Specifically, the Enhanced Word2Vec model is used to obtain phrase and sentence vector representations. For each word in the text, a semantically rich word vector is obtained by concatenating its word vector and semantic weight; the Enhanced Word2Vec model is trained on the corpus and, by learning the relationships between words from their context, yields phrase-level and sentence-level vector representations. The input is the word-level semantic word vectors, the output is the vector representations of phrases and sentences, and the process is that the Enhanced Word2Vec model obtains contextual semantic feature vectors through a context learning algorithm. The vector representations of the words, phrases and sentences in a text are concatenated into a new feature representation and provided as the text's secondary semantic feature vector for subsequent matching tasks. The Enhanced Word2Vec model thus extends semantic features from words to their context.
Specifically, a secondary feature vector containing word, phrase and sentence vector representations is output. The word vector of each word in the text is learned with Word2Vec or a similar model, and the phrase and sentence vector representations of the text are learned with the Enhanced Word2Vec model on top of the word vectors. For a text d, the feature vector is constructed from word-level features (the word vectors of all words appearing in d), phrase-level features (the concatenation of the vectors of all phrases in d) and sentence-level features (the concatenation of the vectors of all sentences in d): F2(d) = [word vectors; phrase vectors; sentence vectors]. The secondary feature vector F2(d) of text d is the output of this feature learning and can be used for subsequent tasks such as text similarity calculation and text classification. Combining the vector representations of words, phrases and sentences forms a semantic feature vector of the text that contains semantic information at different granularities.
Further, calculating the semantic weight of each word with the structured scoring algorithm on the association tree includes: obtaining the association tree, in which nodes are words and edges represent semantic relations between words; computing the cosine similarity between word vectors, setting words whose cosine similarity is greater than a threshold as parent nodes and words whose similarity is below the threshold as child nodes; starting from the root node, setting node weights level by level downward, with root node weight W_0 and the weight of a node at layer n given by W_n = W_0 · δ^n, where δ ∈ (0, 1) is a per-layer decay factor; for each word, taking its parent and child nodes in the association tree as objects and computing, with a string matching algorithm, the semantic relatedness of the word to its parent nodes and to its child nodes, denoted R_p and R_c respectively; and calculating the semantic weight of each word by the formula S(w) = W(w) · (α · R_p + β · R_c) · e^(−λ·d), where S(w) is the semantic weight of the word, W(w) is its node weight in the association tree, α is the parent-node relatedness weight coefficient, β is the child-node relatedness weight coefficient, R_p is the semantic relatedness of the word to its parent nodes, R_c is the semantic relatedness of the word to its child nodes, λ is a distance attenuation coefficient controlling how strongly distance weakens the relatedness, and d is the distance between the word and its parent/child nodes in the association tree.
Specifically, in the association tree there is a direct edge between a parent node and its child node: the child node points to the parent node, and the parent node reaches the child node along a longer path. The parent node sits at a higher level than the child node and has broader semantic coverage, while the child node inherits part of the parent node's semantic attributes. The root node is at the top of the tree and has no parent; leaf nodes are at the bottom and have no children; the root forms complete paths to the leaves. The root node has the widest semantic coverage and the leaf nodes the narrowest. A parent node can be regarded as the root of its child nodes, and a leaf node as a terminal child node. The node level determines how semantic coverage narrows from broad to specific. Parent-child nodes express direct semantic association, while root and leaf nodes express the hierarchical expansion of semantic coverage; together these relations form the association tree structure.
Specifically, for each word w in the text vocabulary, its word vector is obtained through Word2Vec or similar model learning. For any two words w_i and w_j, the cosine similarity of their word vectors v_i and v_j is computed: sim(w_i, w_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖). For any word pair, if the similarity sim is greater than a preset threshold, an undirected edge is added between the two words; repeating this process constructs a complete graph over the whole vocabulary. On this graph, partitioning is carried out according to part-of-speech and semantic relations, splitting the graph into several subtrees, each representing a semantic or part-of-speech association cluster, and finally a set of inter-word semantic association trees is obtained. For any two words w_i and w_j, if their similarity exceeds the threshold, w_i is taken as a parent node of w_j; if it is below the threshold, w_i is taken as a child node of w_j. The similarity of every pair of words in the vocabulary is compared in this way and parent-child relations are assigned by the threshold, forming a word hierarchy; words with parent-child relations are connected into a tree structure, and the similarity between words determines the parent-child relations in the tree.
Specifically, the node with the largest semantic coverage in the association tree is selected as the root node and given weight W_0. Starting from the root node, for each node at layer n the weight is computed as W_n = W_0 · δ^n, where δ ∈ (0, 1) is the decay factor and n is the layer of the current node; this is executed recursively down to the leaf nodes, so the weights of child nodes decay exponentially with the layer number. The weights of all nodes are then normalized so that they sum to 1. This completes the top-down recursive weighting of all nodes in the association tree.
Specifically, a semantic association tree among words is constructed in which nodes represent words. For each word node in the association tree, its parent node set F and child node set C are found. For each word w, a string matching score m(w, f_i) is computed with each parent node f_i in F, and a string matching score m(w, c_j) with each child node c_j in C. The parent-node relatedness is then R_p(w) = (1/|F|) · Σ m(w, f_i) and the child-node relatedness is R_c(w) = (1/|C|) · Σ m(w, c_j). This yields, for each word, its semantic relatedness to its parent and child nodes, from which the contextual relatedness of the word within the association tree can be computed and its semantic importance measured. Concretely: the node weight W(w) expresses the importance of the word in the whole association tree, the tree structure providing global information; the parent-node relatedness R_p expresses the semantic consistency of the word with its parent nodes and reflects contextual semantics; the child-node relatedness R_c expresses the semantic consistency of the word with its child nodes; the distance decay e^(−λ·d) expresses that distance from the parent/child nodes weakens the relatedness; and the weights α and β balance the influence of parent and child nodes. Integrating all these factors takes both global and local semantic information into account, makes full use of the structural information of the association tree, and expresses the semantic weight of a word comprehensively and accurately.
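A sketch of the semantic weight computation following the formula reconstructed above, S(w) = W(w) · (α·R_p + β·R_c) · e^(−λ·d). The string-matching measure, the parameter values and the averaging over parents/children are assumptions for illustration; the patent names a string matching algorithm without specifying it:

```python
import math
from difflib import SequenceMatcher

def string_match(a, b):
    """Simple character-level matching ratio, used as a stand-in for the
    string matching algorithm mentioned above (an assumption)."""
    return SequenceMatcher(None, a, b).ratio()

def semantic_weight(word, node_weight, parents, children,
                    alpha=0.6, beta=0.4, lam=0.5, dist=1):
    """Semantic weight of a word on the association tree.
    parents / children: lists of parent and child node words."""
    rp = sum(string_match(word, p) for p in parents) / len(parents) if parents else 0.0
    rc = sum(string_match(word, c) for c in children) / len(children) if children else 0.0
    return node_weight * (alpha * rp + beta * rc) * math.exp(-lam * dist)
```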
Further, obtaining the tertiary feature vector includes: preprocessing the collected document set D to obtain a preprocessed document set D'; building a document-word matrix M from D', in which each row corresponds to a document d, each column corresponds to a word w, and each element is the TF-IDF weight of word w in the corresponding document d; presetting the number of topics k, representing each document d in D' as its word frequency vector, and obtaining the document topic distribution θ and the topic-word distribution φ through the LDA model; and for each document d, selecting the top-N topic subset according to its topic distribution θ_d as the tertiary feature vector of document d.
Specifically, the document set is expressed as D' = {d_1, d_2, ..., d_n}, containing n documents. All words are extracted from D' to build a vocabulary W, and for each document d the TF-IDF weight of each word w in d is computed. A document-word matrix M of size n × |W| is constructed, in which the i-th row corresponds to document d_i, the j-th column corresponds to word w_j, and the matrix element M_ij is the TF-IDF weight of w_j in d_i. The matrix M is the TF-IDF weight matrix of the document words and can be used for subsequent document representation, clustering and other analyses.
Specifically, the document topic distribution and the topic-word distribution are obtained with the LDA model. The document set D' contains n documents, each document d represented as its word frequency vector. The number of topics k is preset, an LDA (Latent Dirichlet Allocation) model is constructed, the word frequency vectors of the document set D' are input, and the LDA model learns the document-topic distribution θ and the topic-word distribution φ: θ represents the distribution of each document over the k topics, and φ represents the distribution of each topic over the vocabulary. θ and φ can serve as vector representations of documents and topics and can be used for document clustering and topic interpretation.
Specifically, the tertiary feature vector of each document is constructed. For each document d in the document set D', its topic distribution θ_d represents a probability distribution over the k topics. For each document d, the k topics are sorted according to θ_d and the N topics with the highest probability are selected as the subset T_d. The N topic-word distributions φ in the topic subset T_d are taken as features and concatenated into the tertiary feature vector F_d of document d. Each document d thus outputs a tertiary feature vector F_d that fuses its topic distribution information and can serve as its vector representation at the topic level.
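A sketch of selecting the top-N topic subset of one document from a trained gensim LdaModel as its third-level feature; the function name and the choice to return (topic_id, probability) pairs are illustrative:

```python
def tertiary_feature(lda, dictionary, tokens, top_n=3):
    """Top-N topic subset of one document as its third-level feature vector.
    lda / dictionary: a trained gensim LdaModel and its Dictionary."""
    bow = dictionary.doc2bow(tokens)
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    topics = sorted(topics, key=lambda t: t[1], reverse=True)[:top_n]
    # keep (topic_id, probability) pairs; the topic-word distributions phi can be
    # appended via lda.get_topics()[topic_id] if a dense feature vector is needed
    return topics
```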
Further, the preprocessing includes: word segmentation with the MIX segmentation method and part-of-speech tagging with a conditional random field (CRF). The MIX segmentation method is a hybrid segmentation method that combines the results of several individual segmentation methods by voting or weighted selection to obtain the final segmentation result. In the application, the MIX method integrates the results of multiple models: it applies maximum matching segmentation, hidden Markov segmentation and conditional-random-field-based segmentation to the text, corrects and adjusts the segmentation result using context, and finally outputs the segmentation of the whole text.
The conditional random field is a probabilistic graphical model that can effectively exploit contextual information in sequence labeling tasks. In the application, on the basis of word segmentation, a conditional random field (CRF) model is used for part-of-speech tagging: the CRF uses context information to label the part of speech of each word, marking its grammatical attributes, and outputs the segmentation results with part-of-speech tags.
Further, the TF-IDF weight of a word in its document is smoothed with Dirichlet prior smoothing. The TF-IDF weight is a statistical measure for evaluating the importance of a word to a document in a corpus or document set. In the application, the TF-IDF weight of each word in its document is calculated as a measure of the word's importance.
Dirichlet prior smoothing is a Bayesian smoothing technique that avoids the zero-probability problem by adding a prior distribution to smooth the model's probabilities. In the application, the TF-IDF weights of words are smoothed with a Dirichlet prior, adjusting the weight values to avoid overfitting. The TF-IDF weights express word importance and the Dirichlet prior smooths their distribution; together they are used to compute the importance of words in documents.
Further, obtaining the document topic distribution θ and the topic-word distribution φ includes: constructing a bag-of-words representation from the word frequency vector of each document d in the document set D'; generating the topic distribution θ_d of document d through a Dirichlet process; generating the word distribution φ_z of each topic z through a Dirichlet process; randomly selecting a topic z according to the topic distribution θ_d of document d and then randomly generating a word w according to the word distribution φ_z of topic z; computing the probability P(d | θ, φ) that the word frequency vector of each document d is generated from the corresponding topic distribution θ_d and word distributions φ; iteratively computing, with a Gibbs sampling algorithm, the topic distribution θ_d of each document d and the word distribution φ_z of each topic z so as to maximize the generation probability P(d | θ, φ); and, after the iteration ends, combining the topic distributions θ_d of all documents d into the document topic distribution θ of the document set D', and combining the word distributions φ_z of all topics z into the topic-word distribution φ of the document set D'.
Specifically, a bag-of-words representation is built for each document in the document set D', which contains n documents. For each document d, all the words it contains and their frequencies are counted, and a word frequency vector is constructed from the counts, whose components give the frequency of each word w in d. Each document d is thus represented by its word frequency vector, keeping only frequency information and ignoring contextual information such as word order. Finally, each document d is expressed as a word frequency vector and the whole document set forms a bag-of-words representation, which can serve as input to subsequent tasks such as text classification.
The Dirichlet process is a class of stochastic processes that can be viewed as a generalization of the Dirichlet distribution to infinite dimensions, and is commonly used in non-parametric and semi-parametric generative models. In the application, the topic distribution θ_d of each document and the word distribution φ_z of each topic are generated a priori through a Dirichlet process, realizing probabilistic modeling of document-topic and topic-word relations. A base distribution G0 and a concentration parameter γ of the Dirichlet process are set; for document d, the topic distribution is generated as θ_d ~ Dirichlet(γ · G0), and for topic z, the word distribution is generated as φ_z ~ Dirichlet(β · G0). Bayesian inference with the Dirichlet process automatically learns the probability distributions of documents and topics, realizing non-parametric modeling of the topic model.
Specifically, the base distribution G0 and concentration parameter γ of the Dirichlet process are set; for each document d a corresponding topic distribution is generated from G0, θ_d ~ Dirichlet(γ · G0), and for each topic z a corresponding word distribution is generated from G0, φ_z ~ Dirichlet(β · G0). Distributions are generated repeatedly for all documents d and topics z, finally yielding the topic distribution θ_d of each document d and the word distribution φ_z of each topic z. The distributions θ and φ can be used for document topic analysis; the probability distributions of documents and topics are obtained through Bayesian generation with the Dirichlet process.
Specifically, given the topic distribution θ_d of each document d and the word distribution φ_z of each topic z: for document d, a topic z is randomly sampled according to θ_d, z ~ Multinomial(θ_d); then a word w is randomly sampled according to the word distribution of topic z, w ~ Multinomial(φ_z). This is repeated for document d until the required number of words is generated, and the random generation is repeated for all documents, finally yielding a topic-based document generation result. This can be used to automatically generate new documents and to evaluate the quality of the topic model.
Specifically, given the word frequency vector of document d, its topic distribution θ_d and the word distributions φ_z of the topics: for each word w in document d, the generation probability of w is computed by summing over topics, P(w | d) = Σ_z θ_{d,z} · φ_{z,w}. The generation probabilities of all words in document d are multiplied to obtain the overall generation probability of the document, P(d | θ, φ) = Π_{w∈d} P(w | d). This finally yields, for each document d, its generation probability under the corresponding topic model.
Specifically, given the document set D' and the corresponding word frequency vectors, the topic distributions θ and word distributions φ are initialized. For each word w of each document d, the topic of w is re-assigned by Gibbs sampling according to θ and φ, giving a global topic assignment result. According to the Gibbs sampling result, the topic distribution θ_d of each document d and the word distribution φ_z of each topic z are updated, and the document generation probability is computed with the updated θ and φ. Gibbs sampling and parameter updating are performed repeatedly until the generation probability converges or the iteration limit is reached, giving the final estimates of θ and φ that maximize the generation probability P of the document set under the topic model.
Gibbs sampling is a Markov chain Monte Carlo method that iteratively resamples each variable while holding the other variables fixed, so as to sample from the joint distribution of all variables. In the application, a topic is iteratively resampled for each word of each document, and the topic distribution and word distribution are updated from the sampling results to estimate the model parameters and maximize the document generation probability: the topic distribution θ and word distribution φ are initialized, Gibbs sampling re-assigns topics to the document words, θ and φ are updated from the sampling result, the document generation probability P(d | θ, φ) is computed, and the iteration is repeated until P is maximized. Gibbs sampling thus learns, by iteratively updating the model parameters, a topic model that maximizes the document generation probability.
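A minimal collapsed Gibbs sampler for LDA, sketching the resampling and parameter-update loop described above; the hyperparameters and fixed iteration count are assumptions, and a production system would use a library implementation instead:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word ids in [0, V).
    Returns (theta, phi): document-topic and topic-word distributions."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # document-topic counts
    nkw = np.zeros((K, V))          # topic-word counts
    nk = np.zeros(K)                # total words assigned to each topic
    z = []                          # topic assignment of every token
    for d, doc in enumerate(docs):  # random initialization
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]         # remove current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # conditional distribution p(z = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t         # re-assign and restore counts
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```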
Specifically, the document set D' contains n documents, each document d having a topic distribution θ_d, and each topic z having a word distribution φ_z. The topic distributions θ_d of all documents are stacked horizontally to form the overall document-topic distribution matrix θ of the document set D', and the word distributions φ_z of all topics are stacked vertically to form the overall topic-word distribution matrix φ of D'. θ represents the overall document topic distribution of the document set and φ its overall topic-word distribution; together they reflect the overall topic structure of the document set and can be used for topic analysis and visualization.
Further, the iteration ends when the perplexity converges; the perplexity is derived from the log-likelihood of the document set D', and convergence means that the difference between the perplexities of two adjacent iterations is smaller than a threshold. Specifically, the perplexity is defined from the log-likelihood of the document set D', which reflects the model's ability to generate the data. After each iteration, the log-likelihood of D' under the current model is computed and used as the perplexity value of the current round. The perplexity values of two adjacent rounds are compared, diff = |perplexity_t − perplexity_{t−1}|; if diff is smaller than a preset threshold ε, the perplexity is considered converged and the iteration ends. Iteration stops when the perplexity converges or a preset maximum number of rounds is reached. The final document topic distribution θ and topic-word distribution φ are then obtained; the perplexity convergence condition prevents the model from overfitting and yields an optimized topic model.
Further, constructing the index library includes: building the index library in the distributed index system Elasticsearch.
Specifically, Elasticsearch is installed and deployed: the Elasticsearch service is downloaded and installed, the configuration file elasticsearch.yml is modified, and the service is started. An index is created by sending a request to Elasticsearch that creates a custom index and specifies parameters such as the index name, number of shards and number of replicas, e.g., PUT /my_index. A mapping (the fields and types of documents) is defined through a PUT request, e.g., PUT /my_index/_mapping, specifying field names, types, analyzers and so on. Documents are indexed through POST requests specifying the index name, document id and document content, e.g., POST /my_index/_doc/1. Documents are searched through GET requests against the search endpoint of the index, with query parameters specifying the query conditions. In this way a custom index can be created in Elasticsearch and documents can be indexed and retrieved.
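A sketch of these REST calls using Python's requests library; the endpoint URL, index name and field layout are assumptions for illustration:

```python
import requests

ES = "http://localhost:9200"      # assumed local Elasticsearch endpoint

# create a custom index with shard / replica settings (PUT /my_index)
requests.put(f"{ES}/my_index", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 1}})

# define the mapping: field names, types and analyzer (PUT /my_index/_mapping)
requests.put(f"{ES}/my_index/_mapping", json={
    "properties": {"content": {"type": "text", "analyzer": "standard"},
                   "doc_id":  {"type": "keyword"}}})

# index a document (POST /my_index/_doc/1)
requests.post(f"{ES}/my_index/_doc/1",
              json={"doc_id": "d1", "content": "document to be checked ..."})

# search with a query condition (GET /my_index/_search)
resp = requests.get(f"{ES}/my_index/_search",
                    json={"query": {"match": {"content": "duplicate checking"}}})
print(resp.json()["hits"]["hits"])
```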
3. Advantageous effects
Compared with the prior art, the application has the advantages that:
(1) Through the extraction and comprehensive consideration of multi-level feature vectors, the method can analyze the similarity between documents more comprehensively, thereby improving the accuracy of duplicate checking. Not only word frequency but also semantic and topic information is considered, making the checking result more accurate and reliable;
(2) An index library is built with a distributed index system, and technologies such as Elasticsearch make the checking process more efficient. Building and using inverted indexes and pre-computing document feature vectors greatly reduce the computation time required for duplicate checking and improve checking efficiency;
(3) Through the comprehensive consideration of multi-level feature vectors, the semantic and topic features of documents are better captured, giving the checking method greater generality and adaptability;
(4) The distributed index system gives the method good scalability: it adapts effectively to both small-scale and massive document sets, and can be scaled out horizontally to cope with growing document processing demands.
Drawings
The present specification is further described by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not limiting; in the drawings, like numerals represent like structures, wherein:
FIG. 1 is an exemplary flow chart of a document duplication method based on hierarchical feature vector searching according to some embodiments of the present disclosure;
FIG. 2 is an exemplary flow chart for obtaining a primary feature vector according to some embodiments of the present description;
FIG. 3 is an exemplary flow chart for obtaining secondary feature vectors according to some embodiments of the present description;
FIG. 4 is an exemplary flow chart for obtaining tertiary feature vectors according to some embodiments of the present description.
Detailed Description
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is an exemplary flow chart of a document duplicate checking method based on hierarchical feature vector search according to some embodiments of the present description: acquiring a document set; extracting word vectors from the acquired document set by a word frequency method as primary feature vectors; for each document, expanding the word vectors to phrase and sentence semantic features with a character-level Enhanced Word2Vec model, based on the primary feature vector, as secondary feature vectors; for each document, obtaining its topic distribution with an LDA topic model as a tertiary feature vector; building inverted indexes of the primary, secondary and tertiary feature vectors as an index library; inputting a document to be checked, extracting its primary, secondary and tertiary feature vectors in turn, and computing the similarity between the document to be checked and each document in the index library with a vector space model; and outputting the retrieval results in order of similarity from high to low.
FIG. 2 is an exemplary flow chart for obtaining a primary feature vector according to some embodiments of the present description. Obtaining the document set: the documents to be checked are collected and extracted from a database or file system, for example all history documents from a corporate knowledge base. The documents are preprocessed, including format conversion and removal of invalid content, e.g., converting doc format to txt format. The document set is de-duplicated and exact duplicates are deleted; duplication may be determined with MD5 or other hash values. The document set is randomly divided into training, validation and test sets, for example in an 8:1:1 ratio. The processed document set is saved as text files for subsequent topic model analysis, and basic statistics of the set, such as the number of documents and vocabulary size, are recorded. Examples of processed document sets: train.txt, valid.txt, test.txt.
Local word frequency statistics: for each document, the frequency (TF1) of each word in that document is counted. Each document is segmented to obtain a word list; a Chinese word segmentation tool such as HanLP or jieba may be used. The preprocessed document content is loaded as a string, the segmentation tool splits it into a word list words = [word1, word2, ..., wordn], and an empty dictionary wf = {} is created to store the mapping from words to frequencies. The word list is traversed and, for each word w, wf[w] = 1 if w is not yet in wf, otherwise wf[w] += 1. This yields the document's word frequency dictionary wf = {word1: freq1, word2: freq2, ...}, which is returned as the word frequency statistics of the document. Thus, by traversing the word list and counting into a dictionary, the frequency TF1 of every word in the document is obtained. For document doc1, the segmentation result is words = [word1, word2, ..., wordn] and the word frequency statistics are wf = {word1: freq1, word2: freq2, ...}. To build the word frequency vector, an empty list vec = [] is initialized, the statistics wf are traversed, and for each word w with frequency f a pair (w, f) is created and appended to the vector with vec.append(pair); vec is then converted to dictionary form {word1: freq1, word2: freq2, ...}, giving the word frequency vector of document doc1: doc1 = [word1: freq1, word2: freq2, ...]. The word frequency vector is persisted by writing it to a file or database.
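A sketch of the TF1 statistics with the jieba segmenter mentioned above (the helper name and example text are illustrative):

```python
import jieba
from collections import Counter

def local_term_freq(text):
    """TF1 statistics for one document: segment with jieba, then count."""
    words = list(jieba.cut(text))   # word list: [word1, word2, ...]
    wf = Counter(words)             # word frequency dictionary {word: frequency}
    return words, dict(wf)

# usage (illustrative text):
# words, wf = local_term_freq("基于层级特征向量检索的文档查重方法")
```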
Global word frequency statistics: the word frequency (TF2) of each word across the whole document set is counted. The preprocessed document set is loaded as a document list docs = [doc1, doc2, ...], where each document has already been segmented. An empty dictionary gf = {} is created to count the global frequencies of all words in the document set. The document list docs is traversed, and for each document doc and each word w in doc: if w is not yet in gf, set gf[w] = 1; if w is already in gf, set gf[w] += 1. This yields the global word frequency statistics of the document set, gf = {word1: global freq1, word2: global freq2, ...}, which are returned. An example of gf: gf = {"learning": 5, "let": 3, "happy": 2, ...}.
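The local and global word frequency statistics described above can be sketched as follows, assuming jieba is used as the Chinese word segmentation tool; the function names are illustrative.

```python
import jieba

def local_word_freq(text):
    # Word frequency TF1 of each word within a single document.
    wf = {}
    for w in jieba.lcut(text):
        wf[w] = wf.get(w, 0) + 1
    return wf

def global_word_freq(doc_word_lists):
    # Word frequency TF2 of each word across the whole document set;
    # doc_word_lists is a list of already-segmented documents.
    gf = {}
    for words in doc_word_lists:
        for w in words:
            gf[w] = gf.get(w, 0) + 1
    return gf
```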
Smoothing: the smoothed word frequencies are obtained by combining Laplace smoothing and back-off smoothing. Inputs are the word frequency statistics of document doc, wf = {word1: freq1, word2: freq2, ...}, and the global word frequency statistics of the document set, gf = {word1: global freq1, ...}. Example: document word frequencies wf = {'learning': 5, 'happy': 3, 'knowledge': 0, ...}, total word count of the document N = 20, number of unique words M = 10, global word frequencies gf = {'learning': 80, 'knowledge': 20, ...}, back-off coefficient α = 0.5. Laplace smoothing: smoothed frequency of 'learning' = (5+1)/(20+10) = 0.2; smoothed frequency of 'happy' = (3+1)/(20+10) ≈ 0.13. Back-off smoothing for the zero-frequency word: smoothed frequency of 'knowledge' = α × gf['knowledge']/N = 0.5 × 20/20 = 0.5. The word frequencies are updated accordingly, wf = {'learning': 0.2, 'happy': 0.13, 'knowledge': 0.5, ...}, and the smoothed word frequencies of document doc are returned. Combining Laplace smoothing with back-off smoothing in this way effectively reduces the influence of zero word frequencies on the result.
Weighted combination: the smoothed word frequencies are combined with weights to obtain the primary feature vector. Inputs are the local word frequency statistics wf1 of document doc, the global word frequency statistics wf2 of the document set, and their smoothed versions swf1 and swf2. For each word w: first-level word frequency = weight1 × swf1[w] + weight2 × swf2[w], where weight1 > weight2 is typically set so that the local frequency dominates. Based on these first-level word frequencies, the primary feature vector of document doc is constructed: vec1 = [word1: first-level freq1, word2: first-level freq2, ...]. The primary feature vector of doc is persisted for subsequent use.
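A minimal sketch of the smoothing and weighted-combination steps is given below; the weights 0.7/0.3 and the back-off coefficient α = 0.5 are illustrative values chosen for the example, not values prescribed by the method.

```python
def laplace_smooth(wf, total_words, vocab_size):
    # Laplace smoothing: (count + 1) / (document length + number of unique words).
    return {w: (f + 1) / (total_words + vocab_size) for w, f in wf.items()}

def backoff_smooth(words, gf, total_words, alpha=0.5):
    # Back-off smoothing: fall back to the scaled global frequency,
    # which gives zero-frequency words a non-zero estimate.
    return {w: alpha * gf.get(w, 0) / total_words for w in words}

def first_level_vector(swf1, swf2, w1=0.7, w2=0.3):
    # Weighted combination with the local term weighted higher (weight1 > weight2).
    vec1 = {}
    for w in set(swf1) | set(swf2):
        vec1[w] = w1 * swf1.get(w, 0.0) + w2 * swf2.get(w, 0.0)
    return vec1
```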
FIG. 3 is an exemplary flow chart for acquiring the secondary feature vector, according to some embodiments of the present description. Word vector acquisition: based on the primary feature vectors, a Word2Vec model is used to obtain the word vector of each word. A set of text documents is imported, for example 100 documents [doc1, doc2, ...]. Each document is segmented and preprocessed, e.g. stop words are removed, giving the preprocessed document contents [doc1_content, doc2_content, ...]. Word frequency statistics are computed for each preprocessed document to obtain a word frequency dictionary wf, which is smoothed into swf, and the primary feature vector is constructed from swf, for example doc1_vec1 = [word1: swf1, word2: swf2, ...]. The preprocessed document contents and the primary vectors are stored to files. All preprocessed document contents are concatenated into one large text corpus, e.g. corpus = doc1_content + doc2_content + ... + docn_content. A Word2Vec module is imported and a skip-gram model is defined, with parameters such as word vector dimension, window size and number of negative samples. The corpus is fed into the skip-gram model, the word vectors are trained and iteratively updated, and a trained word vector vocabulary {word1: vec1, word2: vec2, ...} is obtained. For a desired word, its word vector is looked up from this vocabulary, e.g. word1_vec = model.wv['word1']. The pre-trained Word2Vec model and the word vector vocabulary are returned. Given the primary feature vector of doc1, doc1_vec1 = {word1: freq1, word2: freq2, ...}, and the pre-trained word vector dictionary word_vecs = {word1: vec1, word2: vec2, ...}, the secondary feature vector is initialised as doc1_vec2 = []; each word in doc1_vec1 is traversed, its word vector vec is looked up in word_vecs, and (word, vec) is appended to doc1_vec2. This yields the secondary feature vector of doc1: doc1_vec2 = [(word1, vec1), (word2, vec2), ...], which is returned.
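The skip-gram training and word vector lookup described above can be sketched with gensim as follows; the vector dimension, window size and negative-sample count are illustrative parameter choices.

```python
from gensim.models import Word2Vec

def train_word_vectors(tokenized_docs):
    # tokenized_docs: list of documents, each a list of already-segmented words.
    model = Word2Vec(
        sentences=tokenized_docs,
        vector_size=100,   # word vector dimension
        window=5,          # context window size
        sg=1,              # skip-gram architecture
        negative=5,        # number of negative samples
        min_count=1,
    )
    return model

# Look up a single word vector and build a document's secondary representation
# as (word, vector) pairs, as in the doc1_vec2 example above.
# model = train_word_vectors(tokenized_docs)
# word1_vec = model.wv["word1"]
# doc1_vec2 = [(w, model.wv[w]) for w in doc1_words if w in model.wv]
```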
Semantic word vector generation: the semantic weight of each word is computed by constructing a semantic and part-of-speech association tree between words, yielding semantic word vectors. Semantic association relations between words, such as synonyms and antonyms, are collected together with part-of-speech tagging information, and a word relation tree is constructed to represent the semantic and part-of-speech associations between words. From the collected semantic relations an association network graph is built; all words are traversed and each initial weight is set to w[word] = 1. Several iterations are then run over the graph: for each word, its neighbour set N(word) is obtained from the graph and its weight is updated as w[word] = α × Σ w[n] / |N(word)|, where the sum runs over the neighbours n in N(word), α is a weight attenuation coefficient and |N(word)| is the number of neighbouring words. After several iterations the weight w[word] of each word in the semantic relation network is obtained; it reflects the semantic importance of the word. Using w[word] as the value of the word vector, a semantic word vector is constructed. Example: word set words = ['apple', 'banana', 'orange', ...], semantic weight table weight = {'apple': 0.8, 'banana': 0.7, 'orange': 0.5, ...}. An initial zero vector is generated for each word: vector = {'apple': [0, 0, ...], 'banana': [0, 0, ...], ...}. Each word is traversed, its semantic weight weight[word] is obtained, and the weight value is written into the first element of the word's zero vector: vector[word][0] = weight[word]. This yields the semantic word vector table vector = {'apple': [0.8, 0, ...], 'banana': [0.7, 0, ...], ...}, and the semantic word vector representations of the words are returned.
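A minimal sketch of the iterative weight propagation over the word association network is shown below, assuming the network is held as a simple adjacency dictionary; the attenuation coefficient α = 0.85 and the iteration count are illustrative values.

```python
def semantic_weights(graph, alpha=0.85, iterations=10):
    # graph: dict mapping each word to the list of its neighbouring words
    # in the semantic association network.
    w = {word: 1.0 for word in graph}  # initial weight w[word] = 1
    for _ in range(iterations):
        new_w = {}
        for word, neighbours in graph.items():
            if neighbours:
                # w[word] = alpha * (sum of neighbour weights) / (number of neighbours)
                new_w[word] = alpha * sum(w[n] for n in neighbours) / len(neighbours)
            else:
                new_w[word] = w[word]
        w = new_w
    return w

# The resulting weight is written into the first component of each word's
# semantic word vector, as in the apple/banana/orange example above.
# semantic_vectors = {word: [weight] + [0.0] * 99 for word, weight in weights.items()}
```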
Context learning: an Enhanced Word2Vec model is used for context learning to obtain vector representations of phrases and sentences, which form the secondary feature vector. The text data is preprocessed (word segmentation, part-of-speech tagging, etc.) to obtain an annotated corpus, which is imported as corpus. A sliding window of size win_size is defined, for example 5, and for each sentence sent in corpus: all words of the sentence are obtained and traversed with the sliding window, where the center word is words[i] and the context words are words[i-win_size:i] + words[i+1:i+win_size]. The center word and its context words form a training sample: sample = (center word, context word1, context word2, ...). All training samples obtained by sliding-window sampling are returned and fed into Word2Vec for training. A Word2Vec module is imported from gensim; the Word2Vec class generates the word vector layer, with parameters such as word vector dimension and window size, and the Doc2Vec class generates the context vector layer, with parameters such as context vector dimension and number of negative samples. An Enhanced Word2Vec class is defined that contains both the word vector layer and the context vector layer; the two layers are trained jointly in the fit() method. The word vector layer learns the word vector representation of the input word, reflecting the semantic information of the word itself; the context vector layer learns the vector of the input context, reflecting the semantic information of the context. By combining the word vector layer and the context vector layer in this way, the contextual expressiveness of the word vectors is enhanced.
The Enhanced Word2Vec class built on gensim is imported, and training samples are constructed by sliding-window sampling in the form (center word, context word1, context word2, ...). The Enhanced Word2Vec model is defined and the training samples are fed in; the objective function is optimised iteratively, learning the center word vectors and the context vectors. The center word vector reflects the semantics of the word itself, the context vector reflects the contextual semantic information, and the center word vector acquires contextual semantics through interaction with the context word vectors. After training, enhanced word vectors are obtained that aggregate contextual information and have stronger semantic expressiveness. The trained Enhanced Word2Vec model is persisted.
The pre-trained Enhanced Word2Vec model is loaded from file, the new input text is segmented into a word list words, and the model generates a corresponding word vector for each word in words, giving a word vector list vecs = [vec1, vec2, ...]. All word vectors of the text are averaged to obtain the text context vector: context_vec = mean(vecs). context_vec is used as the secondary feature vector of the text, and the text name together with its secondary feature vector is stored. The enhanced text context vector representation is thus obtained as the secondary feature vector.
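The averaging step that produces the secondary feature vector of a new text can be sketched as follows, assuming jieba segmentation and a gensim-style .wv word vector lookup on the loaded model.

```python
import numpy as np
import jieba

def text_context_vector(text, model, dim=100):
    # Segment the new input text and look up a vector for each in-vocabulary word.
    words = jieba.lcut(text)
    vecs = [model.wv[w] for w in words if w in model.wv]
    if not vecs:
        return np.zeros(dim)
    # context_vec = mean(vecs): the averaged word vectors form
    # the secondary feature vector of the text.
    return np.mean(vecs, axis=0)
```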
FIG. 4 is an exemplary flow chart for obtaining the tertiary feature vector, according to some embodiments of the present description. Document preprocessing: the MIX word segmentation method is adopted for word segmentation, and a conditional random field (CRF) is used for part-of-speech tagging. The text document corpus to be processed is imported, and for each sentence sent in corpus: a candidate segmentation result forward is obtained by forward maximum matching, a candidate result backward is obtained by backward maximum matching, and the final segmentation result seg_sent is generated by combining forward and backward. CRF training samples are built from a manually annotated corpus in the format (words, labels). A CRF++ tool is imported, a feature template is defined, and the CRF model is trained with the annotated samples, giving a CRF model for part-of-speech tagging. For each word of seg_sent, the part of speech is predicted with the CRF model, yielding sentences with part-of-speech tags.
TF-IDF weight calculation: the TF-IDF weights of the words in the corresponding document are calculated and smoothed with a Dirichlet prior smoothing method. The word frequency TF(t, d) of word t in document d is counted, and the TF weight is TF(t, d). The number of documents df(t) containing word t in the whole corpus is counted, and the IDF weight is log[(N+1)/(df(t)+1)] + 1, so that TF-IDF(t, d) = TF(t, d) × IDF(t). Each document is assigned a smoothing parameter μ via the Dirichlet prior, and the smoothed TF' is calculated as TF'(t, d) = [TF(t, d) + μ] / [Σ TF(w, d) + μ], where the sum runs over all words w in d. The smoothed TF-IDF(t, d) = TF'(t, d) × IDF(t), and the smoothed TF-IDF weight value of word t in document d is output.
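A minimal sketch of the Dirichlet-prior-smoothed TF-IDF weight defined above; the smoothing parameter μ = 100 is an illustrative value.

```python
import math

def smoothed_tf_idf(tf, doc_len, df, n_docs, mu=100.0):
    # tf: raw count of term t in document d; doc_len: sum of all term counts in d;
    # df: number of documents containing t; n_docs: number of documents in the corpus.
    idf = math.log((n_docs + 1) / (df + 1)) + 1   # IDF = log[(N+1)/(df+1)] + 1
    tf_smoothed = (tf + mu) / (doc_len + mu)      # TF' = (TF + mu) / (sum TF + mu)
    return tf_smoothed * idf                      # smoothed TF-IDF = TF' * IDF
```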
LDA topic model: the document set is preprocessed, a document-word matrix is established, the topic distribution of each document is acquired through the LDA model, and the top N topics are selected as the tertiary feature vector of the document. The document set corpus, containing several documents {doc1, doc2, ...}, is imported. Each document doc is segmented into a word list words, the words are filtered against a stop word list to remove stop words, the occurrence frequency of each word is counted to form a word frequency dictionary word_freq for each document, and the words are arranged in vocabulary order to form the document-word frequency matrix D-W. The D-W matrix of the document set is serialised and saved to a file.
An LDA model is imported from sklearn and the number of topics k is set as required, for example k = 20. A document-count × topic-count matrix D-T is created, in which each row represents the distribution of one document over the k topics, with all elements initialised to 1/k. A topic-count × vocabulary-size matrix T-W is created, in which each row represents the distribution of one topic over all words, with all elements initialised to 1/vocabulary size. It is confirmed that the D-T and T-W matrices have the correct shapes: the number of D-T rows equals the size of the document set, and the number of T-W rows equals the number of topics k. The initialised D-T and T-W are then input into the LDA model for training.
The document-topic distribution matrix D-T and the topic-word distribution matrix T-W are initialised. E step: for each document d, the probability that document d contains each topic t is calculated from the current T-W, and the distribution probability value of document d over topic t in D-T is updated. M step: for each topic t, the probability of each word w under topic t is calculated from the current D-T, and the distribution probability value of topic t over word w in T-W is updated. The E step and M step are repeated, iteratively updating D-T and T-W until convergence or until the maximum number of iterations is reached, which yields the optimised document-topic distribution matrix D-T. In other words, the document-word frequency matrix D-W is input and the LDA model is trained iteratively to obtain the optimised document-topic matrix D-T. The topic distribution vector d_topic of each document d is extracted from D-T; d_topic represents the probability distribution of document d over all topics. The d_topic vector is sorted, the N topics with the highest probability are retained, and the top-N topic subset top_topics of document d is selected. An N-dimensional topic vector of document d is constructed from the indexes of the topics in top_topics, e.g. [2, 5, 10] represents the topic subset of document d. The N-dimensional topic vector of document d is saved to a file, finally giving low-dimensional topic feature vector representations of all documents.
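A sketch of this step using scikit-learn is given below: it builds the document-word frequency matrix D-W, trains LDA with k = 20 topics, and keeps the indexes of the N most probable topics per document, as in the [2, 5, 10] example above. The choice of N = 3 and the CountVectorizer token pattern are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_features(segmented_docs, k=20, top_n=3):
    # segmented_docs: each document as a string of space-joined, segmented words.
    # token_pattern keeps single-character tokens, which matters for Chinese text.
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    dw = vectorizer.fit_transform(segmented_docs)  # document-word matrix D-W

    lda = LatentDirichletAllocation(n_components=k, max_iter=50, random_state=0)
    dt = lda.fit_transform(dw)                     # document-topic matrix D-T

    # Keep the indexes of the N highest-probability topics per document,
    # e.g. [2, 5, 10] as the topic subset of a document.
    return [np.argsort(row)[::-1][:top_n].tolist() for row in dt]
```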
Establishing the inverted index library: inverted indexes are established for the primary, secondary and tertiary feature vectors respectively. First-level inverted index: the primary feature vector is built from terms and TF-IDF weights, and an inverted index from term to documents is established in the format {word1: [document1, document2, ...], word2: [document2, ...]}. Second-level inverted index: the secondary feature vector is built from the document vectors and their similarities, and an inverted index from document to documents is established in the format {document1: [similar document1, similar document2, ...], document2: [similar documents, ...]}. Third-level inverted index: the tertiary feature vector is built from the document topic vectors, and an inverted index from topic to documents is established in the format {topic1: [related document1, related document2, ...], topic2: [related documents, ...]}. The constructed inverted indexes of the three levels of features are imported: the first-level index maps words to documents, the second-level index maps documents to documents, and the third-level index maps topics to documents. A unified key-value pair structure is defined for every level of index, where the key is the query feature (term, document, or topic) and the value is the list of related documents. The indexes of every level are converted into this unified key-value format and merged by key, with the keys being query features and the values being document lists; the document lists are de-duplicated. A search library is constructed from the merged inverted index and saved to the database in serialised form.
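The merging of the three inverted indexes into one key-value search library, with de-duplicated document lists, can be sketched as follows.

```python
def merge_inverted_indexes(level1, level2, level3):
    # level1: word -> documents, level2: document -> similar documents,
    # level3: topic -> documents. All three share the key/value structure
    # (query feature -> list of related documents).
    merged = {}
    for index in (level1, level2, level3):
        for key, doc_list in index.items():
            merged.setdefault(key, [])
            for doc in doc_list:
                if doc not in merged[key]:  # de-duplicate the document list
                    merged[key].append(doc)
    return merged

# Example usage with the formats shown above:
# level1 = {"word1": ["doc1", "doc2"]}
# level2 = {"doc1": ["doc3"]}
# level3 = {"topic1": ["doc1", "doc4"]}
# search_library = merge_inverted_indexes(level1, level2, level3)
```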
Inputting the document to be checked: the corresponding primary, secondary and tertiary feature vectors are extracted. The document doc to be checked is input and segmented, word frequency information is extracted, and the TF-IDF weights of the words are calculated as the primary features. The word vectors of doc are generated with the trained Word2Vec model and averaged to obtain the document vector as the secondary feature. The topic distribution of doc is inferred with the trained LDA model, and the N most probable topics are selected as the tertiary features. The primary, secondary and tertiary feature vectors extracted from doc are checked, used for the similar-document query, and the list of documents similar to doc is returned.
Similarity calculation: the similarity between the document to be checked and each document in the index library is calculated through a vector space model. Vector computation modules such as numpy are imported; the feature vector a of document A and all feature vectors b in the index library document set B are input. a is normalised by its L2 norm, a_norm = a / np.linalg.norm(a), and likewise each b in B: b_norm = b / np.linalg.norm(b). The normalised a_norm is taken as the normalised vector of A, and all b_norm in B form the normalised vector set. Using the normalised vectors a_norm and b_norm in the similarity calculation removes the influence of vector length, so only the similarity of the vector directions is measured. The normalised vectors are stored to a file or database.
Given the normalised vector a_norm of document A and the normalised vector b_norm of document B, the inner product of the two vectors is computed: dot = np.dot(a_norm, b_norm). The vector lengths are computed as L2 norms, ||a|| = np.linalg.norm(a_norm) and ||b|| = np.linalg.norm(b_norm), and the cosine similarity is calculated as sim = dot / (||a|| × ||b||). The value of sim lies in the range [0, 1], where 1 indicates complete similarity. The result sim is stored in the database as the similarity between A and B.
Sorting and outputting results: the retrieval results are output in order of similarity from high to low. The similarity matrix sim_matrix between document A and all documents in the index library is loaded, and each row of sim_matrix is sorted in descending order to obtain the sorted similarity matrix sorted_matrix between document A and the other documents. For each row of sorted_matrix, the top N indexes and similarities are extracted as the Top-N similar document list of document A. This ranking and output is repeated for every document in the index library, and the Top-N similar documents of all documents are saved in a database table. A list of similar documents is displayed on the front-end page according to the ranking result, and clicking a document shows its detailed content.
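The normalisation, cosine similarity and Top-N ranking described in this and the two preceding paragraphs can be sketched with numpy as follows; the index library is assumed to be a dictionary from document identifiers to feature vectors.

```python
import numpy as np

def top_n_similar(a, index_vectors, n=10):
    # L2-normalise the query vector of document A.
    a_norm = a / np.linalg.norm(a)
    sims = []
    for doc_id, b in index_vectors.items():
        b_norm = b / np.linalg.norm(b)
        # Cosine similarity of the normalised vectors reduces to a dot product.
        sims.append((doc_id, float(np.dot(a_norm, b_norm))))
    # Sort from high to low similarity and keep the Top-N retrieval results.
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:n]
```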
Constructing the index library: an index library is built in the distributed index system Elasticsearch. The Elasticsearch service is downloaded and installed, and the Elasticsearch cluster is started. An index name is defined, such as doc_index, index-related parameters such as the number of shards and the number of replicas are set, and the document index is created through the Elasticsearch API. The preprocessed document contents are imported into the index in batches, and the document field mappings are specified. A custom analyzer is configured in the index to support Chinese word segmentation, synonym recognition and the like. An alias is created for doc_index to facilitate index management, such as doc_index_alias. The index and search are tested through Kibana or the REST API, the shard and memory configuration is adjusted as needed, and Elasticsearch automatically persists the index data.
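The Elasticsearch index construction can be sketched with the official Python client as follows; the index name doc_index, the alias doc_index_alias and the shard/replica numbers follow the description above, while the ik_max_word analyzer and the exact keyword arguments (which differ between client versions) are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the document index with shard/replica settings and an analyzer
# for Chinese word segmentation (the IK plugin's ik_max_word is assumed here).
es.indices.create(
    index="doc_index",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
    mappings={"properties": {"content": {"type": "text", "analyzer": "ik_max_word"}}},
)

# Import preprocessed documents into the index (simplified, one at a time;
# preprocessed_docs is a hypothetical list of document strings).
# for i, text in enumerate(preprocessed_docs):
#     es.index(index="doc_index", id=i, document={"content": text})

# Create an alias to simplify index management.
es.indices.put_alias(index="doc_index", name="doc_index_alias")
```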
The application and its embodiments have been described above schematically, and the description is not limiting; the application may be implemented in other specific forms without departing from its spirit or essential characteristics. The drawings depict only one embodiment of the application, and the actual construction is not limited to what is shown; any reference numeral in the claims shall not limit the claims. Therefore, if a person of ordinary skill in the art, informed by this disclosure, devises a structural manner or an embodiment similar to the technical scheme without creative effort, it does not depart from the gist of the present application. In addition, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The various elements recited in the product claims may also be implemented in software or hardware. The terms first, second, etc. are used to denote names and do not imply any particular order.

Claims (6)

1. A document duplicate checking method based on hierarchical feature vector search comprises the following steps:
Acquiring a document set;
extracting word vectors from the multiple acquired document sets by using a word frequency method to serve as primary feature vectors;
Based on the first-level feature vector of each document, expanding word vectors to phrase and sentence semantic features based on a character-level Enhanced Word2Vec model to serve as second-level feature vectors;
For each document, an LDA topic model is adopted to acquire topic distribution of the document as a three-level feature vector;
Respectively establishing inverted indexes of the primary feature vector, the secondary feature vector and the tertiary feature vector as an index library;
inputting a document to be checked, sequentially extracting corresponding primary feature vectors, secondary feature vectors and tertiary feature vectors, and calculating the similarity between the document to be checked and each document in an index library through a vector space model;
outputting search results according to the high-to-low sequence of the similarity;
Obtaining the first-level feature vector further comprises:
For each document, respectively counting word frequency TF1 of each vocabulary in the corresponding document as a local word frequency, and counting word frequency TF2 of each vocabulary in a document set as a global word frequency;
the method comprises the steps of adopting Laplace smoothing to the acquired word frequency TF1 and adopting rollback smoothing to the acquired word frequency TF 2;
Weighting and combining the smoothed word frequencies TF1 and TF2 to serve as a first-level feature vector;
obtaining the secondary feature vector further comprises:
according to the extracted first-level feature vector, a Word2Vec model is adopted to obtain Word vectors of all words;
According to the word vectors, constructing semantic and part-of-speech association trees between words, wherein the association trees represent vocabulary semantic relations;
Calculating the semantic weight of each word through a structured scoring algorithm according to the association tree;
Splicing the word vector and the semantic weight to obtain a semantic word vector;
according to the semantic word vector, carrying out context learning by adopting the Enhanced Word2Vec model to obtain vector representations of phrases and sentences;
Outputting a secondary feature vector comprising vector representations of words, phrases and sentences;
according to the association tree, calculating the semantic weight of each word through a structured scoring algorithm, wherein the semantic weight comprises the following steps:
acquiring an association tree, wherein nodes in the association tree are words, and edges represent semantic relations among the words;
Calculating cosine similarity among word vectors, setting words with cosine similarity larger than a threshold value as father nodes, and setting words with cosine similarity smaller than the threshold value as child nodes;
Starting from the root node, setting the weight of each node downwards in sequence, wherein the weight of the root node is w_0 = 1, and the weight of a node at the n-th layer is w_n = w_0 × 0.5^n;
For each word, taking its corresponding father node and child nodes in the association tree as objects, calculating the semantic relativity of the word with the father node and with the child nodes by a character string matching algorithm, denoted p_relativity and c_relativity respectively;
the semantic weight of each term is calculated by the following formula:
word_weight = w_n × (α × e^(−λd) × p_relativity + β × e^(−λd) × c_relativity)
wherein word_weight represents the semantic weight of the word, w_n represents the node weight of the word in the association tree, α represents the father-node correlation weight coefficient, β represents the child-node correlation weight coefficient, p_relativity represents the semantic correlation of the word with its father node, c_relativity represents the semantic correlation of the word with its child nodes, λ represents the distance attenuation coefficient controlling the influence of distance on the correlation, and d represents the distance between the word and its father and child nodes in the association tree;
Obtaining a three-level feature vector, comprising:
Preprocessing the collected document set D to obtain a preprocessed document set D';
Based on the document set D', a document-word matrix M is established, wherein the rows of the matrix M represent corresponding documents d, the columns of the matrix represent corresponding words w, and the matrix elements represent the TF-IDF weights of the words w in the corresponding documents d;
presetting the number of topics k, expressing the documents d in the document set D' as word frequency vectors, and acquiring the document topic distribution θ and the topic word distribution φ through an LDA model;
For each document d, the first N topic subsets are selected as the tertiary feature vector of the document d according to the corresponding document topic distribution θ_d.
2. The document duplicate checking method of claim 1, characterized in that:
the pretreatment comprises the following steps: the MIX word segmentation method is adopted for word segmentation, and the conditional random field CRF is adopted for part-of-speech tagging.
3. The document duplicate checking method of claim 1, characterized in that:
The TF-IDF weight of the word in the corresponding document is smoothed by using a Dirichlet prior smoothing method.
4. The document duplicate checking method of claim 1, characterized in that:
Acquiring the document topic distribution θ and the topic word distribution φ comprises the following steps:
constructing a bag-of-words model representation, using word frequency vectors, for each document d in the document set D';
generating the topic distribution θ_d of document d through a Dirichlet process;
generating the word distribution of document d through a Dirichlet process;
randomly selecting a topic z according to the topic distribution θ_d of document d, and then randomly generating a word w according to the word distribution of topic z;
calculating the generation probability of the word frequency vector of each document d based on the corresponding topic distribution θ_d and word distribution;
iteratively calculating the document topic distribution θ_d of each document d and the word distribution φ_z of each topic z using a Gibbs sampling algorithm, so as to maximize the generation probability;
after the iteration ends, combining the topic distributions θ_d of all documents d to form the document topic distribution θ of the document set D', and combining the word distributions φ_z of all topics z to form the topic word distribution φ of the document set D'.
5. The document duplicate checking method of claim 4, characterized in that:
the condition for ending the iteration is convergence of the perplexity, the perplexity being the log-likelihood probability over the document set D';
the perplexity is converged when the difference between the perplexities of two adjacent steps in the iteration process is smaller than a threshold value.
6. The document duplicate checking method according to any one of claims 1 to 5, characterized in that:
Constructing an index library, comprising:
An index is established in the distributed index system Elasticsearch.


