CN114580557A - Document similarity determination method and device based on semantic analysis - Google Patents

Document similarity determination method and device based on semantic analysis

Info

Publication number
CN114580557A
CN114580557A (application CN202210240186.2A)
Authority
CN
China
Prior art keywords
document
semantic analysis
keywords
word
compared
Prior art date
Legal status: Pending (an assumption, not a legal conclusion)
Application number
CN202210240186.2A
Other languages
Chinese (zh)
Inventor
程义
李峰
孙正茂
潘磊
杨长青
李君令
张尧尧
郭来中
孙伟
Current Assignee
Beijing Zhongzhi Zhihui Technology Co ltd
Original Assignee
Beijing Zhongzhi Zhihui Technology Co ltd
Application filed by Beijing Zhongzhi Zhihui Technology Co ltd
Priority to CN202210240186.2A
Publication of CN114580557A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The invention discloses a semantic analysis-based document similarity determination method and device, wherein the method comprises the following steps: dividing each document to be compared into a plurality of parts; performing semantic analysis on each part to obtain a semantic analysis result of each part; determining a weight value of each part of each document to be compared according to the semantic analysis result of each part; obtaining a weighted average result of each document to be compared according to the weight values of the parts of that document; and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared. The method and the device can accurately determine the weights of different parts of a document based on semantic analysis, and can thereby accurately determine document similarity.

Description

Document similarity determination method and device based on semantic analysis
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for determining document similarity based on semantic analysis.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In the prior art, when determining document similarity, different weights are preset for the different parts of a document according to manual experience, and the document similarity is finally determined as the weighted sum of the similarities of those parts under the manually set fixed weights. Because this existing method sets the weights by experience, the weight setting is inaccurate, and consequently the determined document similarity is inaccurate as well.
Disclosure of Invention
The embodiment of the invention provides a document similarity determination method based on semantic analysis, which is used to accurately determine the weights of different parts of a document based on semantic analysis and thereby accurately determine document similarity. The method comprises the following steps:
dividing each document to be compared into a plurality of parts;
performing semantic analysis on each part to obtain a semantic analysis result of each part;
determining a weight value of each part of each document to be compared according to the semantic analysis result of each part;
obtaining a weighted average result of each document to be compared according to the weight value of each part of each document to be compared;
and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
The embodiment of the invention also provides a document similarity determination device based on semantic analysis, which is used to accurately determine the weights of different parts of a document based on semantic analysis and thereby accurately determine document similarity. The device comprises:
a dividing unit for dividing each document to be compared into a plurality of parts;
the semantic analysis unit is used for carrying out semantic analysis on each part to obtain a semantic analysis result of each part;
a weight value determining unit for determining a weight value of each part of each document to be compared according to a semantic analysis result of each part;
the processing unit is used for obtaining a weighted average result of each document to be compared according to the weight value of each part of each document to be compared;
and the similarity determining unit is used for determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
The embodiment of the invention also provides a computer device, comprising a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor implements the above semantic analysis-based document similarity determination method when executing the computer program.
The embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above semantic analysis-based document similarity determination method.
The embodiment of the invention further provides a computer program product comprising a computer program which, when executed by a processor, implements the above semantic analysis-based document similarity determination method.
In the embodiment of the invention, in contrast to the prior-art scheme in which different fixed weights are preset by experience for different parts of a document in order to determine document similarity (a scheme in which inaccurate weight setting makes the determined similarity inaccurate as well), the similarity is determined by: dividing each document to be compared into a plurality of parts; performing semantic analysis on each part to obtain a semantic analysis result of each part; determining a weight value of each part of each document to be compared according to the semantic analysis result of each part; obtaining a weighted average result of each document to be compared according to the weight values of the parts of that document; and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared. The weights of the different parts of a document can thus be accurately determined based on semantic analysis, so that document similarity can be accurately determined.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort. In the drawings:
FIG. 1 is a schematic flow chart of a document similarity determination method based on semantic analysis according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating semantic analysis performed on each part to obtain a semantic analysis result of each part according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating semantic analysis performed on each part to obtain a semantic analysis result of each part according to another embodiment of the present invention;
FIG. 4 is a schematic flow chart of document preprocessing in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a document similarity determination apparatus based on semantic analysis according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a semantic analysis unit according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a semantic analysis unit according to another embodiment of the present invention;
FIG. 8 is a diagram illustrating a Chinese word segmentation training process in the professional field according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Fig. 1 is a schematic flow chart of a document similarity determination method based on semantic analysis in an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step 101: dividing each document to be compared into a plurality of parts;
step 102: performing semantic analysis on each part to obtain a semantic analysis result of each part;
step 103: determining a weight value of each part of each document to be compared according to the semantic analysis result of each part;
step 104: obtaining a weighted average result of each document to be compared according to the weight value of each part of each document to be compared;
step 105: and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
The document similarity determining method based on semantic analysis provided by the embodiment of the invention works as follows: dividing each document to be compared into a plurality of parts (which can be called sub-documents); performing semantic analysis on each part to obtain a semantic analysis result of each part; determining a weight value of each part of each document to be compared according to the semantic analysis result of each part; obtaining a weighted average result of each document to be compared according to the weight value of each part of each document to be compared; and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
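To make the overall flow concrete, the following minimal Java sketch illustrates steps 101 to 105 under stated assumptions: the class name, the placeholder per-part similarity (a simple token-overlap measure) and the example weights are illustrative only and are not taken from the patent; in the actual method the weights come from the semantic analysis of each part.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of steps 101-105: split documents into parts, weight each part,
// and take the weighted average of per-part similarities.
public class WeightedDocSimilarity {

    // Placeholder per-part similarity (token overlap / Jaccard). In the patent,
    // this value would be derived from the semantic analysis results of each part.
    static double partSimilarity(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // weights[i] is the weight of part i determined by semantic analysis; both
    // documents are assumed divided into the same parts (title, abstract, ...).
    static double similarity(String[] partsA, String[] partsB, double[] weights) {
        double weightedSum = 0.0, weightTotal = 0.0;
        for (int i = 0; i < partsA.length; i++) {
            weightedSum += weights[i] * partSimilarity(partsA[i], partsB[i]);
            weightTotal += weights[i];
        }
        return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        String[] a = {"document similarity determination method", "semantic analysis of document parts"};
        String[] b = {"document similarity determination device", "semantic analysis of document sections"};
        double[] w = {0.7, 0.3}; // example weights, as if produced by steps 102-103
        System.out.printf("similarity = %.3f%n", similarity(a, b, w));
    }
}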
Compared with the prior-art scheme in which different fixed weights are preset by experience for different parts of a document in order to determine document similarity, with the attendant problem that inaccurate weight setting makes the determined similarity inaccurate, the semantic analysis-based document similarity determination method provided by the embodiment of the invention accurately determines the weights of the different parts of a document based on semantic analysis, and thereby accurately determines document similarity. The method is described in detail below.
First, the above step 101 is described.
In practice, the documents in the embodiments of the present invention may be patent documents, trademark documents, non-patent literature, and the like. Taking patent documents as an example, each document to be compared may be divided into parts such as the title of the invention, the abstract, the claims and the specification.
Next, the above step 102 is described.
In one embodiment, as shown in fig. 2, performing semantic analysis on each part to obtain a semantic analysis result of each part may include the following steps:
step 1021: performing word segmentation processing on each part to obtain a plurality of keywords corresponding to each part;
step 1022: extracting a plurality of types of key features from each part according to a plurality of key words corresponding to each part and a preset document feature extraction strategy to form a feature set corresponding to each part;
step 1024: and performing semantic analysis on each part at a word level, a syntax level and a chapter level according to the characteristic set corresponding to each part to obtain a semantic analysis result of each part.
In specific implementation, step 1021 is the document word segmentation process: word segmentation is performed on the document according to a topic word library, deleting function words lacking practical meaning, rarely occurring low-frequency words and overly frequent high-frequency words, finally obtaining the plurality of keywords corresponding to each part.
In specific implementation, a pre-trained word segmentation model can be used for word segmentation. To better understand how the word segmentation process of the present invention is implemented, the word segmentation model is described below.
1. Evaluation indexes.
Chinese word segmentation adopts the accuracy (Precision) and recall (Recall) evaluation indexes, where:
Precision = number of correctly segmented words / total number of segmented words;
Recall = number of correctly segmented words / total number of words that should be segmented;
the comprehensive performance index is the F-measure:
Fβ = (β² + 1) × Precision × Recall / (β² × Precision + Recall);
β is a weight factor; if accuracy and recall are considered equally important, β is taken as 1, giving the most common F1-measure:
F1 = 2 × Precision × Recall / (Precision + Recall).
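As a small illustration of these evaluation indexes, the following Java sketch computes Precision, Recall and the F-measure from assumed counts (the numbers are invented for the example):

// Word segmentation evaluation: Precision, Recall and F-measure as defined above.
public class SegmentationMetrics {

    static double fMeasure(double precision, double recall, double beta) {
        double b2 = beta * beta;
        return (b2 + 1) * precision * recall / (b2 * precision + recall);
    }

    public static void main(String[] args) {
        int correct = 90;      // words segmented correctly
        int segmented = 100;   // total words output by the segmenter
        int goldTotal = 110;   // total words in the reference segmentation
        double p = (double) correct / segmented;
        double r = (double) correct / goldTotal;
        // beta = 1 weights precision and recall equally, giving F1
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", p, r, fMeasure(p, r, 1.0));
    }
}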
the following description will be given taking the word segmentation model as a chinese word segmentation model.
2. A Chinese word segmentation model training method.
Commonly used word segmentation methods fall into two categories: dictionary-based segmentation and statistics-based segmentation. Statistics-based segmentation has become the mainstream approach in recent years because it greatly improves on dictionary-based segmentation in ambiguity resolution and unknown-word recognition. Common statistical segmentation models include hidden Markov models, conditional random field models, maximum entropy models, neural network models, and the like. However, when the domain of the test corpus differs from that of the training corpus, segmentation accuracy and unknown-word recognition performance drop sharply. Therefore, when segmenting professional-domain text with a statistics-based method, a labeled training corpus must be produced for the corresponding domain. Labeling such corpora consumes considerable manpower and material resources, and at present few professional domains have completed labeling.
A domain term set embodies and carries the core knowledge of a subject domain, and a professional-domain dictionary is a collection of professional-domain terms. Construction methods for professional-domain dictionaries can be divided into: construction from dictionary resources, construction from a corpus with statistical methods, and construction by mining encyclopedia entries. The inventors propose constructing the professional-domain dictionary by combining dictionary resources with encyclopedias: new terms are acquired from encyclopedias, and completeness is improved with existing dictionary resources.
The specialized domain Chinese word segmentation training process can be as shown in FIG. 8:
(1) The domain dictionary is constructed by mining encyclopedia entries and combining existing dictionary resources.
(2) The segmentation result of the basic natural language processing model is taken as the primary segmentation result; reverse maximum matching segmentation is performed on this result with the domain dictionary, combined with ambiguity resolution rules, to obtain the secondary segmentation result.
After the segmentation result is obtained with the segmenter of the basic natural language processing model, reverse maximum matching is performed on it with the domain dictionary, which can improve the domain adaptability of the segmentation. The reverse maximum matching algorithm extracts character strings of the segmentation result from right to left and matches them in the domain dictionary; if a match succeeds, the string is segmented as a word. For example, take the sentence "The rock wool composite board is replaced by a benzene board composite board." The segmenter of the basic natural language processing model gives: "rock/wool/composite/board/is/replaced/by/benzene/board/composite/board". The domain dictionary contains "rock wool", "benzene board" (polystyrene board) and "composite board". After reverse maximum matching, the adjusted segmentation result is: "rock wool/composite board/is/replaced/by/benzene board/composite board".
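The following Java sketch illustrates reverse maximum matching against a domain dictionary in simplified form: it re-segments the raw character string from right to left rather than adjusting an existing segmentation, and it omits the disambiguation rules discussed below. The dictionary entries reproduce the rock wool / benzene board / composite board example above.

import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

// Reverse maximum matching: scan right to left, always taking the longest
// dictionary word ending at the current position; unmatched characters
// are emitted as single-character words.
public class ReverseMaxMatch {

    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        LinkedList<String> words = new LinkedList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(maxWordLen, end);
            while (len > 1 && !dict.contains(text.substring(end - len, end))) {
                len--; // shrink the candidate until it matches or is a single character
            }
            words.addFirst(text.substring(end - len, end));
            end -= len;
        }
        return words;
    }

    public static void main(String[] args) {
        // Domain dictionary of the example: rock wool, benzene board, composite board
        Set<String> dict = new HashSet<>(Arrays.asList("岩棉", "苯板", "复合板"));
        // "The rock wool composite board is replaced by a benzene board composite board."
        System.out.println(segment("岩棉复合板换为苯板复合板", dict, 4));
        // -> [岩棉, 复合板, 换, 为, 苯板, 复合板]
    }
}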
Directly adjusting the primary segmentation result by reverse maximum matching against the domain dictionary often runs into ambiguity problems. For example, for "after the work period expires", suppose both "work period" and "expires" are in the domain dictionary; there are then two possible adjustments, one merging "work period" and one merging "expires". The former is obtained directly with the reverse maximum matching method, yet the correct segmentation is the latter. A disambiguation algorithm therefore needs to be designed to improve segmentation accuracy.
The design principle is as follows. When the reverse maximum matching algorithm adjusts the primary segmentation result, three situations occur: (1) the primary segmentation result is kept unchanged; (2) words in the primary result are merged without affecting other words; (3) words in the primary result are re-segmented to form new words. The proposed disambiguation algorithm mainly targets case (3): it decides whether the character string currently matched in the domain dictionary should be merged into a new word. The disambiguation algorithm mainly considers the number of words before and after adjustment, the number of single-character words, and the positional features of characters within words. If the adjustment increases the number of words, no adjustment is made. If the adjustment increases the number of single-character words, no adjustment is made. The positional features of characters refer to six character tables concerning first, middle and tail characters: a table of first characters with high single-character word frequency (L1), a table of first characters with low single-character word frequency (L2), a table of middle characters that combine with the first character with high frequency (L3), a table of middle characters that combine with the tail character with high frequency (L4), a table of tail characters with high single-character word frequency (L5), and a table of tail characters with low single-character word frequency (L6).
3. Training process of the Chinese word segmentation model.
Let the primary segmentation result be S: a1a2a3/a4/.../aM-1aM. W0 denotes the word currently matched in the domain dictionary; HD(W0) denotes the first character of W0; TL(W0) denotes the last character of W0; LEN(W0) denotes the length of W0; minSub(S, W0) denotes the smallest subsequence of S containing W0 that starts and ends at segmentation boundaries; combine(S, W0) denotes the sequence obtained after merging W0 in S; LMatch(ai) denotes the word obtained by reverse maximum matching in the domain dictionary with ai as the starting point, such that the word's left segmentation point does not fall inside an existing word; new(S, W0) denotes the set of sequences of length 2 or more newly generated after merging W0 in S. For example, suppose the primary segmentation result is a1a2a3/a4a5/a6a7a8 and W0 is a5a6; then HD(W0) = a5, TL(W0) = a6, minSub(S, W0) = a4a5/a6a7a8, combine(minSub(S, W0), W0) = a4/a5a6/a7a8, and the leftmost end of combine(minSub(S, W0), W0) is the single character a4. If a2a3a4 is in the domain dictionary, LMatch(a4) is empty, because the left segmentation point would fall inside the existing word a1a2a3. new(S, W0) = {a7a8}.
When LEN(W0) is 3 and the matched word length is 3, whether single characters are generated on the left and right sides and the character position features are considered first; then the case where minSub(S, W0) consists of a single character and two words is considered; finally the change in the number of single characters and the change in the number of words are considered. The specific rules are as follows.
Rule 1: if the leftmost end of combine(minSub(S, W0), W0) is a single character, denoted aL, satisfying that LMatch(aL) is empty, and the rightmost end of combine(minSub(S, W0), W0) is not a single character, then W0 is merged if aL ∈ L1 and the number of words of combine(minSub(S, W0), W0) is less than or equal to that of minSub(S, W0).
Rule 2: if the rightmost end of combine(minSub(S, W0), W0) is a single character (denoted aR) and its leftmost end is not a single character, or the leftmost end of combine(minSub(S, W0), W0) is a single character (denoted aL) but LMatch(aL) is not empty, then W0 is merged if aR is in the L5 character table and the number of words of combine(minSub(S, W0), W0) is less than or equal to that of minSub(S, W0).
Rule 3: if minSub(S, W0) is A/BC and LMatch(A) is not empty, W0 is not merged, and the next match starts from A.
Rule 4: W0 is merged if the number of single characters in combine(minSub(S, W0), W0) is 0 and its number of words is not more than that of minSub(S, W0), while the sequences in new(S, W0) are in the NLPIR dictionary or the domain dictionary. If the leftmost end of combine(minSub(S, W0), W0) is a single character aL and LMatch(aL) is not empty, aL is not counted as a single character.
Rule 5: if W0 is AB and minSub(S, W0) is A/B, the string having length greater than 2, then W0 is not merged if A is in L1; otherwise W0 is merged.
Rule 6: if W0 is AB and minSub(S, W0) is A/B, the string having length greater than 2, then W0 is not merged if B is in L5; otherwise W0 is merged.
Rule 7: in cases other than those covered by rules 2-6, W0 is not merged.
The above rules are executed in sequence. Compared with directly using the segmenter of the basic natural language processing model, the Chinese word segmentation model trained by this method shows greatly improved accuracy, recall and F-value in actual effect verification.
In specific implementation, step 1022 performs feature extraction:
in one embodiment, the preset document feature extraction policy may include: according to the occurrence frequency of the keywords in the document, the inverse document frequency of the keywords, the part of speech of the keywords, whether the keywords are professional words or not, the positions of the keywords in the document, the text-rank values of the keywords, the information entropy values of the keywords, the word vector and global deviation values of the keywords, the lengths of the keywords, the keywords serving as the components of the sentence, whether the keywords are divided into sub-keywords or not, the lengths of the positions of the keywords in the document where the keywords occur for the first time and the positions of the keywords where the keywords occur for the last time, and the distribution deviation of the keywords, or any combination thereof, the document characteristics are extracted.
In specific implementation, the document feature extraction strategy can improve the accuracy of feature extraction, so that the precision of semantic analysis is improved, the precision of determination of the weight value of each part of each document to be compared is improved, and finally the precision of determination of the similarity of the documents is improved.
In an embodiment, when a keyword can be further segmented into sub-keywords, the preset document feature extraction strategy may further include: extracting document features according to one of, or any combination of, the word frequency-inverse document frequency of the sub-keywords, the part of speech of the sub-keywords, and whether the sub-keywords are professional terms.
In specific implementation, the document feature extraction strategy considering the condition of the sub-keywords can further improve the accuracy of feature extraction, so as to further improve the precision of semantic analysis, further improve the precision of determining the weight value of each part of each document to be compared, and finally further improve the precision of determining the similarity of the documents.
In summary, to facilitate an understanding of how the present invention may be implemented, the document feature extraction strategy is further illustrated in the form of Table 1 below.
In specific implementation, after analyzing data features and business scenarios of patent documents, we propose a specific strategy for extracting features from patent documents by using a natural language processing technology, as shown in table 1 below:
Table 1: document feature extraction strategy (the table is rendered as images in the original publication and is not reproduced here).
As shown in Table 1 above, compared with the prior-art scheme of determining document similarity by setting fixed weights for different parts of the document by experience, the embodiment of the present invention in effect adjusts the weights based on semantic analysis, obtaining more accurate weight values and thereby improving the precision of document similarity determination.
In one embodiment, the plurality of types of key features may include: static features of documents, features of documents associated with queries, and features of queries.
In specific implementation, features are the nourishment on which algorithms and models feed: the quality of feature selection directly determines the effect of the model learned by algorithm training. Unlike traditional text classification, machine-learned ranking (MLR) in the patent field outputs a ranking of a set of documents for a given query (search condition), considering not only the features of the documents themselves but also the features of the association relationship between the query and the documents. Taken together, MLR in the patent domain needs to consider three kinds of features: static features of the documents, features of the document-query association, and features of the query.
the feature selection of patent information means to achieve the purpose of grasping a certain technical development state by inducing, deducing, analyzing, synthesizing, abstracting and summarizing the intrinsic features of patent documents, namely, the technical contents of patents. Specifically, according to the technical contents such as the technical subject, the patent country, the patent inventor, the patent assignee, the patent classification number, the patent application date, the patent authorization date, the patent citation document and the like provided by the patent document, information collection is widely performed, the collected contents are read and written, and on the basis, research activities such as classification, comparison, analysis and the like are further performed on the information to form an organic information set. And then, the patent documents with representative, critical and typical characteristics are intensively researched, and finally, the inherent and even potential interrelations among the patent information are found, so that a relatively complete understanding is formed.
1. Static features of the document itself include text features such as weighted word vectors and TF, IDF, BM25 and other language-model scores over the different fields of the document (title, bibliographic abstract, claims, patent specification, full text, etc.), as well as importance scores such as the document's quality score and the patent's claims. Regarding the quality score, different patent classes use different calculation indexes; for example, the quality score of a patent document in the mechanical field must consider not only the textual richness of the patent itself but also related content such as part-of-speech changes, word positions and the syntactic structures of the mechanical field.
2. Document-query association features, such as the TF-IDF score and BM25 score of the document with respect to the query.
3. Query features, such as the text features of the query, weighted patent-domain professional word vectors, the query length, the classes described by the query, the sum/avg/min/max/mean of the query's BM25 scores, the popularity of the query over the last month, and the like.
In query and document feature engineering, besides lexical analysis, the semantics that the wording "really wants to express", i.e. the concepts, also need to be analyzed and extracted. For example, one word may have multiple meanings, and there are synonyms and near-synonyms: the same word may have different meanings in different contexts, and different words may have the same meaning in different contexts. LSA (latent semantic analysis) is a well-known technique for dealing with such problems. Its main idea is to map the high-dimensional vector space to a low-dimensional latent semantic space or concept space, i.e. to perform dimensionality reduction; specifically, singular value decomposition (SVD) is performed on the term-document matrix. In this decomposition, C is a matrix (say M × N) with documents as rows, terms as columns, and the TF-IDF values of the terms as elements. C is decomposed into the product of three smaller matrices: each column of U represents a topic, where each non-zero element represents the relevance of a topic to a document, larger values being more relevant; V represents the relevance of the keywords to all terms; Σ represents the correlation between document topics and keywords.
In one embodiment, as shown in fig. 3, the semantic analysis-based document similarity determination method may further include step 1023: screening and combining the features of the feature set corresponding to each part by principal component analysis, linear discriminant analysis and mutual information, to obtain the dimension-reduced feature set corresponding to each part;
accordingly, performing word-level, syntax-level and chapter-level semantic analysis on each part according to the feature set corresponding to each part may include: performing the word-level, syntax-level and chapter-level semantic analysis on each part according to the dimension-reduced feature set corresponding to each part, to obtain the semantic analysis result of each part.
In practical implementation, the number of words composing a text is usually extremely large, so the dimensionality of the vector space representing the text is also very high, possibly reaching thousands of dimensions, and dimension reduction processing is therefore required. Dimension reduction is generally performed by feature selection, and the feature indexes characterizing a word may include: document frequency, information gain, mutual information, chi-square test, term strength, etc. One of these indexes is calculated for the words, which are then sorted from large to small, and the specified number of words (or the words whose index value exceeds a specified threshold) are selected to form the feature set. The dimension reduction process is further described below.
In specific implementation, after the patent documents are converted into feature sets (that is, in step 1023, which follows step 1022), many algorithms could directly enter the retrieval and ranking stage. However, because the volume of patent literature data is huge and each patent document has a great many features, the computation speed of the algorithms is crucial; otherwise an algorithm may look attractive but have no practical effect. To further increase the computation speed, on the basis of the feature set, the features are screened and combined using methods such as principal component analysis, linear discriminant analysis and mutual information to obtain new features, achieving the purpose of feature dimension reduction.
In specific implementation, dimension reduction is a method of preprocessing high-dimensional feature data. It retains the most important features of the high-dimensional data and removes noise and unimportant features, thereby increasing the speed of data processing. In actual production and application, within a certain tolerable range of information loss, dimension reduction can save a great deal of time and cost; it has therefore become a very widely applied data preprocessing method.
In specific implementation, the dimension reduction has the following advantages:
1. making the data set easier to use.
2. The calculation overhead of the algorithm is reduced.
3. And removing the noise.
4. Making the results easy to understand.
In step 1024, a natural language processing engine is used to perform semantic analysis on the patent documents at the word level, syntax level and chapter level. Word-level analysis includes Chinese word segmentation, named entity recognition, part-of-speech tagging, synonym analysis, word vector analysis, n-gram analysis, word granularity analysis, stop-word analysis, and the like; syntax-level analysis of patent literature includes dependency grammar analysis, language models, short-string analysis, etc.; chapter-level analysis of patent literature includes patent label extraction, topic models, text clustering, and the like.
In specific implementation, in step 1024, feature evaluation weighting is also performed: feature weights are mainly calculated using word frequency-inverse document frequency formulas, and depending on the mining purpose there are currently various ways to construct such formulas. From the above, it can be seen that the construction of the text vector space is carried out entirely according to probability-statistical rules, without considering the relations between words. Another key step in the text mining process is to compute similarity distances between text vectors: for any two vectors X = (x1, x2, ..., xn) and X' = (x'1, x'2, ..., x'n), there are mainly three common distance measures, Euclidean distance, cosine distance and inner product, and none of the three involves analyzing the relations between the words in the vectors. From this analysis it follows that both the text vector construction process and the similarity distance measurement are designed on the principle of probability statistics, and the semantic relations of the features are not considered. In the use of natural language, synonyms and associated words (words frequently used together according to context) appear in large numbers in text for the purpose of expression; for example, in IT texts the two Chinese synonyms for "computer" both appear, and in the judicial field the probability of "police" and "case" occurring together is very high. These synonyms and associated words also appear in large numbers in the text feature vector, which on the one hand increases the dimensionality of the vector and on the other hand reduces the precision with which the vector represents the document. Although text feature selection can reduce the dimensionality of the feature vector through a preset threshold, it does not guarantee semantic accuracy and is often counterproductive. Although a synonym dictionary and an implication dictionary can be used in the word segmentation process to reduce synonyms and associated words, this brings problems of dictionary maintenance and updating.
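For reference, the three distance measures named above can be written as the following Java sketch (the vectors are assumed to be same-length TF-IDF vectors; the sample values are invented):

// The three common distance measures between text vectors mentioned above.
public class VectorDistances {

    static double innerProduct(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    static double euclidean(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    static double cosine(double[] x, double[] y) {
        double nx = Math.sqrt(innerProduct(x, x));
        double ny = Math.sqrt(innerProduct(y, y));
        return (nx == 0.0 || ny == 0.0) ? 0.0 : innerProduct(x, y) / (nx * ny);
    }

    public static void main(String[] args) {
        double[] x = {0.5, 0.1, 0.0};
        double[] y = {0.4, 0.0, 0.2};
        System.out.printf("inner=%.3f euclidean=%.3f cosine=%.3f%n",
                innerProduct(x, y), euclidean(x, y), cosine(x, y));
    }
}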
From the above analysis, associating near-synonyms and related words with a given word is very important both for user applications and for text semantic mining. The function of the search association model is mainly to solve this problem, and the search association model is described below.
Third, next, for ease of understanding, the above steps 103 to 105 will be described together.
In specific implementation, the weight value of each part of each document to be compared is determined according to the semantic analysis result of each part; obtaining a weighted average result of each document to be compared according to the weight value of each part of each document to be compared; and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
Fourthly, next, further preferred steps are introduced.
The embodiment of the invention can be used in semantic retrieval. Each document in the document database to be retrieved undergoes document preprocessing; taking patent documents as an example, preprocessing converts the patent document, through processing, extraction, encoding conversion, normalization and the like, into semi-structured data containing plain text. This semi-structured data serves as the input of intelligent patent retrieval and is compared and ranked against pre-established semi-structured data of the same kind using the semantic analysis-based document similarity determination method provided by the embodiment of the invention, yielding intelligent search results, which may also be called semantic search results, with high retrieval precision. Documents and their processing results are delivered through a Kafka distributed message queue. The whole process is shown in FIG. 4:
the patent literature data is mainly in the format of XML. First, the needed data needs to be extracted and spliced from the two XML data tables.
XML files are mostly used to describe information, so after an XML document is obtained, the corresponding information is extracted according to the elements in the XML; this is XML parsing. There are two parsing modes, DOM parsing and SAX parsing; the embodiment of the invention processes XML with DOM parsing.
An XML parser based on DOM parsing converts the document into a collection of object models and stores the information in a tree data structure. Through the DOM interface, an application can access any part of the data in the XML document at any time, so this mode of access through the DOM interface is also called random access.
This approach also has a drawback: the DOM parser converts the whole XML document into a tree kept in memory, which requires more memory when the document is large or its structure complex, and traversing a tree with a complex structure is a time-consuming operation. Nevertheless, the tree structure adopted by DOM matches the way XML stores information, and DOM's random access can be exploited, so the DOM interface still has wide practical value.
There are 4 core operation interfaces in the DOM parsing:
1. Document: this interface represents the entire XML document; it is the root of the whole DOM tree, i.e. the entry to the tree, through which the content of all elements in the XML can be accessed. Its common methods are as follows:
public NodeList getElementsByTagName(String tagname);
obtains the NodeList with the specified node name;
public Element createElement(String tagName) throws DOMException;
creates an element node with the specified name;
public Text createTextNode(String data) throws DOMException;
creates a text content node;
public Attr createAttribute(String name) throws DOMException;
creates an attribute.
2. Node: this interface plays a central role in the whole DOM tree; the core interfaces of DOM operations (Document, Element, Attr) all inherit from Node. In the DOM tree, each Node interface represents a node of the DOM tree.
Common Node interface methods:
public Node appendChild(Node newChild) throws DOMException;
appends a new child node under the current node;
public NodeList getChildNodes();
obtains all child nodes of this node;
public Node getFirstChild();
obtains the first child node of this node;
public Node getLastChild();
obtains the last child node of this node;
public boolean hasChildNodes();
determines whether this node has child nodes;
public String getNodeValue() throws DOMException;
obtains the node content.
3. NodeList: this interface represents a collection of nodes, generally an ordered set of related nodes. Common NodeList methods:
public int getLength();
obtains the number of nodes in the NodeList;
public Node item(int index);
obtains the node object at the given index.
4. NamedNodeMap: this interface represents the one-to-one correspondence between a group of nodes and their unique names, and is mainly used to represent node attributes.
Beyond the above four core interfaces, a program that needs to perform DOM parsing operations requires the following steps:
1. Create a DocumentBuilderFactory, used to obtain a DocumentBuilder object:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
2. Create a DocumentBuilder:
DocumentBuilder builder = factory.newDocumentBuilder();
3. Create a Document object and obtain the entry to the tree:
Document doc = builder.parse("relative or absolute path of the XML file");
4. Create a NodeList:
NodeList nl = doc.getElementsByTagName("name of the desired node");
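Assembled into one runnable example, the four steps look as follows (the file name patent.xml and the element name invention-title are illustrative assumptions, not taken from the patent):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// The four DOM parsing steps assembled into one program.
public class PatentXmlDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); // step 1
        DocumentBuilder builder = factory.newDocumentBuilder();                // step 2
        Document doc = builder.parse("patent.xml");                            // step 3: entry to the tree
        NodeList titles = doc.getElementsByTagName("invention-title");         // step 4
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getTextContent());
        }
    }
}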
The following describes intelligent retrieval of patent documents by using a semantic analysis-based document similarity determination method provided by an embodiment of the present invention.
1. Retrieval recall algorithm.
Ranking is the most important part of a retrieval system. It is divided into primary ranking (retrieval recall) and learned ranking based on business understanding, NLP and machine learning; retrieval recall mainly uses algorithms such as word frequency-inverse document frequency and BM25.
The aim of retrieval recall is to obtain the candidate items quickly with a relatively lightweight algorithm, so that the subsequent ranking with a complex ranking algorithm is facilitated; this optimizes the search efficiency of the whole process and avoids unnecessary computation. The candidate set is larger than the target set we will finally find: for example, if we aim to find the 50 most likely patent documents, the number of candidates can be set to 500.
(1) Retrieval recall based on word frequency-inverse document frequency
Word frequency-inverse document frequency (TF-IDF) is a commonly used weighting technique for information retrieval and text mining: a statistical method for evaluating how important a word is to one document in a document set or corpus. In a given document, term frequency (tf) refers to the frequency with which a given word appears in that document; the raw term count is normalized to prevent bias toward long documents. For a word ti in a particular document dj, its importance may be expressed as:

tf(i,j) = n(i,j) / Σk n(k,j)

where n(i,j) is the number of occurrences of the word ti in document dj, and the denominator is the sum of the numbers of occurrences of all words in document dj.

Inverse document frequency (idf) is a measure of the general importance of a word. The idf of a particular word can be obtained by dividing the total number of documents by the number of documents containing that word and taking the base-10 logarithm of the resulting quotient:

idf(i) = log10( |D| / |{j : ti ∈ dj}| )

where:

|D|: the total number of documents in the corpus;

|{j : ti ∈ dj}|: the number of documents containing the word ti (i.e. the documents with n(i,j) ≠ 0). If the word is not in the data this denominator is zero, so 1 + |{j : ti ∈ dj}| is typically used.

Then tfidf(i,j) = tf(i,j) × idf(i).

A high term frequency within a particular document, together with a low document frequency of the word across the whole document collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep important words.
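A minimal Java sketch of these formulas, using the smoothed denominator 1 + |{j : ti ∈ dj}| (the toy corpus is invented for the example):

import java.util.List;

// TF-IDF as defined above: tf normalized by document length, idf with the
// smoothed denominator so the quotient is never divided by zero.
public class TfIdf {

    static double tf(String term, List<String> doc) {
        long n = doc.stream().filter(term::equals).count();
        return (double) n / doc.size();
    }

    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log10((double) corpus.size() / (1 + df));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("semantic", "analysis", "of", "patent", "documents"),
                List.of("document", "similarity", "analysis"),
                List.of("patent", "retrieval"));
        List<String> doc = corpus.get(0);
        double score = tf("semantic", doc) * idf("semantic", corpus);
        System.out.printf("tfidf(semantic) = %.4f%n", score);
    }
}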
(2) Retrieval recall based on BM25
BM25 is a common formula for relevance scoring. It is equivalent to an extension of TF-IDF: it mainly computes the relevance of every word in the query to the document and adds up the scores, each word's relevance score being mainly affected by tf/idf. The original formula of BM25 is as follows:

score = Σ(i∈Q) log[ ((ri + 0.5) / (R - ri + 0.5)) / ((ni - ri + 0.5) / (N - ni - R + ri + 0.5)) ]

where ri is the number of relevant documents containing word i, ni is the number of documents containing word i, N is the number of all documents in the whole document dataset, and R is the number of documents relevant to this query. Since in the typical case there is no relevance information, ri and R are both 0, and in a normal query no word occurs more than once, so the scoring formula becomes:

score = Σ(i∈Q) log( (N - ni + 0.5) / (ni + 0.5) )
When the BM25 algorithm is used to score search relevance, it mainly works as follows: morpheme analysis is performed on the query to generate morphemes qi; then, for each search result d, the relevance score of each morpheme qi with d is calculated; finally, the relevance scores of the qi with respect to d are weighted and summed to obtain the relevance score of the query and d. Transcribed into the general formula of BM25:

Score(Q, d) = Σi Wi × R(qi, d)

where Q denotes the query and qi a morpheme obtained by analyzing Q (for Chinese, the word segmentation of the query can be taken as the morpheme analysis, treating each word as a morpheme qi); d denotes a search result document; Wi denotes the weight of the morpheme; R(qi, d) denotes the relevance score of morpheme qi and document d.
Wi, the weight of a word's relevance to a document, can be defined in many ways; in the general BM25 formula it is usually described in an IDF-like form, as follows:

Wi = IDF(qi) = log( (N - n(qi) + 0.5) / (n(qi) + 0.5) )

where N is the number of documents in the index and n(qi) is the number of documents containing qi. By the definition of IDF, for a given document set, the more documents contain qi, the lower the weight of qi: when many documents contain qi, qi has low discriminative power, and its importance for judging relevance is low. Now consider the relevance score R(qi, d) of morpheme qi and document d; its general form in BM25 is:
R(qi, d) = ( fi × (k1 + 1) / (fi + K) ) × ( qfi × (k2 + 1) / (qfi + k2) )

K = k1 × (1 - b + b × dl / avgdl)

where k1, k2 and b are adjustment factors, usually set empirically, typically k1 = 2 and b = 0.75; fi is the frequency of qi in d, and qfi is the frequency of qi in the query; dl is the length of document d, and avgdl is the average length of all documents. Since in most cases qi occurs only once in the query, i.e. qfi = 1, the formula can be simplified to:

R(qi, d) = fi × (k1 + 1) / (fi + K)

From the definition of K, the parameter b adjusts the influence of document length on relevance: the larger b is, the greater the influence of document length on the relevance score, and vice versa; and the greater the relative length of the document, the larger K is, and the smaller the relevance score. This can be understood as follows: the longer a document, the greater its chance of containing qi, so for the same fi, a long document's relevance to qi should be weaker than a short document's.
In summary, the correlation score formula of the BM25 algorithm can be summarized as:
Score(Q, d) = Σi IDF(qi) × fi × (k1 + 1) / ( fi + k1 × (1 - b + b × dl / avgdl) )

As the formula of BM25 shows, using different morpheme analysis methods, morpheme weight determination methods and morpheme-document relevance determination methods yields different search relevance calculation methods, which gives algorithm design more flexibility.
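The simplified formula (qfi = 1) can be sketched in Java as follows, with k1 = 2 and b = 0.75 as suggested above (the example counts are invented):

// Simplified BM25 term score (qfi = 1), matching the final formula above.
public class Bm25 {
    static final double K1 = 2.0;
    static final double B = 0.75;

    // fi: occurrences of the query term in the document; n: documents containing
    // the term; total: all documents; dl/avgdl: document length vs. average length.
    static double termScore(double fi, long n, long total, double dl, double avgdl) {
        double idf = Math.log((total - n + 0.5) / (n + 0.5));
        double K = K1 * (1 - B + B * dl / avgdl);
        return idf * fi * (K1 + 1) / (fi + K);
    }

    public static void main(String[] args) {
        // a term occurring 3 times in a document of length 120 (average 100),
        // found in 50 of 10,000 documents; a full query sums such term scores
        double score = termScore(3, 50, 10_000, 120, 100);
        System.out.printf("score = %.3f%n", score);
    }
}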
(3) Function-based ranking
Many ranking models are function-based. For example, decay functions can attenuate scores and reduce weights along dimensions such as time and quantity; common decay functions include Gaussian decay, linear decay, and the like.

Gaussian decay: the decay function is derived from a Gaussian function, and the formula is defined as follows:

S(doc) = exp( -max(0, |fieldvalue_doc - origin| - offset)² / (2σ²) )

The score of this formula decays once fieldvalue exceeds origin ± offset. Given scale (the decay limit) and decay (the decay degree), the shape of the decay function is determined, so σ is calculated as:

σ² = -scale² / (2 × ln(decay))

Exponential decay:

S(doc) = exp( λ × max(0, |fieldvalue_doc - origin| - offset) )

Likewise, the score of this formula decays beyond origin ± offset; given scale and decay, the shape of the decay function is also determined, so λ is calculated as:

λ = ln(decay) / scale

Linear decay:

S(doc) = max( (s - max(0, |fieldvalue_doc - origin| - offset)) / s , 0 )

where s is calculated as:

s = scale / (1.0 - decay)

Compared with the first two formulas, the score of this formula becomes exactly 0 when fieldvalue is far enough from the origin (at twice the scale when decay = 0.5).
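The three decay functions can be sketched in Java as follows; with decay = 0.5 and a distance equal to scale, each function returns 0.5, which is a quick sanity check of the formulas (the parameter values in main are invented):

// The Gaussian, exponential and linear decay functions defined above.
public class DecayFunctions {

    static double distance(double fieldValue, double origin, double offset) {
        return Math.max(0.0, Math.abs(fieldValue - origin) - offset);
    }

    static double gauss(double v, double origin, double offset, double scale, double decay) {
        double sigma2 = -scale * scale / (2 * Math.log(decay)); // sigma^2 from scale and decay
        double d = distance(v, origin, offset);
        return Math.exp(-d * d / (2 * sigma2));
    }

    static double exponential(double v, double origin, double offset, double scale, double decay) {
        double lambda = Math.log(decay) / scale; // negative, so the score decreases
        return Math.exp(lambda * distance(v, origin, offset));
    }

    static double linear(double v, double origin, double offset, double scale, double decay) {
        double s = scale / (1.0 - decay);
        return Math.max(0.0, (s - distance(v, origin, offset)) / s);
    }

    public static void main(String[] args) {
        // each function should yield 0.5 at a distance of scale = 10 from the origin
        System.out.printf("gauss=%.3f exp=%.3f linear=%.3f%n",
                gauss(10, 0, 0, 10, 0.5),
                exponential(10, 0, 0, 10, 0.5),
                linear(10, 0, 0, 10, 0.5));
    }
}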
2. Retrieval ranking algorithm.
The essence of the retrieval system is to achieve semantic-level matching based on understanding of the query and the patents, and then rank the matched candidate results. Ranking comprises two stages, recall and refinement: the recall stage returns the TopN candidate results, and in the refinement stage the TopM results (N > M) are ranked precisely with a machine-learned ranking model; LTR (learning to rank) is selected.
The construction of the above-mentioned search association model using latent semantic analysis and association rule mining is described in detail below.
1. Latent semantic analysis.
Latent semantic analysis LSA reduces synonymous noise by introducing concept space. LSAs are considered to be similar in usage and meaning using contextual relevance of words, i.e., words that appear in similar contexts. To implement the LSA concept, a word-document matrix is first constructed:
A=|aij|m×n;
m represents the total vocabulary, n represents the number of documents, wherein aijNon-negative values indicate the weight of the ith word appearing in the jth document. Different words correspond to different rows of the matrix a, and each document corresponds to a column of the matrix a. Generally aij takes into account contributions from two aspects, namely local weights L (i, j) and global weights C (i, j). In the VSM model, the local weight L (i, j) and the global weight C (i, j) have different weight calculation methods, such as IDF, TFIDF, and the like. A is typically a high-order sparse matrix, since each word will only appear in a small number of documents. Let the ith and jth words correspond to the ith and jth lines of the word-document matrix respectively, and respectively denote ti (ai1, ai2, …, ain) and tj (aj1, aj2, …, ajn), and the similarity is defined as
Figure BDA0003541077720000182
Before calculating the similarity, aij is usually converted into log (aij +1) and divided by its entropy, so that the context of the word can be taken into account by preprocessing, and the context of the word in the article is highlighted. Obtaining a sequenced word-document matrix A '═ a' ij m × n after information entropy transformation, wherein:
a'_ij = log(a_ij + 1) / H_i, where H_i = −Σ_{j=1..n} p_ij log(p_ij) and p_ij = a_ij / Σ_{j=1..n} a_ij
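A Python sketch of this log-entropy preprocessing follows; the handling of zero-entropy rows (words occurring in a single document) is an assumption added so the division is well defined:

import numpy as np

def log_entropy_transform(A):
    # A: m x n word-document count matrix; returns the transformed matrix A'
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    p = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    H = -plogp.sum(axis=1, keepdims=True)   # entropy H_i of each word over documents
    H[H == 0] = 1.0                         # words in a single document: avoid dividing by 0
    return np.log(A + 1.0) / H

A = np.array([[2, 0, 1], [0, 3, 0]])
print(log_entropy_transform(A))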
The theoretical basis of latent semantic analysis is the singular value decomposition (SVD) of a matrix, a common method in mathematical statistics. After the word-document matrix A' is established, a k-rank approximation matrix A'_k (k << min(m, n)) of A' is computed using singular value decomposition, which expresses A' as the product of three matrices:
A' = U Σ V^T
where U and V are the left and right singular vector matrices corresponding to the singular values of A'; Σ is the diagonal matrix of singular values; and V^T is the transpose of V. The singular values of A' are arranged in descending order; the k largest form the diagonal matrix Σ_k, and the first k columns of U and V are taken to construct the k-rank approximation matrix of A':
A'_k = U_k Σ_k V_k^T
In this formula the columns of U_k and V_k are orthonormal vectors; assuming the rank of A' is r, we have:
U^T U = V^T V = I_r (I_r is the r × r identity matrix);
The original word-document matrix A' is approximated by A'_k, and the row vectors of U_k and V_k serve as word vectors and document vectors respectively; carrying out text classification and other document processing on this basis is the latent semantic indexing (LSI) technique. While LSI also represents a document's semantics by the words it contains, the LSI model does not treat all the words in a document as a reliable representation of its concepts. Because the semantic structure of a document is obscured to a great extent by the diversity of its words, LSI reduces the noise contained in the original word-document matrix through singular value decomposition and the k-rank approximation, thereby better exposing the semantic relationship between words and documents; at the same time, the vector spaces of words and documents are greatly reduced, which improves the efficiency of text mining.
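A minimal LSI sketch for this step follows: it builds A'_k = U_k Σ_k V_k^T with NumPy and reads off word and document vectors. Scaling the vectors by the singular values is a common convention and an assumption here:

import numpy as np

def lsi(A_prime, k):
    U, s, Vt = np.linalg.svd(np.asarray(A_prime, dtype=float), full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]   # keep the k largest singular values
    A_k = Uk @ np.diag(sk) @ Vtk               # k-rank approximation of A'
    word_vecs = Uk * sk                        # one k-dimensional vector per word (row)
    doc_vecs = Vtk.T * sk                      # one k-dimensional vector per document
    return A_k, word_vecs, doc_vecs

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

A_prime = np.array([[1., 1., 0.], [1., 0., 1.], [0., 1., 1.]])
A_k, words, docs = lsi(A_prime, k=2)
print(cosine(words[0], words[1]))  # word similarity in the concept space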
2. Association rule mining.
Association rule mining is a core technique of data mining and can discover potentially useful associations or correlations from massive data. Let I = {i1, i2, …, im} be the set of items and D a set of transactions T, where each transaction T is a set of items with T ⊆ I. The unique identifier of each transaction is denoted TID. Let X be a set of items in I; if X ⊆ T, the transaction T is said to contain X. An association rule is an implication of the form X → Y, where
X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
The support of rule X → Y in transaction set D is the ratio of the number of transactions containing both X and Y to the number of all transactions, denoted support(X → Y), i.e.:
support(X → Y) = |{T ∈ D : X ∪ Y ⊆ T}| / |D|
The confidence of rule X → Y in the transaction set is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, denoted confidence(X → Y), i.e.:
confidence(X → Y) = |{T ∈ D : X ∪ Y ⊆ T}| / |{T ∈ D : X ⊆ T}|
Given a transaction set D, the association rule mining problem is to generate all association rules whose support and confidence exceed the user-specified minimum support and minimum confidence; common algorithms include the Apriori algorithm and the FP-Growth algorithm.
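The two definitions translate directly into code; the brute-force counting in the Python sketch below is for illustration only (Apriori and FP-Growth exist precisely to avoid it on real data):

def support(X, Y, transactions):
    both = sum(1 for T in transactions if X | Y <= T)
    return both / len(transactions)

def confidence(X, Y, transactions):
    with_x = sum(1 for T in transactions if X <= T)
    both = sum(1 for T in transactions if X | Y <= T)
    return both / with_x if with_x else 0.0

D = [{"ti", "tj", "tk"}, {"ti", "tj"}, {"ti"}, {"tk"}]
print(support({"ti"}, {"tj"}, D))     # 2/4 = 0.5
print(confidence({"ti"}, {"tj"}, D))  # 2/3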
For related-word mining, a mined association rule takes the form {t_i → t_j, s, c}, indicating that when word t_i appears in a document, word t_j appears in the same document with support s (0 ≤ s ≤ 1) and confidence c (0 ≤ c ≤ 1). If both the support and the confidence exceed their specified thresholds, the correlation between the two words can be considered large: large enough that ignoring one of them causes no information loss, while the independence of the remaining words is preserved.
3. Algorithm procedure.
The algorithms for constructing the synonym set and the related-word set according to the above principles are described as follows; a runnable sketch follows each algorithm's step list.
Algorithm 1: synonym set construction algorithm.
Input:
1) a training document set;
2) the concept space dimension k;
3) the number N of feature words to retain after synonym merging.
Output:
1) a feature word set with N elements in total;
2) a merge scheme set (recording which retained feature word each merged synonym should be merged into).
The method comprises the following steps:
1) constructing a word-document matrix according to which words the training documents contain;
2) applying the information-entropy transformation to the word-document matrix;
3) performing singular value decomposition on the word-document matrix with SVD to obtain its left and right singular vector matrices and the diagonal matrix of singular values;
4) retaining the first k columns of the left singular vector matrix and zeroing all other columns;
5) retaining the first k entries on the diagonal of the singular value matrix and zeroing the rest of the diagonal;
6) retaining the first k columns of the right singular vector matrix and zeroing all other columns;
7) multiplying the zeroed left and right singular vector matrices with the singular value matrix to obtain a new word-document matrix;
8) applying the information-entropy transformation to the new word-document matrix;
9) while the number of feature words exceeds N, performing steps 10-14 to merge synonyms;
10) finding the two feature words with the greatest similarity in the feature word set;
11) deleting either one of that most-similar feature word pair from the feature word set;
12) searching the merge scheme set for a scheme containing either feature word of the most-similar pair;
13) if such a scheme is found, adding the other feature word to that scheme's merged feature word set;
14) if no scheme matches, constructing a new merge scheme from the two feature words and adding it to the merge scheme set.
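The following condensed Python sketch follows the outline of Algorithm 1: truncated SVD over the word-document matrix, then greedy merging of the most similar feature word pairs until only N remain. The cosine similarity measure, the merge bookkeeping keyed by the retained word, and the omission of the entropy transformation of steps 2 and 8 are all simplifying assumptions:

import numpy as np

def build_synonym_sets(A, words, k, N):
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    W = U[:, :k] * s[:k]                   # steps 3-7: word vectors in the concept space
    keep = list(range(len(words)))         # indices of retained feature words
    merges = {}                            # retained word -> set of merged synonyms

    def cos(i, j):
        d = np.linalg.norm(W[i]) * np.linalg.norm(W[j])
        return W[i] @ W[j] / d if d else 0.0

    while len(keep) > N:                   # steps 9-14
        i, j = max(((a, b) for ai, a in enumerate(keep) for b in keep[ai + 1:]),
                   key=lambda p: cos(*p))  # step 10: most similar pair
        keep.remove(j)                     # step 11: drop one word of the pair
        merges.setdefault(words[i], set()).add(words[j])  # steps 12-14: record the merge
    return [words[i] for i in keep], merges

A = [[2, 0, 1], [2, 0, 1], [0, 3, 0]]
print(build_synonym_sets(A, ["auto", "car", "bank"], k=2, N=2))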
Algorithm 2: related-word set construction algorithm.
Input:
1) a training document set;
2) the association merging thresholds: support s and confidence c.
Output:
1) the feature word set after association merging;
2) the merge scheme set after association merging.
The method comprises the following steps:
1) using the Apriori algorithm to obtain all single-dimensional association rules whose support and confidence exceed s and c respectively;
2) performing steps 3-6 on each such single-dimensional association rule to merge associated words;
3) deleting the feature word on the right-hand side of the association rule from the feature word set;
4) searching the merge scheme set for a scheme containing either feature word of the association rule;
5) if such a scheme is found, adding the other feature word to that scheme's merged feature word set;
6) if no scheme matches, constructing a new merge scheme from the two feature words on the left and right sides of the association rule and adding it to the merge scheme set.
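A condensed Python sketch of Algorithm 2 follows; brute-force pair counting stands in for Apriori, and the merge bookkeeping mirrors the Algorithm 1 sketch (both are assumptions):

def build_related_sets(transactions, vocab, s_min, c_min):
    keep = set(vocab)
    merges = {}                  # left-hand word -> set of merged right-hand words
    for ti in vocab:
        if ti not in keep:
            continue
        for tj in vocab:
            if tj == ti or tj not in keep:
                continue
            both = sum(1 for T in transactions if ti in T and tj in T)
            with_ti = sum(1 for T in transactions if ti in T)
            s = both / len(transactions)
            c = both / with_ti if with_ti else 0.0
            if s > s_min and c > c_min:               # steps 1-2: a qualifying rule ti -> tj
                keep.discard(tj)                      # step 3: drop the right-hand word
                merges.setdefault(ti, set()).add(tj)  # steps 4-6: record the merge
    return keep, merges

D = [{"patent", "claim"}, {"patent", "claim", "search"}, {"search"}]
print(build_related_sets(D, ["patent", "claim", "search"], s_min=0.5, c_min=0.8))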
Practice shows that the retrieval association model can be effectively constructed through latent semantic analysis and association rule mining.
In summary, the method for determining the document similarity based on semantic analysis provided by the embodiment of the invention can accurately determine the weights of different parts of the document based on semantic analysis, and further accurately determine the document similarity.
The embodiment of the invention also provides a document similarity determining device based on semantic analysis, which is described in the following embodiment. Because the principle by which the device solves the problem is similar to that of the semantic analysis-based document similarity determination method, the implementation of the device can refer to the implementation of that method, and repeated description is omitted.
Fig. 5 is a schematic structural diagram of a document similarity determining apparatus based on semantic analysis according to an embodiment of the present invention, as shown in fig. 5, the apparatus includes:
a dividing unit 01 for dividing each document to be compared into a plurality of parts;
the semantic analysis unit 02 is used for performing semantic analysis on each part to obtain a semantic analysis result of each part;
a weight value determining unit 03 for determining a weight value of each part of each document to be compared, based on a semantic analysis result of each part;
the processing unit 04 is configured to obtain a weighted average result of each document to be compared according to the weight value of each part of each document to be compared;
and the similarity determining unit 05 is used for determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
In one embodiment, as shown in fig. 6, the semantic analysis unit 02 may include:
the word segmentation processing module 021 is used for performing word segmentation processing on each part to obtain a plurality of keywords corresponding to each part;
a feature extraction module 022, configured to extract, according to a plurality of keywords corresponding to each part and a preset document feature extraction policy, a plurality of types of key features from each part to form a feature set corresponding to each part;
and the feature evaluation module 024 is configured to perform semantic analysis on each part at a word level, a syntax level and a chapter level according to a feature set corresponding to each part to obtain a semantic analysis result of each part.
In an embodiment, as shown in fig. 7, the semantic analysis unit 02 may further include a feature dimension reduction module 023, configured to perform feature screening and feature combination on a feature set corresponding to each part by using a principal component analysis method, a linear discriminant analysis method, and a mutual information method, so as to obtain a feature set corresponding to each part after feature dimension reduction processing;
the feature evaluation module 024 is specifically configured to: and performing semantic analysis on each part at a word level, a syntax level and a chapter level according to the feature set corresponding to each part after feature dimension reduction processing to obtain a semantic analysis result of each part.
In one embodiment, the plurality of types of key features may include: static features of documents, features of documents associated with queries, and features of queries.
In one embodiment, the preset document feature extraction policy may include extracting document features according to one of the following, or any combination thereof: the keyword's occurrence frequency in the document, the keyword's inverse document frequency, the keyword's part of speech, whether the keyword is a professional (domain-specific) word, the keyword's position in the document, the keyword's TextRank value, the keyword's information entropy, the keyword's word vector and global deviation value, the keyword's length, the keyword's syntactic role in the sentence, whether the keyword can be further segmented into sub-keywords, the distance between the positions of the keyword's first and last occurrences in the document, and the keyword's distribution deviation.
In one embodiment, when the keyword can be further segmented into sub-keywords, the preset document feature extraction policy may further include: extracting document features according to one of the following, or any combination thereof: the sub-keywords' term frequency-inverse document frequency, the sub-keywords' parts of speech, and whether the sub-keywords are professional (domain-specific) words.
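As an illustration, a few of these features are computed in the Python sketch below; the feature names and exact formulas are hypothetical, since the policy above lists the feature types without fixing how each is calculated:

import math

def keyword_features(keyword, doc_tokens, corpus):
    positions = [i for i, t in enumerate(doc_tokens) if t == keyword]
    df = sum(1 for d in corpus if keyword in d)
    return {
        "tf": len(positions),                           # occurrence frequency in the document
        "idf": math.log((1 + len(corpus)) / (1 + df)),  # smoothed inverse document frequency
        "first_pos": positions[0] if positions else -1, # position of the first occurrence
        "span": (positions[-1] - positions[0]) if positions else 0,  # first-to-last distance
        "length": len(keyword),                         # keyword length
    }

corpus = [["patent", "search"], ["semantic", "patent", "analysis", "patent"]]
print(keyword_features("patent", corpus[1], corpus))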
In the above technical scheme, the acquisition, storage, use, and processing of data all comply with the relevant provisions of national laws and regulations.
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the computer program to realize the document similarity determination method based on the semantic analysis.
The embodiment of the invention also provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the method for determining the similarity of the documents based on the semantic analysis is realized.
An embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program, and when executed by a processor, the computer program implements the above method for determining similarity of documents based on semantic analysis.
In the prior art, different fixed weights are preset for different parts of a document according to experience in order to determine document similarity, so the weight setting is inaccurate and the resulting document similarity determination is inaccurate as well. Compared with that approach, the semantic analysis-based document similarity determination scheme of the embodiment of the invention comprises: dividing each document to be compared into a plurality of parts; performing semantic analysis on each part to obtain a semantic analysis result of each part; determining a weight value of each part of each document to be compared according to the semantic analysis result of each part; obtaining a weighted average result of each document to be compared according to the weight values of its parts; and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared. The weights of different parts of the documents are thereby determined accurately on the basis of semantic analysis, so the document similarity is also determined accurately.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and should not be used to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A document similarity determination method based on semantic analysis is characterized by comprising the following steps:
dividing each document to be compared into a plurality of parts;
performing semantic analysis on each part to obtain a semantic analysis result of each part;
determining a weight value of each part of each document to be compared according to the semantic analysis result of each part;
obtaining a weighted average result of each document to be compared according to the weight value of each part of each document to be compared;
and determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
2. The document similarity determination method based on semantic analysis according to claim 1, wherein performing semantic analysis on each part to obtain a semantic analysis result of each part comprises:
performing word segmentation processing on each part to obtain a plurality of keywords corresponding to each part;
extracting a plurality of types of key features from each part according to a plurality of key words corresponding to each part and a preset document feature extraction strategy to form a feature set corresponding to each part;
and performing semantic analysis on each part at a word level, a syntax level and a chapter level according to the characteristic set corresponding to each part to obtain a semantic analysis result of each part.
3. The document similarity determination method based on semantic analysis according to claim 2, further comprising: screening and combining features of the feature set corresponding to each part by using a principal component analysis method, a linear discriminant analysis method and a mutual information method to obtain the feature set corresponding to each part after feature dimension reduction processing;
according to the feature set corresponding to each part, semantic analysis of word level, syntax level and chapter level is carried out on each part to obtain the semantic analysis result of each part, and the semantic analysis result comprises the following steps: and performing semantic analysis on each part at a word level, a syntax level and a chapter level according to the feature set corresponding to each part after feature dimension reduction processing to obtain a semantic analysis result of each part.
4. The semantic analysis-based document similarity determination method according to claim 2, wherein the plurality of types of key features comprise: static features of documents, features of documents associated with queries, and features of queries.
5. The document similarity determination method based on semantic analysis according to claim 2, wherein the preset document feature extraction strategy comprises: extracting document features according to one of the following, or any combination thereof: the keyword's occurrence frequency in the document, the keyword's inverse document frequency, the keyword's part of speech, whether the keyword is a professional (domain-specific) word, the keyword's position in the document, the keyword's TextRank value, the keyword's information entropy, the keyword's word vector and global deviation value, the keyword's length, the keyword's syntactic role in the sentence, whether the keyword can be further segmented into sub-keywords, the distance between the positions of the keyword's first and last occurrences in the document, and the keyword's distribution deviation.
6. The document similarity determination method based on semantic analysis according to claim 5, wherein, when the keyword can be further segmented into sub-keywords, the preset document feature extraction strategy further comprises: extracting document features according to one of the following, or any combination thereof: the sub-keywords' term frequency-inverse document frequency, the sub-keywords' parts of speech, and whether the sub-keywords are professional (domain-specific) words.
7. A document similarity determination device based on semantic analysis is characterized by comprising:
a dividing unit for dividing each document to be compared into a plurality of parts;
the semantic analysis unit is used for carrying out semantic analysis on each part to obtain a semantic analysis result of each part;
the weighted value determining unit is used for determining the weighted value of each part of each document to be compared according to the semantic analysis result of each part;
the processing unit is used for obtaining a weighted average result of each document to be compared according to the weight value of each part of each document to be compared;
and the similarity determining unit is used for determining the similarity between the documents to be compared according to the weighted average result of each document to be compared.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1 to 6.
10. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 6.
CN202210240186.2A 2022-03-10 2022-03-10 Document similarity determination method and device based on semantic analysis Pending CN114580557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210240186.2A CN114580557A (en) 2022-03-10 2022-03-10 Document similarity determination method and device based on semantic analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210240186.2A CN114580557A (en) 2022-03-10 2022-03-10 Document similarity determination method and device based on semantic analysis

Publications (1)

Publication Number Publication Date
CN114580557A true CN114580557A (en) 2022-06-03

Family

ID=81775645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210240186.2A Pending CN114580557A (en) 2022-03-10 2022-03-10 Document similarity determination method and device based on semantic analysis

Country Status (1)

Country Link
CN (1) CN114580557A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975626A (en) * 2023-06-09 2023-10-31 浙江大学 Automatic updating method and device for supply chain data model
CN116975626B (en) * 2023-06-09 2024-04-19 浙江大学 Automatic updating method and device for supply chain data model


Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN108763333B (en) Social media-based event map construction method
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN110298033B (en) Keyword corpus labeling training extraction system
CN111309925A (en) Knowledge graph construction method of military equipment
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
US20210350125A1 (en) System for searching natural language documents
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN108319583B (en) Method and system for extracting knowledge from Chinese language material library
CN111831786A (en) Full-text database accurate and efficient retrieval method for perfecting subject term
CN109783806A (en) A kind of text matching technique using semantic analytic structure
WO2020074787A1 (en) Method of searching patent documents
US20210397790A1 (en) Method of training a natural language search system, search system and corresponding use
CN112051986A (en) Code search recommendation device and method based on open source knowledge
CN113221559A (en) Chinese key phrase extraction method and system in scientific and technological innovation field by utilizing semantic features
Lin et al. A simple but effective method for Indonesian automatic text summarisation
Zehtab-Salmasi et al. FRAKE: fusional real-time automatic keyword extraction
CN114138979B (en) Cultural relic safety knowledge map creation method based on word expansion unsupervised text classification
Ajallouda et al. Kp-use: an unsupervised approach for key-phrases extraction from documents
Afuan et al. A new approach in query expansion methods for improving information retrieval
Quemy et al. ECHR-OD: On building an integrated open repository of legal documents for machine learning applications
Juan An effective similarity measurement for FAQ question answering system
Mezentseva et al. Optimization of analysis and minimization of information losses in text mining
CN114580557A (en) Document similarity determination method and device based on semantic analysis
CN114064855A (en) Information retrieval method and system based on transformer knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination