CN109902162B - Text similarity identification method based on digital fingerprints, storage medium and device - Google Patents

Text similarity identification method based on digital fingerprints, storage medium and device

Info

Publication number
CN109902162B
Authority
CN
China
Prior art keywords
data information
text
information
similarity
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910142914.4A
Other languages
Chinese (zh)
Other versions
CN109902162A (en)
Inventor
陈超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weizheng Technology Service Co ltd
Original Assignee
Weizheng Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weizheng Technology Service Co ltd
Priority to CN201910142914.4A
Publication of CN109902162A
Application granted
Publication of CN109902162B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity identification method, storage medium and device based on digital fingerprints, which address the low accuracy of text similarity detection. The technical scheme is as follows: two pieces of text data information are acquired; text preprocessing is performed on the text data information and input data vector information is formed through a hash function; text semantic information corresponding to the text data information is looked up in a preset database and characteristic data vector information is formed through a hash function; matrix data information is formed according to the correspondence between the input data vector information and the characteristic data vector information; the matrix data information is processed and analyzed by a pre-trained convolutional neural network model to form similar data information; and the similarity between the reference text data information and the comparison text data information is judged. The similarity between the two texts can thereby be judged more accurately.

Description

Text similarity identification method based on digital fingerprints, storage medium and device
Technical Field
The present invention relates to a method for text similarity identification, and more particularly, to a method, a storage medium, and an apparatus for text similarity identification based on digital fingerprints.
Background
With the rapid development of the internet and of information technology, the quantity of information resources of all kinds is growing dramatically. How to retrieve information quickly and accurately by accurately calculating the similarity between texts is a problem that urgently needs to be solved.
Text similarity calculation is applied in many fields of computer technology. In Text Retrieval, text similarity can improve the recall and precision of a search engine; in Text Mining, text similarity serves as a measure for discovering latent knowledge in a text database; in web-based Image Retrieval, accuracy can be improved by using the descriptive short texts surrounding an image. Text similarity calculation can also be applied in other research fields, including Text Summarization, Text Categorization, and Machine Translation.
As carriers that record human achievements, patents contain a large number of scientific and technological achievements and innovations. The rapid development of science and technology has led to a dramatic increase in the number of patent applications each year. In the traditional retrieval mode, results are returned by matching search terms, the number of matched search terms is generally taken as the relevance of a patent, and the semantic information contained in the patent is not considered. The essence of patent examination is to examine related patents with high similarity, and the most important point is calculating the similarity of patent texts. Text similarity is generally calculated by representing each text with a vector space model and then directly taking the vector similarity in that space as the text similarity.
Text similarity methods fall mainly into two categories: one converts the text into vector form using a vector space model and then computes on the vectors; the other uses a semantic dictionary to represent the relation between texts of different lengths and reflects their similarity by the number of matching keywords. Prior-art methods for calculating the similarity of Chinese patent texts lose semantic information; their similarity calculations are inaccurate, with low accuracy and recall, so they cannot accurately reflect the similarity of patent texts and cannot meet the requirements of practical application.
Disclosure of Invention
The first purpose of the invention is to provide a text similarity identification method based on digital fingerprints, which realizes the judgment of the similarity of patent texts according to the digital fingerprints and a convolutional neural network, and has higher accuracy.
The technical purpose of the invention is realized by the following technical scheme:
a text similarity identification method based on digital fingerprints comprises the following steps:
acquiring two pieces of text data information, which are respectively defined as reference text data information and comparison text data information;
performing text preprocessing on the text data information and forming input data vector information through a hash function; searching text semantic information corresponding to the text data information from a preset database and forming characteristic data vector information through a hash function; forming matrix data information according to the mutual corresponding relation of the input data vector information and the characteristic data vector information;
processing and analyzing the matrix data information according to the pre-trained convolutional neural network model to form similar data information;
and judging, through a similarity function, the similarity between the reference text data information and the comparison text data information according to the similar data information corresponding to the reference text data information and the similar data information corresponding to the comparison text data information.
By adopting the scheme, when the similarity of two pieces of text data information needs to be judged, the text is processed and encoded, and the multiple meanings corresponding to words or sentences in the text are also encoded, so that the two are associated to form matrix data; that is, the text information is converted into picture-like matrix data. The trained convolutional neural network then operates on the data to judge the similarity of the two texts. Because the calculation is based on the matrix data and the neural network, the similarity of the two texts is obtained more accurately.
Preferably, the method for preprocessing the text data information is as follows:
normalizing the text data information to obtain first data information;
performing hash value calculation on the first data information through a hash function to obtain hash value sequence information;
and taking the hash value sequence information as input data vector information according to a preset separation rule.
By adopting the scheme, because different text data information has different formats, the text data information needs to be fragmented and structured to the same format, so that the text can be identified by a program, and reading and operation of subsequent data are facilitated.
Preferably, a plurality of hash functions are provided, and high-dimensional matrix data information is formed according to a hash operation formula, which is as follows:
Hash(X_t Y) Y_i · Z, where Y_i denotes the operation on the i-th hash dimension;
wherein the input data vector information is denoted X_1, X_2, X_3, …, X_N; the characteristic data vector information is denoted Z_1, Z_2, Z_3, …, Z_M; and the operation data vector information formed according to the preset plurality of hash functions is denoted Y_1, Y_2, Y_3, …, Y_P.
By adopting the scheme, establishing hash dimensions further forms high-dimensional matrix data information, namely a three-dimensional matrix, which addresses the low accuracy caused by using only one hash function. Because a hash function can produce the same result for different data, when some part of the data collides, the other hash operations can distinguish it, which improves accuracy and makes the data as a whole more stable.
Preferably, the method of forming the similar data information is as follows:
processing high-dimensional matrix data information according to a pre-trained convolutional neural network model to acquire required characteristic data information;
inputting the acquired required characteristic data information into a preset activation function to convert linear data into nonlinear data so as to form modified characteristic data information;
the modified characteristic data information is subjected to mean pooling of a pooling layer and then pooled characteristic data information is output;
and correlating the pooled feature data information through a full connection layer to form similar data information.
Preferably, the pre-trained convolutional neural network model comprises eight convolutional layers; the eight convolutional layers respectively correspond to the eight pre-trained convolutional neural network models.
By adopting the scheme, the required characteristic data information is screened according to the trained convolutional neural network model, and similar data information is formed after a series of data processing, so that the subsequent judgment between two texts is facilitated; meanwhile, the convolutional layers are provided with eight layers, and each convolutional layer is provided with a corresponding convolutional neural network model, so that the required characteristic data information can be screened layer by layer, and the accuracy of acquiring the related characteristic data information is improved.
Preferably, the method of training the convolutional neural network model is as follows:
constructing an RPN convolutional neural network;
initializing the RPN convolutional neural network, and initializing parameters to be trained in the network by using different small random numbers;
taking two pieces of text data information corresponding to different similarities as input training sample data, giving the input training sample data reference frames with multiple scales and multiple proportions, training the RPN (region proposal network) by inputting the reference frames of the training sample data into the initialized RPN convolutional neural network, and adjusting parameters of the RPN convolutional neural network by using the back propagation (BP) algorithm to minimize the loss function value;
and applying the trained RPN convolutional neural network model on the training sample data to obtain a rough similarity selection box of the training sample set.
By adopting the scheme, the optimal rough similarity selection box is obtained through training on a large amount of sample data built from text data information with different similarities, so that the required characteristic data information is subsequently selected more accurately through the rough similarity selection box, and the accuracy of the whole neural network is improved.
Preferably, the method for determining the similarity between the reference text data information and the comparison text data information by the similarity function is as follows:
defining the data output for the reference text data information as G1(X1) and the data output for the comparison text data information as G2(X2), the loss function formula of the similarity is obtained as follows:
L(X1,X2)=||G1(X1)-G2(X2)||
wherein L (X1, X2) is the similarity between the reference text data information and the comparison text data information.
By adopting the scheme, the similarity between the two texts is judged, the judgment mode is based on the proximity degree between the two data, namely if the difference value between the two data is small, the two data are similar, and if the difference value between the two data is large, the similarity between the two data is low.
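The distance-based judgment described above can be sketched as follows. This is a minimal illustration only: the text does not say which norm is meant, so the Euclidean norm is assumed here, and the function name and sample vectors are hypothetical.

```python
import math
from typing import Sequence


def similarity_distance(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Return ||G1(X1) - G2(X2)||, assuming the Euclidean norm."""
    if len(g1) != len(g2):
        raise ValueError("output vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))


# Hypothetical network outputs for the reference and comparison texts.
reference_output = [0.2, 0.9, 0.4]
comparison_output = [0.2, 0.8, 0.4]
print(similarity_distance(reference_output, comparison_output))
```

A smaller returned value corresponds to more similar texts, matching the description that a small difference between the two outputs indicates similarity.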
The second purpose of the present invention is to provide a storage medium, which can store a corresponding instruction set, so as to realize the judgment of the similarity of patent texts, and the accuracy is higher.
The technical purpose of the invention is realized by the following technical scheme:
a storage medium storing a set of instructions adapted to be loaded by a processor and to perform a process comprising:
acquiring two pieces of text data information, which are respectively defined as reference text data information and comparison text data information;
performing text preprocessing on the text data information and forming input data vector information through a hash function; searching text semantic information corresponding to the text data information from a preset database and forming characteristic data vector information through a hash function; forming matrix data information according to the mutual corresponding relation between the input data vector information and the characteristic data vector information;
processing and analyzing the matrix data information according to the pre-trained convolutional neural network model to form similar data information;
and judging, through a similarity function, the similarity between the reference text data information and the comparison text data information according to the similar data information corresponding to the reference text data information and the similar data information corresponding to the comparison text data information.
By adopting the scheme, when the similarity of two pieces of text data information needs to be judged, the text is processed and encoded, and the multiple meanings corresponding to words or sentences in the text are also encoded, so that the two are associated to form matrix data; that is, the text information is converted into picture-like matrix data. The trained convolutional neural network then operates on the data to judge the similarity of the two texts. Because the calculation is based on the matrix data and the neural network, the similarity of the two texts is obtained more accurately.
The third purpose of the invention is to provide a recognition device, which can judge the similarity of patent texts with higher accuracy.
The technical purpose of the invention is realized by the following technical scheme:
an identification device comprising:
a processor for loading and executing a set of instructions; and
the storage medium described above.
By adopting the scheme, when the similarity of two pieces of text data information needs to be judged, the text is processed and encoded, and the multiple meanings corresponding to words or sentences in the text are also encoded, so that the two are associated to form matrix data; that is, the text information is converted into picture-like matrix data. The trained convolutional neural network then operates on the data to judge the similarity of the two texts. Because the calculation is based on the matrix data and the neural network, the similarity of the two texts is obtained more accurately.
In conclusion, the invention has the following beneficial effects: the similarity of the two texts can be more accurately judged.
Drawings
FIG. 1 is a block flow diagram of a method for identifying text similarity based on digital fingerprints;
FIG. 2 is a block flow diagram of a method of pre-processing text data information;
FIG. 3 is a block flow diagram of a method of normalizing text data information;
FIG. 4 is a data mapping diagram of high-dimensional matrix data information;
FIG. 5 is a block flow diagram of a method of forming similar data information;
FIG. 6 is a schematic diagram of an activation function;
FIG. 7 is a schematic of mean pooling;
FIG. 8 is a block flow diagram of a method of training a convolutional neural network model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The present embodiment is only for explaining the present invention, and it is not limited to the present invention, and those skilled in the art can make modifications of the present embodiment without inventive contribution as needed after reading the present specification, but all of them are protected by patent law within the scope of the claims of the present invention.
The embodiment of the invention provides a text similarity identification method based on digital fingerprints, which comprises the following steps: acquiring two pieces of text data information, which are respectively defined as reference text data information and comparison text data information; performing text preprocessing on the text data information and forming input data vector information through a hash function; searching text semantic information corresponding to the text data information from a preset database and forming characteristic data vector information through a hash function; forming matrix data information according to the correspondence between the input data vector information and the characteristic data vector information; processing and analyzing the matrix data information according to the pre-trained convolutional neural network model to form similar data information; and judging, through a similarity function, the similarity between the reference text data information and the comparison text data information according to the similar data information corresponding to the reference text data information and the similar data information corresponding to the comparison text data information.
In the embodiment of the invention, when the similarity of two pieces of text data information needs to be judged, the text is processed and encoded, and the multiple meanings corresponding to words or sentences in the text are also encoded, so that the two are associated to form matrix data; that is, the text information is converted into picture-like matrix data, and the trained convolutional neural network then operates on the data to judge the similarity of the two texts.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship, unless otherwise specified.
The embodiments of the present invention will be described in further detail with reference to the drawings.
Referring to fig. 1, an embodiment of the present invention provides a method for identifying text similarity based on digital fingerprints, and the main flow of the method is described as follows.
As shown in fig. 1:
step 1000: two pieces of text data information are acquired.
The two pieces of text data information are respectively defined as reference text data information and comparison text data information; each may be a complete text file, a passage of text, a sentence, or a word. In this embodiment, a complete patent text file is preferred.
Step 2000: text preprocessing is performed on the text data information, and input data vector information is formed through a hash function.
Since different text data information has different formats, the text data information needs to be fragmented and structured to the same format, so that the text can be identified by a program, and reading and operation of subsequent data are facilitated. As shown in fig. 2, the method for preprocessing the text data information is as follows:
Step 2100: the text data information is normalized to acquire first data information.
As shown in fig. 3, the method for performing normalization processing on text data information is as follows:
Step 2110: format conversion is performed on the text under examination.
Text in formats such as Word and PDF is recognized by a program, converted to a unified format, and stored in a database with a unified structure, in which the attribute f_alarm_title is the text title and f_after_content is the full text with html tags removed. The method mainly uses the full-text information in the attribute f_after_content.
Step 2120: noise is removed using regular expressions.
A regular expression is a logical formula that operates on character strings (both ordinary characters, e.g. the letters a to z, and special characters, called "metacharacters"): predefined specific characters and combinations of them form a "rule string" that expresses filtering logic for character strings. A regular expression is a text pattern that describes one or more character strings to be matched when searching text.
Step 2130: English letters are normalized.
Normalization is a way of simplifying calculation, in which a dimensional expression is transformed into a dimensionless expression, i.e. a scalar. It is used in many calculations; in this embodiment, to prevent interference from letter case, all upper-case letters are converted to lower case.
Step 2140: by deactivating the vocabulary, stop words in the text are filtered out.
The Stop word is a word that is automatically filtered before or after processing natural language data (or text) in order to save storage space and improve search efficiency in information retrieval, and these Words are called Stop Words. The stop words are manually input and are not automatically generated, and the generated stop words form a stop word list. However, no explicit deactivation vocabulary can be applied to all tools. Even some tools explicitly avoid the use of stop words to support phrase searching.
Through the steps, format conversion is carried out on the text, noise such as numbers, stop words, prepositions, special symbols and the like in the text to be detected is filtered, the words are normalized, interference of upper and lower cases of English letters is removed, and the like, and then first data information is formed.
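The normalization steps above (noise removal by regular expression, lower-casing, stop-word filtering) might be sketched as follows. The stop-word list, the regular expressions, and the choice to keep commas, periods and semicolons for the later separation step are all assumptions made for illustration and are not specified in the text.

```python
import re

# Hypothetical stop-word list; the text says such lists are compiled manually.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are"}

# Tokens are runs of letters (Latin or CJK) or one sentence-level punctuation mark.
TOKEN_PATTERN = re.compile(r"[A-Za-z\u4e00-\u9fff]+|[,.;，。；]")


def normalize_text(raw: str) -> str:
    """Strip markup and noise, lower-case letters, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)   # remove html-like tags
    text = text.lower()                   # remove upper/lower case interference
    tokens = TOKEN_PATTERN.findall(text)  # digits and special symbols are dropped here
    return " ".join(t for t in tokens if t not in STOP_WORDS)


print(normalize_text("<p>The Method of 2 TEXTS!</p>"))
```

Sentence punctuation is deliberately kept so that a later stage can still split the first data information into sentence-sized units.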
Step 2200: hash value calculation is performed on the first data information through a hash function to obtain hash value sequence information.
Many different hash functions may be adopted, for example the CRC32, Tomas.Wang, PJW, MD5, and SHA256 operation modes.
The CRC32 operation mode is a polynomial coding technique with a relatively high generation speed and is the preferred choice. Its specific steps are as follows:
A generator polynomial is selected as follows:
G-32 = 100000100110000010001110110110111.
The data D is shifted left by K bits, with zeros filled in on the right (where K is the highest power of the generator polynomial).
The shifted, zero-filled data is divided by G using modulo-2 division.
The remainder obtained is the CRC code; the specific calculation formula is as follows:
[Formula image not reproduced: CRC calculation formula]
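The CRC steps listed above can be sketched as a plain modulo-2 long division. This is a textbook illustration under stated assumptions: the function name is hypothetical, and the sketch omits the bit reflection, initial value and final XOR used by common CRC-32 implementations such as `zlib.crc32`, so its results differ from theirs.

```python
GENERATOR = 0b100000100110000010001110110110111  # the 33-bit G-32 given in the text
K = GENERATOR.bit_length() - 1                    # highest power of the generator: 32


def crc_remainder(data: bytes) -> int:
    """Remainder of (D shifted left by K bits) under modulo-2 division by G."""
    dividend = int.from_bytes(data, "big") << K   # shift left, zero-fill on the right
    # Modulo-2 long division: XOR the generator, aligned to the highest set bit.
    while dividend.bit_length() > K:
        shift = dividend.bit_length() - GENERATOR.bit_length()
        dividend ^= GENERATOR << shift
    return dividend


print(hex(crc_remainder(b"digital fingerprint")))
```

One property of this construction is that appending the remainder to the shifted data yields a value exactly divisible by G, which is how a receiver can check integrity.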
The Tomas.Wang operation mode is a shift hash that avoids costly multiplication operations. Its specific steps are as follows:
The data is shifted to the left by N bits.
The data is shifted to the right by M bits.
Step 1 and step 2 are repeated a certain number of times.
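A multiplication-free shift hash in the spirit of the steps above might look like the sketch below. The text gives no values for N, M or the repeat count, and does not say how the shifted value is combined with the original, so the XOR folding, the 32-bit width, and the defaults used here are all assumptions.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word size


def shift_hash(key: int, n: int = 13, m: int = 17, rounds: int = 3) -> int:
    """Mix a 32-bit key using only shifts and XOR (no multiplication)."""
    h = key & MASK32
    for _ in range(rounds):
        h ^= (h << n) & MASK32  # step 1: shift left by N bits and fold in
        h ^= h >> m             # step 2: shift right by M bits and fold in
    return h


print(shift_hash(30), shift_hash(22))
```

Each XOR-with-shift step is invertible on 32-bit values, so distinct 32-bit keys always produce distinct outputs in this sketch.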
The PJW operation mode has the highest computational cost. Its steps are as follows:
A value H is defined as H shifted left by 4 bits plus the data read in.
Whether the upper 4 bits are all 0 is checked; if they are not all 0, the highest 8 bits are mixed with the other bits.
The highest 4 bits are reset to 0, and the loop continues from step 1 until the end of the data.
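For illustration, the commonly published 32-bit form of the PJW hash is sketched below. The exact bit widths of the variant intended in the text are not stated, so the constants here (a right shift by 24 and the mask 0xF0000000) are taken from the usual formulation and should be read as assumptions.

```python
def pjw_hash(data: bytes) -> int:
    """Common 32-bit PJW hash: shift in each byte, fold the top 4 bits back down."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF  # shift left by 4 bits, add the data read in
        high = h & 0xF0000000               # isolate the upper 4 bits
        if high:
            h ^= high >> 24                 # mix the upper bits into the lower bits
            h &= ~high & 0xFFFFFFFF         # reset the upper 4 bits to 0
    return h


print(pjw_hash(b"preschool education"))
```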
The MD5 and SHA256 operation modes are both existing hash algorithms and are therefore not described here.
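Since MD5 and SHA256 are standard algorithms, a fingerprint can be obtained from the Python standard library's `hashlib` module, as in the short sketch below. The wrapper function names and the UTF-8 encoding are assumptions for illustration.

```python
import hashlib


def md5_fingerprint(text: str) -> str:
    """Hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha256_fingerprint(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print(md5_fingerprint("text similarity"))
print(sha256_fingerprint("text similarity"))
```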
Step 2300: the hash value sequence information is taken as input data vector information according to a preset separation rule.
The separation rule takes each sentence as a basic unit: punctuation symbols such as commas, periods and semicolons serve as separation points, and the hash value sequence information, separated and arranged according to these points, forms the corresponding input data vector information.
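One possible reading of the separation rule is sketched below: the normalized text is split at punctuation marks and each resulting unit is hashed, giving one vector entry per unit. The separator set, the use of `zlib.crc32` as the hash, and the function name are assumptions, and the order of hashing and splitting is an interpretation of the text.

```python
import re
import zlib
from typing import List

SEPARATORS = r"[,.;，。；]"  # assumed separation points


def input_vector(first_data: str) -> List[int]:
    """Split normalized text at punctuation and hash each unit with CRC32."""
    units = [u.strip() for u in re.split(SEPARATORS, first_data)]
    return [zlib.crc32(u.encode("utf-8")) for u in units if u]


print(input_vector("text similarity, digital fingerprint. hash function"))
```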
Step 3000: text semantic information corresponding to the text data information is looked up in a preset database, and characteristic data vector information is formed through a hash function.
For the first data information obtained by normalizing the text data information, the corresponding text semantic information is looked up in the preset database. The text semantic information may be the meanings under different contexts, and each meaning is encoded through a hash function; the encoding may use the three hash functions described in step 2200 or other hash functions. The corresponding characteristic data vector information is formed after the hash function encoding.
Step 4000: matrix data information is formed according to the correspondence between the input data vector information and the characteristic data vector information.
The input data vector information and the characteristic data vector information have a corresponding relationship, that is, one input data vector information can correspond to one characteristic data vector information or a plurality of characteristic data vector information, so that corresponding matrix data information is established based on the corresponding relationship.
As shown in fig. 4, in this embodiment, several hash functions are set, and high-dimensional matrix data information is formed according to a hash operation formula, which is as follows:
Hash(X_t Y) Y_i · Z, where Y_i denotes the operation on the i-th hash dimension.
The input data vector information is denoted X_1, X_2, X_3, …, X_N; the characteristic data vector information is denoted Z_1, Z_2, Z_3, …, Z_M; and the operation data vector information formed according to the preset plurality of hash functions is denoted Y_1, Y_2, Y_3, …, Y_P.
Suppose the phrase "preschool education" appearing in a patent is encoded. The phrase may denote either an industry or an activity and therefore has different characteristics in different contexts. An example of the calculation is as follows:
Suppose that the input data vector information is coded as X [30 ], the existing characteristic data vector information is coded as Z [ 1], and two hash operations Y [H1 H2] are selected.
The following data are obtained after the matrix operation:
[Figure image not reproduced: result of the matrix operation]
The result is further mapped into the feature space by the Y matrix operation to obtain a three-dimensional matrix:
[ [30H1 30H2][22H1 22H2], [30H1 30H2][22H1 22H2] ]
where Hn indicates that the corresponding hash operation is performed to generate the identification fingerprint.
Combining a plurality of hash operations turns the matrix data as a whole into multidimensional matrix data. Because a hash function can produce the same result for different data, when some part of the data collides, the other hash operations can distinguish it, which improves accuracy and makes the data as a whole more stable.
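The idea of stacking several hash operations into a three-dimensional structure can be illustrated with the sketch below. The hash operation formula in the text cannot be fully recovered, so everything here is an assumption: the pairing of an input unit with a semantic feature by string concatenation, the use of `zlib.crc32` and `zlib.adler32` as stand-ins for H1 and H2, and the index order (feature, input unit, hash), which mirrors the layout of the example matrix above.

```python
import zlib
from typing import Callable, List

HashFn = Callable[[bytes], int]
HASHES: List[HashFn] = [zlib.crc32, zlib.adler32]  # stand-ins for H1 and H2


def fingerprint_tensor(
    x: List[str], z: List[str], hashes: List[HashFn] = HASHES
) -> List[List[List[int]]]:
    """tensor[j][t][i] is hash i applied to input unit x[t] paired with feature z[j]."""
    return [
        [[h(f"{xt}|{zj}".encode("utf-8")) for h in hashes] for xt in x]
        for zj in z
    ]


tensor = fingerprint_tensor(["preschool education", "method"], ["industry", "activity"])
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

If two entries collide under one hash function, the entry at the same position under the other hash function will usually still differ, which is the stabilizing effect the text describes.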
Step 5000: the matrix data information is processed and analyzed by the pre-trained convolutional neural network model to form similar data information.
As shown in fig. 5, the method for forming similar data information is as follows:
Step 5110: the high-dimensional matrix data information is processed by the pre-trained convolutional neural network model to acquire the required characteristic data information.
Wherein the pre-trained convolutional neural network model comprises eight convolutional layers; the eight convolutional layers respectively correspond to the eight pre-trained convolutional neural network models.
Step 5120: inputting the acquired required characteristic data information into a preset activation function to convert linear data into nonlinear data so as to form modified characteristic data information.
As shown in fig. 6, the activation function is preferably the ReLU function, ReLU(x) = max(0, x).
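The standard definition of ReLU, ReLU(x) = max(0, x), applied element-wise to a feature map, can be sketched as follows. The helper names and the nested-list representation of the feature map are assumptions for illustration.

```python
from typing import List


def relu(x: float) -> float:
    """Standard rectified linear unit: negative inputs become 0."""
    return max(0.0, x)


def relu_map(features: List[List[float]]) -> List[List[float]]:
    """Apply ReLU element-wise to a 2-D feature map."""
    return [[relu(v) for v in row] for row in features]


print(relu_map([[-1.5, 0.0, 2.0], [3.0, -0.5, 1.0]]))
```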
Step 5130: the modified characteristic data information is mean-pooled by a pooling layer, and pooled characteristic data information is output.
As shown in fig. 7, mean pooling merges each 3 × 3 block of the data matrix into a single number by averaging the nine numbers in the block. For example, a 9 × 9 matrix can be divided into nine 3 × 3 blocks, and a 3 × 3 matrix is obtained after all the blocks are pooled. The purpose of pooling is to reduce the amount of data.
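The 9 × 9 to 3 × 3 example above can be sketched in plain Python as follows. Non-overlapping blocks are assumed, and the function name and sample grid are hypothetical.

```python
from typing import List


def mean_pool(matrix: List[List[float]], size: int = 3) -> List[List[float]]:
    """Average each non-overlapping size x size block into a single number."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % size or cols % size:
        raise ValueError("matrix dimensions must be multiples of the block size")
    pooled = []
    for r in range(0, rows, size):
        pooled_row = []
        for c in range(0, cols, size):
            block = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(sum(block) / (size * size))
        pooled.append(pooled_row)
    return pooled


# A 9 x 9 grid whose entry at (r, c) is r * 9 + c.
grid = [[float(r * 9 + c) for c in range(9)] for r in range(9)]
pooled = mean_pool(grid)
print(len(pooled), len(pooled[0]))
```

The 81 input values are reduced to 9, which is the data reduction the text refers to.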
Step 5140: the pooled feature data information is correlated through a fully connected layer to form similar data information.
As shown in fig. 8, the method of training the convolutional neural network model is as follows:
Step 5210: constructing an RPN (region proposal network) convolutional neural network;
Step 5220: initializing the RPN convolutional neural network, and initializing the parameters to be trained in the network with different small random numbers;
Step 5230: taking two pieces of text data information corresponding to different similarities as input training sample data, giving the input training sample data reference frames with multiple scales and proportions, training the RPN by inputting the reference frames of the training sample data into the initialized RPN convolutional neural network, and adjusting the parameters of the RPN convolutional neural network by using the back propagation (BP) algorithm to minimize the loss function value;
Step 5240: applying the trained RPN convolutional neural network model to the training sample data to obtain a rough similarity selection box of the training sample set.
Using training sample data built from many pieces of text data information with different similarities, the best rough similarity selection box is obtained through training on a large amount of data, so that the required characteristic data information can subsequently be selected more accurately through the rough similarity selection box, improving the accuracy of the whole neural network.
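The initialize-then-minimize pattern of steps 5220 and 5230 can be illustrated on a deliberately tiny model. The sketch below is not an RPN and does not back-propagate through convolutional layers; it assumes a single linear unit, a mean squared loss, and made-up two-feature samples labeled with a similarity score, purely to show small random initialization followed by gradient updates that reduce the loss.

```python
import random

def predict(w, b, x):
    """Linear unit standing in for the network output (an assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mean_squared_loss(w, b, samples):
    return sum((predict(w, b, x) - y) ** 2 for x, y in samples) / len(samples)

def train(samples, lr=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0][0])
    # Initialize the trainable parameters with different small random numbers.
    w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    b = rng.uniform(-0.01, 0.01)
    history = [mean_squared_loss(w, b, samples)]
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in samples:
            err = predict(w, b, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2.0 * err * xi / n
            grad_b += 2.0 * err / n
        # Gradient step: move the parameters against the loss gradient.
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
        history.append(mean_squared_loss(w, b, samples))
    return w, b, history

# Hypothetical samples: (features of a text pair, labeled similarity).
samples = [([1.0, 1.0], 1.0), ([0.5, 0.8], 0.6), ([0.0, 0.3], 0.1), ([0.9, 0.7], 0.8)]
w, b, history = train(samples)
```

The `history` list records the loss before training and after every epoch, so comparing its first and last entries shows whether the parameter adjustments actually lowered the loss.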
Step 6000: Using a similarity function, judge the similarity between the reference text data information and the comparison text data information according to the similar data information corresponding to each of them.
The similarity between the reference text data information and the comparison text data information is judged through the similarity function as follows:
Define the data output for the reference text data information as G1(X1) and the data output for the comparison text data information as G2(X2); the loss function for the similarity is then:
L(X1,X2)=||G1(X1)-G2(X2)||
where L(X1, X2) measures the similarity between the reference text data information and the comparison text data information. The judgment is based on how close the two outputs are: if the difference between them is small, the two texts are similar; if the difference is large, their similarity is low.
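The loss L(X1, X2) = ||G1(X1) - G2(X2)|| can be computed directly once both outputs are available as vectors. The sketch below assumes the norm is the Euclidean (L2) norm, which the text does not specify; the function name `similarity_loss` and the sample vectors are illustrative.

```python
import math

def similarity_loss(g1, g2):
    """Euclidean distance between two output vectors; smaller means more similar."""
    if len(g1) != len(g2):
        raise ValueError("output vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))

reference_output = [0.0, 0.0]    # stands in for G1(X1)
comparison_output = [3.0, 4.0]   # stands in for G2(X2)
distance = similarity_loss(reference_output, comparison_output)
identical = similarity_loss(reference_output, reference_output)
```

Because the value is a distance, a result of zero corresponds to identical outputs, and any decision threshold for "similar" would have to be chosen separately.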
An embodiment of the present invention provides a storage medium storing a set of instructions adapted to be loaded by a processor to execute the individual steps described in the flows of figs. 1-8.
The computer storage medium includes, for example, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Based on the same inventive concept, an embodiment of the present invention provides an identification apparatus, including: a processor for loading and executing a set of instructions; and the storage medium described above.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative: the division into modules or units is only one logical division, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The above embodiments describe the technical solutions of the present application in detail, but they serve only to help in understanding the method and core idea of the present invention and should not be construed as limiting the present invention. Those skilled in the art should also appreciate that they can easily conceive of various changes and substitutions within the technical scope of the present disclosure.

Claims (9)

1. A text similarity identification method based on digital fingerprints is characterized by comprising the following steps:
acquiring two pieces of text data information, which are respectively defined as reference text data information and comparison text data information;
performing text preprocessing on the text data information and forming input data vector information through a hash function; searching text semantic information corresponding to the text data information from a preset database and forming characteristic data vector information through a hash function; forming matrix data information according to the mutual corresponding relation of the input data vector information and the characteristic data vector information;
processing and analyzing the matrix data information according to the pre-trained convolutional neural network model to form similar data information;
and judging, through a similarity function, the similarity between the reference text data information and the comparison text data information according to the similar data information corresponding to the reference text data information and the similar data information corresponding to the comparison text data information.
2. The method for recognizing text similarity based on digital fingerprints as claimed in claim 1, wherein the method for preprocessing the text data information comprises the following steps:
normalizing the text data information to obtain first data information;
performing hash value calculation on the first data information through a hash function to obtain hash value sequence information;
and taking the hash value sequence information as input data vector information according to a preset separation rule.
3. The method for recognizing text similarity based on digital fingerprints as claimed in claim 1, wherein: the hash functions are provided with a plurality of hash operation formulas, high-dimensional matrix data information is formed according to the hash operation formulas, and the formulas are as follows:
$\mathrm{Hash}(X_t\, Y)\, Y_i \cdot Z$; $Y_i$ represents the operation on the $i$-th hash dimension;
wherein the input data vector information is denoted $X_1, X_2, X_3, \dots, X_N$; the characteristic data vector information is denoted $Z_1, Z_2, Z_3, \dots, Z_M$; and the operation data vector information formed according to a plurality of preset hash functions is denoted $Y_1, Y_2, Y_3, \dots, Y_P$.
4. The method for recognizing text similarity based on digital fingerprints as claimed in claim 1, wherein the method for forming the similar data information comprises the following steps:
processing high-dimensional matrix data information according to a pre-trained convolutional neural network model to acquire required characteristic data information;
inputting the acquired required characteristic data information into a preset activation function, and converting linear data into nonlinear data to form modified characteristic data information;
the modified characteristic data information is subjected to mean pooling of a pooling layer and then pooled characteristic data information is output;
and correlating the pooled feature data information through a full connection layer to form similar data information.
5. The method for recognizing text similarity based on digital fingerprints as claimed in claim 4, wherein: the pre-trained convolutional neural network model comprises eight convolutional layers; the eight convolutional layers respectively correspond to the eight pre-trained convolutional neural network models.
6. The method for recognizing text similarity based on digital fingerprints as claimed in claim 5, wherein the convolutional neural network model is trained as follows:
constructing an RPN (region proposal network) convolutional neural network;
initializing the RPN convolutional neural network, and initializing parameters to be trained in the network by using different small random numbers;
taking two pieces of text data information corresponding to different similarities as input training sample data, giving the input training sample data reference frames with multiple scales and multiple proportions, training the RPN by inputting the reference frames of the training sample data into the initialized RPN convolutional neural network, and adjusting parameters of the RPN convolutional neural network by using the back propagation (BP) algorithm to minimize a loss function value;
and applying the trained RPN convolutional neural network model on the training sample data to obtain a rough similarity selection box of the training sample set.
7. The method for recognizing text similarity based on digital fingerprints as claimed in claim 1, wherein the similarity function is used to determine the similarity between the reference text data information and the comparison text data information as follows:
defining the data output for the reference text data information as G1(X1), and the data output for the comparison text data information as G2(X2), and obtaining the loss function formula of the similarity as follows:
L(X1,X2)=||G1(X1)-G2(X2)||
wherein L (X1, X2) is the similarity between the reference text data information and the comparison text data information.
8. A storage medium having stored thereon a set of instructions adapted to be loaded by a processor and to perform a process comprising:
acquiring two pieces of text data information, which are respectively defined as reference text data information and comparison text data information;
performing text preprocessing on the text data information and forming input data vector information through a hash function; searching text semantic information corresponding to the text data information from a preset database and forming characteristic data vector information through a hash function; forming matrix data information according to the mutual corresponding relation of the input data vector information and the characteristic data vector information;
processing and analyzing the matrix data information according to the pre-trained convolutional neural network model to form similar data information;
and judging, through a similarity function, the similarity between the reference text data information and the comparison text data information according to the similar data information corresponding to the reference text data information and the similar data information corresponding to the comparison text data information.
9. An identification device, comprising:
a processor for loading and executing a set of instructions; and
the storage medium of claim 8.
CN201910142914.4A 2019-02-26 2019-02-26 Text similarity identification method based on digital fingerprints, storage medium and device Active CN109902162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910142914.4A CN109902162B (en) 2019-02-26 2019-02-26 Text similarity identification method based on digital fingerprints, storage medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910142914.4A CN109902162B (en) 2019-02-26 2019-02-26 Text similarity identification method based on digital fingerprints, storage medium and device

Publications (2)

Publication Number Publication Date
CN109902162A CN109902162A (en) 2019-06-18
CN109902162B true CN109902162B (en) 2022-11-29

Family

ID=66945664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910142914.4A Active CN109902162B (en) 2019-02-26 2019-02-26 Text similarity identification method based on digital fingerprints, storage medium and device

Country Status (1)

Country Link
CN (1) CN109902162B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112295231A (en) * 2020-11-05 2021-02-02 中国联合网络通信集团有限公司 Operation training method and server
CN114338090A (en) * 2021-12-08 2022-04-12 北京达佳互联信息技术有限公司 Data security detection method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168952B (en) * 2017-05-15 2021-06-04 北京百度网讯科技有限公司 Information generation method and device based on artificial intelligence

Also Published As

Publication number Publication date
CN109902162A (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN110851596A (en) Text classification method and device and computer readable storage medium
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN109063055A (en) Homologous binary file search method and device
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN111291177A (en) Information processing method and device and computer storage medium
US11892998B2 (en) Efficient embedding table storage and lookup
CN108319583A (en) Method and system for extracting knowledge from Chinese language material library
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
CN112328655A (en) Text label mining method, device, equipment and storage medium
CN113127607A (en) Text data labeling method and device, electronic equipment and readable storage medium
CN111680146A (en) Method and device for determining new words, electronic equipment and readable storage medium
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN111199170B (en) Formula file identification method and device, electronic equipment and storage medium
CN113420127A (en) Threat information processing method, device, computing equipment and storage medium
CN105808522A (en) Method and apparatus for semantic association
CN117909505B (en) Event argument extraction method and related equipment
CN112364666B (en) Text characterization method and device and computer equipment
CN115688771B (en) Document content comparison performance improving method and system
CN116720123B (en) Account identification method, account identification device, terminal equipment and medium
CN115422934B (en) Entity identification and linking method and system for space text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant