CN112541338A - Similar text matching method and device, electronic equipment and computer storage medium - Google Patents

Info

Publication number
CN112541338A
Authority
CN
China
Prior art keywords
standard
text
target
semantic representation
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011435054.2A
Other languages
Chinese (zh)
Inventor
谢静文
阮晓雯
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011435054.2A
Publication of CN112541338A
Priority to PCT/CN2021/083714 (WO2022121171A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to speech semantics technology and discloses a similar text matching method comprising: extracting feature words from an obtained standard text and constructing a standard semantic representation from the extraction result; generating a standard key value pair table from the standard feature words and the standard semantic representation; extracting feature words from an obtained target text and constructing a target semantic representation; calculating the similarity between the target feature words and the standard feature words, and screening out semantic representations to be matched according to the similarity; performing representation matching between the target semantic representation and the semantic representations to be matched to obtain a matching probability; and determining that the standard text corresponding to a standard semantic representation whose matching probability is greater than a preset probability threshold is a similar text of the target text. The invention also relates to blockchain technology: the standard text may be stored in a blockchain node. The invention further provides a similar text matching device, electronic equipment and a computer-readable storage medium. The invention can solve the problem of low matching accuracy of similar texts.

Description

Similar text matching method and device, electronic equipment and computer storage medium
Technical Field
The present invention relates to the field of speech semantic technology, and in particular, to a method and an apparatus for matching similar texts, an electronic device, and a computer-readable storage medium.
Background
As the demand for document retrieval and duplicate checking grows (for example, redundant copies must be removed when multiple documents are stored, and similar texts must be found when documents are checked for duplication), people increasingly rely on computers to process texts and, from the processing results, to find texts similar to a target text.
At present, most similar text matching methods are keyword-based: keywords are extracted from each text, the keywords of different texts are compared to obtain their degree of overlap, and the similarity between the texts is judged from that overlap.
Disclosure of Invention
The invention provides a similar text matching method, a similar text matching device and a computer readable storage medium, and mainly aims to solve the problem of low accuracy of similar text matching.
In order to achieve the above object, the present invention provides a matching method for similar texts, comprising:
acquiring a standard text, and extracting feature words of the standard text to obtain standard feature words;
constructing a standard semantic representation corresponding to the standard feature words;
generating a standard key value pair table according to the standard feature words and the standard semantic representation;
acquiring a target text, extracting feature words of the target text to obtain target feature words, and constructing a target semantic representation corresponding to the target feature words;
calculating the similarity between the target feature words and the standard feature words in the standard key value pair table, and determining the standard semantic representation corresponding to each standard feature word whose similarity is greater than a preset similarity threshold to be a semantic representation to be matched;
performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation;
and determining that the standard text corresponding to the standard semantic representation with the matching probability larger than a preset probability threshold is the similar text of the target text.
Optionally, the extracting the feature words from the standard text to obtain the standard feature words includes:
performing word segmentation processing on the standard text to obtain a plurality of text word segments;
calculating a word segmentation index for each of the plurality of text word segments;
and screening the text word segments according to the word segmentation indexes to obtain the standard feature words.
Optionally, the performing word segmentation processing on the standard text to obtain a plurality of text word segments includes:
deleting stop words contained in the standard text by using a preset stop word bank;
and performing word segmentation processing on the standard text after the stop words are deleted by utilizing a preset standard word bank to obtain a plurality of text word segments.
Optionally, the constructing a standard semantic representation corresponding to the standard feature word includes:
traversing the standard text and determining the position information of the standard feature words in the standard text;
and, according to the position information, taking the text within a preset length range before and after each standard feature word as the standard semantic representation corresponding to that standard feature word.
Optionally, the generating a standard key value pair table according to the standard feature words and the standard semantic representation includes:
taking each of the plurality of standard feature words as a primary key in the standard key value pair table;
and taking the standard semantic representation corresponding to each standard feature word as the value of its primary key in the standard key value pair table to obtain the standard key value pair table.
Optionally, the performing representation matching on the target semantic representation and the to-be-matched semantic representation to obtain a matching probability between the target semantic representation and the standard semantic representation includes:
performing word vector conversion on the target semantic representation to obtain a first representation vector;
performing word vector conversion on the standard semantic representation to obtain a second representation vector;
and performing a probability operation on the first representation vector and the second representation vector by using a pre-trained matching model to obtain the matching probability of the target semantic representation and the standard semantic representation.
Optionally, the performing word vector conversion on the target semantic representation to obtain a first representation vector includes:
obtaining a byte vector set corresponding to the target semantic representation, wherein the byte vector set comprises byte vectors of all bytes in the target semantic representation;
and respectively splicing byte vectors corresponding to each byte in the target semantic representation to obtain the first representation vector.
In order to solve the above problem, the present invention further provides a similar text matching apparatus, including:
the feature word extraction module is used for acquiring a standard text and extracting feature words from the standard text to obtain standard feature words;
the standard representation construction module is used for constructing a standard semantic representation corresponding to the standard feature words;
the key value pair table generating module is used for generating a standard key value pair table according to the standard feature words and the standard semantic representation;
the target representation construction module is used for acquiring a target text, extracting feature words from the target text to obtain target feature words, and constructing a target semantic representation corresponding to the target feature words;
the similarity calculation module is used for calculating the similarity between the target feature words and the standard feature words in the standard key value pair table, and determining the standard semantic representation corresponding to each standard feature word whose similarity is greater than a preset similarity threshold to be a semantic representation to be matched;
the representation matching module is used for performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation;
and the text screening module is used for determining that the standard text corresponding to the standard semantic representation with the matching probability larger than a preset probability threshold is the similar text of the target text.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the similar text matching method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is executed by a processor in an electronic device to implement the similar text matching method described above.
By extracting the feature words of the standard text and of the target text, together with the semantic representations corresponding to those feature words, and by calculating the similarity between the feature words of the two texts, the embodiment of the invention uses the feature words to pre-screen the standard texts, which can improve the efficiency of similar text matching; by matching the semantic representations of the standard text and the target text, it judges similarity using semantic representations that carry more semantic information, which improves the accuracy of similar text matching. Therefore, the similar text matching method, the similar text matching device, the electronic equipment and the computer-readable storage medium provided by the invention can solve the problem of low matching accuracy of similar texts.
Drawings
Fig. 1 is a schematic flowchart of a similar text matching method according to an embodiment of the present invention;
Fig. 2 is a functional block diagram of an apparatus for matching similar texts according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device implementing the similar text matching method according to an embodiment of the present invention;
Fig. 4 is an exemplary diagram of a standard key value pair table according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a similar text matching method. The execution subject of the similar text matching method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server, a terminal, and the like. In other words, the similar text matching method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Fig. 1 is a schematic flow chart of a similar text matching method according to an embodiment of the present invention. In this embodiment, the similar text matching method includes:
S1, acquiring a standard text, and extracting feature words from the standard text to obtain standard feature words.
In the embodiment of the present invention, the standard text is any text with characters, such as news text, novel paragraph text or thesis text.
In detail, the embodiment of the invention can use a Python statement with a data fetching function to acquire the standard text from the blockchain node in which the standard text is stored; the high data throughput of the blockchain node can improve the efficiency of acquiring the standard text.
In the embodiment of the present invention, the extracting the feature words from the standard text to obtain the standard feature words includes:
performing word segmentation processing on the standard text to obtain a plurality of text word segments;
calculating a word segmentation index for each of the plurality of text word segments;
and screening the text word segments according to the word segmentation indexes to obtain the standard feature words.
In detail, the performing word segmentation processing on the standard text to obtain a plurality of text word segments includes:
deleting stop words contained in the standard text by using a preset stop word bank;
and performing word segmentation processing on the standard text after the stop words are deleted by utilizing a preset standard word bank to obtain a plurality of text word segments.
Specifically, the preset stop word bank and the preset standard word bank are both word banks containing a plurality of word segments. The preset stop word bank stores a plurality of stop words, such as "rate" and "e.g. times". The preset standard word bank contains a plurality of non-stop-word segments, such as "eating" and "sleeping".
By performing word segmentation on the standard text, the embodiment of the invention divides a long standard text into a plurality of word segments; analyzing and processing these word segments is more efficient and more accurate than processing the standard text directly.
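The two sub-steps above (stop word deletion, then dictionary-based segmentation) can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not name a segmentation algorithm, so forward maximum matching is assumed here, and the word banks `STOP_WORDS` and `LEXICON`, the sample sentence and the `max_len` parameter are all hypothetical.

```python
def remove_stop_words(text, stop_words):
    """Delete every stop word in `stop_words` from `text`, longest first."""
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    return text


def segment(text, lexicon, max_len=4):
    """Forward maximum matching (an assumed algorithm): at each position,
    take the longest lexicon entry of at most `max_len` characters."""
    segments = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if candidate in lexicon or size == 1:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical word banks and sample text ("Xiaoming ate braised pork at noon today").
STOP_WORDS = {"的", "了"}
LEXICON = {"小明", "今天", "中午", "吃", "红烧肉"}

cleaned = remove_stop_words("小明今天中午吃了红烧肉", STOP_WORDS)
print(segment(cleaned, LEXICON))
```

Deleting stop words before segmentation follows the order given in the text; a production system would more likely rely on an established segmentation library than on this hand-written matcher.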
In the embodiment of the present invention, the word segmentation index refers to an index that can reflect the importance degree of a word segmentation, for example, a frequency index representing the occurrence frequency of a word segmentation, a weight index representing the weight of a word segmentation, and the like.
In an embodiment of the present invention, the calculating a word segmentation index of each word in the plurality of text words by using an index algorithm includes:
calculating a segmentation index for each of the plurality of text segments using an index algorithm as follows:
$$TD = TF_i \times IDF_i$$
where $TF_i$ is the frequency with which word segment $i$ occurs in the plurality of text word segments, and $IDF_i$ is the inverse of the frequency with which word segment $i$ occurs in the plurality of text word segments.
Further, the text word segments are screened by comparing their word segmentation indexes; that is, the text word segments whose word segmentation index is greater than a preset index threshold are selected as the standard feature words.
By extracting the feature words of the standard text, the embodiment of the invention reduces the amount of data in subsequent matching, which helps improve the efficiency of similar text matching.
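As an illustration of the index calculation and threshold screening described above, the sketch below computes a TF-IDF style index for each word segment and keeps the segments above a threshold. The source defines the inverse frequency only loosely, so the conventional smoothed form log((N + 1) / (df + 1)) over a reference corpus is assumed; the corpus, the function names and the threshold value are hypothetical.

```python
import math
from collections import Counter


def segment_indexes(segments, corpus):
    """Return {segment: TF * IDF} for one segmented text.

    TF  = occurrences of the segment / number of segments in this text.
    IDF = log((N + 1) / (df + 1)), where N is the number of texts in `corpus`
          and df the number of those texts containing the segment (assumed form).
    """
    counts = Counter(segments)
    total = len(segments)
    n_docs = len(corpus)
    indexes = {}
    for word, count in counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        tf = count / total
        idf = math.log((n_docs + 1) / (doc_freq + 1))
        indexes[word] = tf * idf
    return indexes


def select_feature_words(indexes, threshold):
    """Keep the segments whose index is strictly greater than the threshold."""
    return [word for word, value in indexes.items() if value > threshold]


# Hypothetical corpus of already segmented texts.
corpus = [
    ["xiaoming", "eats", "braised", "pork"],
    ["xiaoming", "eats", "noodles"],
    ["xiaoming", "sleeps"],
]
indexes = segment_indexes(corpus[0], corpus)
print(select_feature_words(indexes, threshold=0.1))
```

Segments that occur in every text of the corpus receive an index of zero under this form, so they are never selected as feature words.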
S2, constructing a standard semantic representation corresponding to the standard feature words.
In an embodiment of the present invention, the constructing of the standard semantic representation corresponding to the standard feature word includes:
traversing the standard text and determining the position information of the standard feature words in the standard text;
and, according to the position information, taking the text within a preset length range before and after each standard feature word as the standard semantic representation corresponding to that standard feature word.
For example, given the standard text "Xiaoming eats braised pork at noon today", in which "braised pork" is a standard feature word, the standard text is traversed to determine the position of the standard feature word "braised pork"; when the preset length range is 5 characters, the standard semantic representation corresponding to the standard feature word "braised pork" is "eats braised pork at noon today".
By constructing the standard semantic representation corresponding to each standard feature word, the embodiment of the invention makes the abstract standard feature words concrete, enriching their semantics and helping to improve the matching accuracy of similar texts.
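The window extraction in step S2 can be sketched as below. This is an illustrative reading only: the function name `semantic_representations`, the `span` parameter and the Chinese sample sentence ("Xiaoming eats braised pork at noon today", with the feature word meaning "braised pork") are assumptions, and the length range is counted in characters.

```python
def semantic_representations(text, feature_word, span=5):
    """Return, for every occurrence of `feature_word` in `text`, the feature
    word together with up to `span` characters before and after it."""
    windows = []
    start = text.find(feature_word)
    while start != -1:
        end = start + len(feature_word)
        # Clamp the left edge at 0; slicing clamps the right edge automatically.
        windows.append(text[max(0, start - span):end + span])
        start = text.find(feature_word, end)
    return windows


# Hypothetical sample: the feature word sits at the end of the sentence,
# so only the five characters before it are available as context.
print(semantic_representations("小明今天中午吃红烧肉", "红烧肉", span=5))
```

Returning one window per occurrence is a design choice of this sketch; the source does not say how repeated occurrences of the same feature word are handled.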
S3, generating a standard key value pair table according to the standard feature words and the standard semantic representation.
In this embodiment of the present invention, the generating a standard key value pair table according to the standard feature words and the standard semantic representation includes:
taking each of the plurality of standard feature words as a primary key in the standard key value pair table;
and taking the standard semantic representations corresponding to the standard feature words as the primary key values of the primary keys in the standard key value pair table to obtain the standard key value pair table.
In detail, referring to fig. 4, fig. 4 is an exemplary diagram of a standard key value pair table according to an embodiment of the present invention, and in fig. 4, different standard feature words are primary keys, and a corresponding standard semantic representation can be uniquely found according to the standard feature words.
In practical application, a large number of standard texts need to be subjected to similar text matching, so that the standard feature words and the standard semantic representations are stored in the standard key value pair table in a key value pair mode, and the efficiency of subsequent similar text matching can be improved by utilizing the standard key value pair table.
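A standard key value pair table of this kind maps naturally onto a dictionary. The sketch below is one possible shape, assuming an in-memory table; the function name and the sample entries are hypothetical, and rejecting duplicate keys is a choice made here so that each feature word resolves to exactly one representation.

```python
def build_key_value_table(entries):
    """Build the table from (feature_word, semantic_representation) pairs."""
    table = {}
    for word, representation in entries:
        # Each standard feature word is a primary key; a repeated key is
        # treated as an error so that every lookup is unambiguous.
        if word in table:
            raise KeyError(f"duplicate primary key: {word!r}")
        table[word] = representation
    return table


# Hypothetical entries.
table = build_key_value_table([
    ("braised pork", "eats braised pork at noon today"),
    ("noodles", "cooks noodles in the evening"),
])
print(table["braised pork"])
```

Lookup by feature word is then a constant-time operation on average, which is the efficiency benefit the paragraph above attributes to the key value pair layout.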
S4, obtaining a target text, extracting feature words from the target text to obtain target feature words, and constructing a target semantic representation corresponding to the target feature words.
In the embodiment of the invention, the target text is any text requiring similarity matching; the target text is analyzed to judge whether a standard text is similar to it.
In detail, the target text may be uploaded by the user.
In the embodiment of the present invention, the step of extracting the feature words from the target text to obtain the target feature words is consistent with the step of extracting the feature words from the standard text in step S1 to obtain the standard feature words, which is not described herein again.
The step of constructing the target semantic representation corresponding to the target feature word is consistent with the step of constructing the standard semantic representation corresponding to the standard feature word in step S2, and is not repeated here.
S5, calculating the similarity between the target feature words and the standard feature words in the standard key value pair table, and determining the standard semantic representation corresponding to each standard feature word whose similarity is greater than a preset similarity threshold to be a semantic representation to be matched.
In this embodiment of the present invention, the calculating the similarity between the target feature words and the standard feature words in the standard key value pair table includes:
calculating the similarity between the target feature word and the standard feature word in the standard key value pair table by using the following similarity algorithm:
$$Sim = \mathrm{Pearson}(R, S)$$
where R is the target feature word, S is the standard feature word, Pearson is the similarity operation, and Sim is the similarity between the target feature word and the standard feature word in the standard key value pair table.
Further, the embodiment of the present invention determines that the standard semantic representation corresponding to each standard feature word whose similarity is greater than the preset similarity threshold is a semantic representation to be matched. For example, suppose there are a target feature word A and standard feature words B, C and D, where the similarity between A and B is 40, the similarity between A and C is 50, and the similarity between A and D is 60. When the preset similarity threshold is 55, the standard semantic representation corresponding to the standard feature word D is determined to be the semantic representation to be matched.
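The screening in step S5 can be sketched as follows. The source does not say how a feature word is turned into numbers before the Pearson operation, so this sketch assumes each word already has a numeric vector (the vectors below are made up). It also uses the raw correlation coefficient, which lies in [-1, 1]; the scores of 40, 50 and 60 in the example above suggest some other scale, which is not specified.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # the coefficient is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


def representations_to_match(target_vec, word_vectors, table, threshold):
    """Return the representations of the standard feature words whose
    similarity to the target feature word is strictly above `threshold`."""
    return [
        table[word]
        for word, vec in word_vectors.items()
        if pearson(target_vec, vec) > threshold
    ]


# Hypothetical vectors for target word A and standard words B, C, D.
vec_a = [1, 2, 3, 4]
word_vectors = {"B": [4, 3, 2, 1], "C": [2, 1, 4, 3], "D": [1, 3, 2, 4]}
table = {"B": "representation of B", "C": "representation of C", "D": "representation of D"}
print(representations_to_match(vec_a, word_vectors, table, threshold=0.7))
```

With these made-up vectors only D passes the threshold, mirroring the A/B/C/D example in the text.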
S6, performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation.
In the embodiment of the present invention, the performing representation matching on the target semantic representation and the to-be-matched semantic representation to obtain the matching probability between the target semantic representation and the standard semantic representation includes:
performing word vector conversion on the target semantic representation to obtain a first representation vector;
performing word vector conversion on the standard semantic representation to obtain a second representation vector;
and performing a probability operation on the first representation vector and the second representation vector by using a pre-trained matching model to obtain the matching probability of the target semantic representation and the standard semantic representation.
In detail, the performing word vector transformation on the target semantic representation to obtain a first representation vector includes:
obtaining a byte vector set corresponding to the target semantic representation, wherein the byte vector set comprises byte vectors of all bytes in the target semantic representation;
and respectively splicing byte vectors corresponding to each byte in the target semantic representation to obtain the first representation vector.
For example, byte 1, byte 2, and byte 3 exist in the target semantic representation, where the byte vector corresponding to byte 1 is byte vector a, the byte vector corresponding to byte 2 is byte vector b, and the byte vector corresponding to byte 3 is byte vector c, and then the byte vectors corresponding to each byte are respectively spliced to obtain a first representation vector abc.
The step of performing word vector transformation on the standard semantic representation to obtain a second representation vector is consistent with the step of performing word vector transformation on the target semantic representation to obtain a first representation vector, and is not repeated here.
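The splicing of byte vectors into a representation vector can be sketched as below. Several points are assumptions: "byte" is read literally as a UTF-8 byte (the translated term could also mean a character), and the lookup table `BYTE_VECTORS` is a toy stand-in for whatever embedding table an actual system would use.

```python
def representation_vector(text, byte_vectors):
    """Concatenate, in order, the vector of every UTF-8 byte of `text`."""
    vector = []
    for byte in text.encode("utf-8"):
        vector.extend(byte_vectors[byte])
    return vector


# Toy two-dimensional vector for each of the 256 possible byte values.
BYTE_VECTORS = {b: [b / 255.0, float(b % 2)] for b in range(256)}

first = representation_vector("ab", BYTE_VECTORS)
print(len(first))  # two bytes, two dimensions each
```

Because the vectors are concatenated, the length of the result grows with the length of the representation; a matching model that expects fixed-size input would need padding or truncation, which the source does not discuss.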
Further, the first representation vector and the second representation vector are input into a pre-trained matching model, and the matching probability between the first representation vector and the second representation vector is calculated by the matching model.
In detail, the matching model is a multi-hop model, which includes but is not limited to a CogQA model and an answeringTasks model. Using a multi-hop model as the matching model to perform the probability operation on the first representation vector and the second representation vector can improve both the efficiency of calculating the matching probability and the accuracy of the calculated matching probability.
S7, determining that the standard text corresponding to the standard semantic representation with the matching probability larger than the preset probability threshold is the similar text of the target text.
In the embodiment of the present invention, if the matching probability is less than or equal to a preset probability threshold, it is determined that the standard text corresponding to the standard semantic representation is not the similar text of the target text, and if the matching probability is greater than the probability threshold, it is determined that the standard text corresponding to the standard semantic representation is the similar text of the target text.
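The decision rule in step S7 reduces to a strict comparison against the threshold, as the sketch below shows. The matching probabilities are assumed to have been produced already by the matching model; the text identifiers, the probability values and the default threshold of 0.5 are hypothetical.

```python
def similar_texts(match_probabilities, threshold=0.5):
    """Return the ids of the standard texts whose matching probability is
    strictly greater than `threshold`; a probability equal to the threshold
    does not count as similar."""
    return [
        text_id
        for text_id, probability in match_probabilities.items()
        if probability > threshold
    ]


# Hypothetical outputs of the matching model for three standard texts.
probabilities = {"doc1": 0.9, "doc2": 0.5, "doc3": 0.2}
print(similar_texts(probabilities, threshold=0.5))
```

Here "doc2" is excluded because its probability equals the threshold, in line with the "less than or equal to" branch described above.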
By extracting the feature words of the standard text and of the target text, together with the semantic representations corresponding to those feature words, and by calculating the similarity between the feature words of the two texts, the embodiment of the invention uses the feature words to pre-screen the standard texts, which can improve the efficiency of similar text matching; by matching the semantic representations of the standard text and the target text, it judges similarity using semantic representations that carry more semantic information, which improves the accuracy of similar text matching. Therefore, the similar text matching method provided by the invention can solve the problem of low matching accuracy of similar texts.
Fig. 2 is a functional block diagram of an apparatus for matching similar texts according to an embodiment of the present invention.
The similar text matching apparatus 100 according to the present invention may be installed in an electronic device. According to the realized functions, the similar text matching device 100 can comprise a feature word extraction module 101, a standard representation construction module 102, a key-value pair table generation module 103, a target representation construction module 104, a similarity calculation module 105, a representation matching module 106 and a text screening module 107. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the feature word extraction module 101 is configured to obtain a standard text, and perform feature word extraction on the standard text to obtain a standard feature word.
In the embodiment of the present invention, the standard text is any text with characters, such as news text, novel paragraph text or thesis text.
In detail, the embodiment of the invention can use a Python statement with a data fetching function to acquire the standard text from the blockchain node in which the standard text is stored; the high data throughput of the blockchain node can improve the efficiency of acquiring the standard text.
In the embodiment of the present invention, the feature word extraction module 101 is specifically configured to:
acquiring a standard text;
performing word segmentation processing on the standard text to obtain a plurality of text word segments;
calculating a word segmentation index for each of the plurality of text word segments;
and screening the text word segments according to the word segmentation indexes to obtain the standard feature words.
In detail, the performing word segmentation processing on the standard text to obtain a plurality of text word segments includes:
deleting stop words contained in the standard text by using a preset stop word bank;
and performing word segmentation processing on the standard text after the stop words are deleted by utilizing a preset standard word bank to obtain a plurality of text word segments.
Specifically, the preset stop word bank and the preset standard word bank are both word banks containing a plurality of word segments. The preset stop word bank stores a plurality of stop words, such as "rate" and "e.g. times". The preset standard word bank contains a plurality of word segments that are not stop words, such as "eating" and "sleeping".
By performing word segmentation processing on the standard text, the embodiment of the invention divides a long standard text into a plurality of word segments; analyzing and processing these word segments is more efficient and more accurate than processing the standard text directly.
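The two segmentation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the word banks `STOP_WORDS` and `LEXICON`, the naive substring deletion, and the use of forward maximum matching as the dictionary-based segmentation strategy are all assumptions, since the text does not name a specific algorithm.

```python
from typing import Iterable, List, Set


def remove_stop_words(text: str, stop_words: Iterable[str]) -> str:
    """Delete every occurrence of each stop word from the text."""
    # Longest first, so a short stop word cannot break up a longer one.
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    return text


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Forward maximum matching against the lexicon (assumed strategy)."""
    max_len = max((len(w) for w in lexicon), default=1)
    segments: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                segments.append(piece)
                i += size
                break
    return segments


STOP_WORDS = ["the", "and"]        # hypothetical stop word bank
LEXICON = {"eating", "sleeping"}   # hypothetical standard word bank

cleaned = remove_stop_words("theeatingandsleeping", STOP_WORDS)
segments = segment(cleaned, LEXICON)
print(segments)
```

Characters that match no lexicon entry are emitted one at a time, so the function always terminates even on out-of-vocabulary input.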
In the embodiment of the present invention, the word segmentation index refers to an index that can reflect the importance degree of a word segmentation, for example, a frequency index representing the occurrence frequency of a word segmentation, a weight index representing the weight of a word segmentation, and the like.
In an embodiment of the present invention, calculating the word segmentation indexes of the plurality of text word segments includes:
calculating the word segmentation index of each of the plurality of text word segments using the following index algorithm:
$$TD = TF_i \times IDF_i$$
wherein $TF_i$ is the frequency of occurrence of word segment $i$ in the plurality of text word segments, and $IDF_i$ is the inverse frequency of word segment $i$ in the plurality of text word segments.
Further, the text word segments are screened by comparing their word segmentation indexes, that is, the text word segments whose word segmentation index is larger than a preset index threshold are selected as standard feature words.
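The index calculation and threshold screening can be sketched as below. The text does not say over which collection the inverse frequency is computed, so this sketch assumes a hypothetical reference `corpus` of other documents and uses a common smoothed inverse document frequency; the function names, the corpus and the threshold value are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def word_indexes(segments: List[str], corpus: List[Set[str]]) -> Dict[str, float]:
    """Return TF_i * IDF_i for every distinct word segment."""
    counts = Counter(segments)
    total = len(segments)
    n_docs = len(corpus)
    indexes = {}
    for word, count in counts.items():
        tf = count / total  # frequency of the segment in this text
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        indexes[word] = tf * idf
    return indexes


def select_feature_words(segments: List[str], corpus: List[Set[str]],
                         threshold: float) -> List[str]:
    """Keep the segments whose index is larger than the preset threshold."""
    return [w for w, td in word_indexes(segments, corpus).items() if td > threshold]


segments = ["pork", "pork", "eat", "today"]
corpus = [{"eat", "today"}, {"eat", "sleep"}, {"today", "run"}]  # hypothetical
print(select_feature_words(segments, corpus, threshold=0.5))
```

In this toy input, "pork" is frequent in the text and absent from the reference corpus, so it is the only segment whose index exceeds the threshold.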
The embodiment of the invention extracts the characteristic words of the standard text, can reduce the data volume during subsequent matching and is beneficial to improving the matching efficiency of similar texts.
The standard representation constructing module 102 is configured to construct a standard semantic representation corresponding to the standard feature word.
In an embodiment of the present invention, the standard representation construction module 102 is specifically configured to:
traversing the standard text and determining the position information of the standard feature words in the standard text;
and, according to the position information, taking the text within a preset length range before and after each standard feature word as the standard semantic representation corresponding to that standard feature word.
For example, given the standard text "Xiaoming eats braised pork at noon today", in which "braised pork" is a standard feature word, the standard text is traversed to determine the position of the standard feature word "braised pork" in the standard text. When the preset length range is 5 characters, the standard semantic representation corresponding to the standard feature word "braised pork" is determined to be "eats braised pork at noon today".
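A minimal sketch of this context-window construction is given below. It assumes character-level offsets and uses only the first occurrence of the feature word; the function name and the English sample sentence are illustrative. Because the example in the text is translated, a 5-character window on an English sentence cuts words differently than it would in the original language.

```python
from typing import Optional


def semantic_representation(text: str, feature_word: str, span: int) -> Optional[str]:
    """Return the feature word plus `span` characters on each side."""
    start = text.find(feature_word)  # position of the first occurrence
    if start == -1:
        return None                  # feature word not present in the text
    end = start + len(feature_word)
    # Clamp the left edge so a negative index never wraps around.
    return text[max(0, start - span):end + span]


TEXT = "Xiaoming eats braised pork at noon today"
print(semantic_representation(TEXT, "braised pork", 5))
```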
In the embodiment of the invention, constructing the standard semantic representation corresponding to each standard feature word places the otherwise abstract feature word back in its concrete context, which enriches the semantics of the standard feature word and improves the matching accuracy of similar texts.
The key-value pair table generating module 103 is configured to generate a standard key-value pair table according to the standard feature words and the standard semantic representation.
In this embodiment of the present invention, the key-value pair table generating module 103 is specifically configured to:
taking each of the plurality of standard feature words as a primary key in the standard key value pair table;
and taking the standard semantic representation corresponding to each standard feature word as the value of the corresponding primary key in the standard key value pair table, to obtain the standard key value pair table.
In detail, referring to fig. 4, fig. 4 is an exemplary diagram of a standard key value pair table according to an embodiment of the present invention, and in fig. 4, different standard feature words are primary keys, and a corresponding standard semantic representation can be uniquely found according to the standard feature words.
In practical application, a large number of standard texts need to be subjected to similar text matching, so that the standard feature words and the standard semantic representations are stored in the standard key value pair table in a key value pair mode, and the efficiency of subsequent similar text matching can be improved by utilizing the standard key value pair table.
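As a rough illustration, the standard key value pair table can be modeled as an in-memory dictionary keyed by feature word. The function name and the sample pairs are hypothetical, and a real deployment might use a database or key-value store instead.

```python
from typing import Dict, Iterable, Tuple


def build_standard_table(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each standard feature word (key) to its semantic representation (value)."""
    table: Dict[str, str] = {}
    for feature_word, representation in pairs:
        # Keys must be unique; keep the first representation seen for a word.
        table.setdefault(feature_word, representation)
    return table


pairs = [
    ("braised pork", "eats braised pork at noon today"),
    ("sleeping", "is sleeping in the afternoon"),      # hypothetical entry
    ("braised pork", "a later duplicate that is ignored"),
]
table = build_standard_table(pairs)
print(table["braised pork"])
```

Lookup by feature word is then a constant-time operation on average, which is the efficiency benefit the paragraph above refers to.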
The target representation construction module 104 is configured to obtain a target text, perform feature word extraction on the target text to obtain a target feature word, and construct a target semantic representation corresponding to the target feature word.
In the embodiment of the invention, the target text comprises any text needing similarity matching, and the target text is analyzed to judge whether the standard text is similar to the target text or not.
In detail, the target text may be uploaded by the user.
In the embodiment of the present invention, the step of extracting the feature words from the target text to obtain the target feature words is consistent with the step of extracting the feature words from the standard text by the feature word extraction module 101 to obtain the standard feature words, which is not described herein again.
The step of constructing the target semantic representation corresponding to the target feature word is consistent with the step of constructing the standard semantic representation corresponding to the standard feature word by the standard representation constructing module 102, and is not repeated here.
The similarity calculation module 105 is configured to calculate the similarity between the target feature word and the standard feature words in the standard key value pair table, and determine that the standard semantic representation corresponding to a standard feature word whose similarity is greater than a preset similarity threshold is a semantic representation to be matched.
In the embodiment of the present invention, the similarity calculation module 105 is specifically configured to:
calculating the similarity between the target feature word and the standard feature words in the standard key value pair table using the following similarity algorithm:
$$Sim = \mathrm{Pearson}(R, S)$$
wherein $R$ is the target feature word, $S$ is the standard feature word, $\mathrm{Pearson}$ is the similarity operation, and $Sim$ is the similarity between the target feature word and the standard feature word in the standard key value pair table.
Further, the embodiment of the present invention determines that the standard semantic representation corresponding to a standard feature word whose similarity is greater than the preset similarity threshold is a semantic representation to be matched. For example, suppose there are a target feature word A, a standard feature word B, a standard feature word C and a standard feature word D, where the similarity between A and B is 40, the similarity between A and C is 50, and the similarity between A and D is 60. When the preset similarity threshold is 55, the standard semantic representation corresponding to the standard feature word D is determined to be the semantic representation to be matched.
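The similarity screening can be sketched as follows. The text does not state how a feature word is turned into numbers, so this sketch assumes each word already has a numeric vector (the toy vectors below are hypothetical). Note also that a plain Pearson coefficient lies in [-1, 1], so the threshold here is 0.55 rather than the 55 used in the example, which presumably relies on a different scale.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (norm_x * norm_y)


def representations_to_match(target_vec: Sequence[float],
                             standard_vecs: Dict[str, Sequence[float]],
                             table: Dict[str, str],
                             threshold: float) -> List[str]:
    """Return the representations of standard words similar enough to the target."""
    return [table[word] for word, vec in standard_vecs.items()
            if pearson(target_vec, vec) > threshold]


target_a = [1.0, 2.0, 3.0]                                   # hypothetical vector for A
standard_vecs = {"B": [3.0, 2.0, 1.0], "C": [1.0, 0.0, 1.0], "D": [2.0, 4.0, 6.5]}
table = {"B": "representation of B", "C": "representation of C", "D": "representation of D"}
print(representations_to_match(target_a, standard_vecs, table, threshold=0.55))
```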
The representation matching module 106 is configured to perform representation matching on the target semantic representation and the to-be-matched semantic representation to obtain a matching probability between the target semantic representation and the standard semantic representation.
In this embodiment of the present invention, the characterization matching module 106 is specifically configured to:
performing word vector conversion on the target semantic representation to obtain a first representation vector;
performing word vector conversion on the standard semantic representation to obtain a second representation vector;
and performing a probability operation on the first representation vector and the second representation vector by using a pre-trained matching model to obtain the matching probability of the target semantic representation and the standard semantic representation.
In detail, the performing word vector transformation on the target semantic representation to obtain a first representation vector includes:
obtaining a byte vector set corresponding to the target semantic representation, wherein the byte vector set comprises byte vectors of all bytes in the target semantic representation;
and respectively splicing byte vectors corresponding to each byte in the target semantic representation to obtain the first representation vector.
For example, byte 1, byte 2, and byte 3 exist in the target semantic representation, where the byte vector corresponding to byte 1 is byte vector a, the byte vector corresponding to byte 2 is byte vector b, and the byte vector corresponding to byte 3 is byte vector c, and then the byte vectors corresponding to each byte are respectively spliced to obtain a first representation vector abc.
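The splicing step can be illustrated with the sketch below. It assumes the representation is encoded as UTF-8 bytes and that a lookup table maps every byte value to a fixed-length vector; the toy table `BYTE_VECTORS` is hypothetical and stands in for whatever learned embedding the actual system uses.

```python
from typing import Dict, List

# Hypothetical byte-vector table: one 2-dimensional vector per byte value.
BYTE_VECTORS: Dict[int, List[float]] = {
    b: [b / 255.0, (b % 16) / 15.0] for b in range(256)
}


def representation_vector(representation: str,
                          byte_vectors: Dict[int, List[float]]) -> List[float]:
    """Concatenate the vectors of all bytes, in order, into one vector."""
    vector: List[float] = []
    for byte in representation.encode("utf-8"):
        vector.extend(byte_vectors[byte])  # splice this byte's vector on
    return vector


first_vector = representation_vector("abc", BYTE_VECTORS)
print(len(first_vector))
```

With three bytes and 2-dimensional byte vectors, the spliced vector has six components, mirroring the "abc" example above.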
The step of performing word vector transformation on the standard semantic representation to obtain a second representation vector is consistent with the step of performing word vector transformation on the target semantic representation to obtain a first representation vector, and is not repeated here.
Further, the first representation vector and the second representation vector are input into the pre-trained matching model, and the matching probability between the first representation vector and the second representation vector is calculated by the matching model.
In detail, the matching model is a multi-hop model, which includes but is not limited to a CogQA model and an answeringTasks model. Using the multi-hop model as the matching model to perform the probability operation on the first representation vector and the second representation vector can improve both the efficiency of calculating the matching probability and the accuracy of the calculated matching probability.
The text screening module 107 is configured to determine that the standard text corresponding to the standard semantic representation with the matching probability greater than the preset probability threshold is a similar text of the target text.
In the embodiment of the present invention, if the matching probability is less than or equal to a preset probability threshold, it is determined that the standard text corresponding to the standard semantic representation is not the similar text of the target text, and if the matching probability is greater than the probability threshold, it is determined that the standard text corresponding to the standard semantic representation is the similar text of the target text.
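The screening rule can be sketched as follows. The pre-trained multi-hop matching model is not reproduced here; `toy_model` is a plainly labeled stand-in (cosine similarity rescaled to [0, 1]) used only so the example runs, and the vectors and threshold are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def toy_model(u: Sequence[float], v: Sequence[float]) -> float:
    """Stand-in for the matching model: cosine similarity mapped to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = dot / norm if norm else 0.0
    return (cosine + 1.0) / 2.0


def screen_similar_texts(target_vec: Sequence[float],
                         candidates: Dict[str, Sequence[float]],
                         model: Callable[[Sequence[float], Sequence[float]], float],
                         threshold: float) -> List[str]:
    """Keep standard texts whose matching probability is strictly above the threshold."""
    return [text for text, vec in candidates.items()
            if model(target_vec, vec) > threshold]


candidates = {
    "standard text A": [1.0, 0.0],
    "standard text B": [0.0, 1.0],
    "standard text C": [-1.0, 0.0],
}
print(screen_similar_texts([1.0, 0.0], candidates, toy_model, threshold=0.5))
```

Text B scores exactly 0.5 under the stand-in model and is rejected, which reflects the rule that a probability less than or equal to the threshold does not count as similar.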
By extracting the feature words of the standard text and of the target text, together with the semantic representations corresponding to those feature words, and by calculating the similarity between the feature words of the standard text and those of the target text, the embodiment of the invention uses the feature words to preliminarily screen the standard texts, which can improve the matching efficiency of similar texts. By matching the semantic representations of the standard text and the target text, the similarity judgment of the standard text is made using semantic representations that carry more semantics, which improves the matching accuracy of similar texts. Therefore, the similar text matching apparatus provided by the invention can solve the problem of low matching accuracy of similar texts.
Fig. 3 is a schematic structural diagram of an electronic device implementing a similar text matching method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a similar text matching program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the similar text matching program 12, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., similar text matching programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The similar text matching program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
acquiring a standard text, and extracting feature words of the standard text to obtain standard feature words;
constructing a standard semantic representation corresponding to the standard feature words;
generating a standard key value pair table according to the standard feature words and the standard semantic representation;
acquiring a target text, extracting feature words of the target text to obtain target feature words, and constructing a target semantic representation corresponding to the target feature words;
calculating the similarity between the target feature word and the standard feature words in the standard key value pair table, and determining the standard semantic representation corresponding to the standard feature word with the similarity larger than a preset similarity threshold as a semantic representation to be matched;
performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation;
and determining that the standard text corresponding to the standard semantic representation with the matching probability larger than a preset probability threshold is the similar text of the target text.
Specifically, for how the processor 10 implements these instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated here.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring a standard text, and extracting feature words of the standard text to obtain standard feature words;
constructing a standard semantic representation corresponding to the standard feature words;
generating a standard key value pair table according to the standard feature words and the standard semantic representation;
acquiring a target text, extracting feature words of the target text to obtain target feature words, and constructing a target semantic representation corresponding to the target feature words;
calculating the similarity between the target feature word and the standard feature words in the standard key value pair table, and determining the standard semantic representation corresponding to the standard feature word with the similarity larger than a preset similarity threshold as a semantic representation to be matched;
performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation;
and determining that the standard text corresponding to the standard semantic representation with the matching probability larger than a preset probability threshold is the similar text of the target text.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of similar text matching, the method comprising:
acquiring a standard text, and extracting feature words of the standard text to obtain standard feature words;
constructing a standard semantic representation corresponding to the standard feature words;
generating a standard key value pair table according to the standard feature words and the standard semantic representation;
acquiring a target text, extracting feature words of the target text to obtain target feature words, and constructing a target semantic representation corresponding to the target feature words;
calculating the similarity between the target feature word and the standard feature words in the standard key value pair table, and determining the standard semantic representation corresponding to the standard feature word with the similarity larger than a preset similarity threshold as a semantic representation to be matched;
performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation;
and determining that the standard text corresponding to the standard semantic representation with the matching probability larger than a preset probability threshold is the similar text of the target text.
2. The similar text matching method as claimed in claim 1, wherein said extracting feature words from said standard text to obtain standard feature words comprises:
performing word segmentation processing on the standard text to obtain a plurality of text word segments;
respectively calculating word segmentation indexes of the plurality of text word segmentations;
and screening the text word segments according to the word segmentation indexes to obtain standard feature words.
3. The similar text matching method of claim 2, wherein the performing word segmentation processing on the standard text to obtain a plurality of text word segments comprises:
deleting stop words contained in the standard text by using a preset stop word bank;
and performing word segmentation processing on the standard text after the stop words are deleted by utilizing a preset standard word bank to obtain a plurality of text word segments.
4. The similar text matching method as in claim 1, wherein the constructing of the standard semantic representation corresponding to the standard feature words comprises:
traversing the standard text and determining the position information of the standard feature words in the standard text;
and, according to the position information, taking the text within a preset length range before and after each standard feature word as the standard semantic representation corresponding to that standard feature word.
5. The similar text matching method of claim 1, wherein the generating a standard key value pair table according to the standard feature words and the standard semantic representation comprises:
taking each of the plurality of standard feature words as a primary key in the standard key value pair table;
and taking the standard semantic representation corresponding to each standard feature word as the value of the corresponding primary key in the standard key value pair table, to obtain the standard key value pair table.
6. The similar text matching method according to any one of claims 1 to 5, wherein the performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation comprises:
performing word vector conversion on the target semantic representation to obtain a first representation vector;
performing word vector conversion on the standard semantic representation to obtain a second representation vector;
and performing a probability operation on the first representation vector and the second representation vector by using a pre-trained matching model to obtain the matching probability of the target semantic representation and the standard semantic representation.
7. The similar text matching method of claim 6, wherein the performing word vector transformation on the target semantic representation to obtain a first representation vector comprises:
obtaining a byte vector set corresponding to the target semantic representation, wherein the byte vector set comprises byte vectors of all bytes in the target semantic representation;
and respectively splicing byte vectors corresponding to each byte in the target semantic representation to obtain the first representation vector.
8. A similar text matching apparatus, said apparatus comprising:
the feature word extraction module is used for acquiring a standard text and extracting feature words of the standard text to obtain standard feature words;
the standard representation construction module is used for constructing a standard semantic representation corresponding to the standard feature words;
the key value pair table generating module is used for generating a standard key value pair table according to the standard feature words and the standard semantic representation;
the target representation construction module is used for acquiring a target text, extracting feature words of the target text to obtain target feature words, and constructing a target semantic representation corresponding to the target feature words;
the similarity calculation module is used for calculating the similarity between the target feature word and the standard feature words in the standard key value pair table, and determining the standard semantic representation corresponding to the standard feature word with the similarity larger than a preset similarity threshold as a semantic representation to be matched;
the representation matching module is used for performing representation matching on the target semantic representation and the semantic representation to be matched to obtain the matching probability of the target semantic representation and the standard semantic representation;
and the text screening module is used for determining that the standard text corresponding to the standard semantic representation with the matching probability larger than a preset probability threshold is the similar text of the target text.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the similar text matching method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a similar text matching method according to any one of claims 1 to 7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011435054.2A CN112541338A (en) 2020-12-10 2020-12-10 Similar text matching method and device, electronic equipment and computer storage medium
PCT/CN2021/083714 WO2022121171A1 (en) 2020-12-10 2021-03-30 Similar text matching method and apparatus, and electronic device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011435054.2A CN112541338A (en) 2020-12-10 2020-12-10 Similar text matching method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN112541338A (en) 2021-03-23

Family

ID=75019869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011435054.2A Pending CN112541338A (en) 2020-12-10 2020-12-10 Similar text matching method and device, electronic equipment and computer storage medium

Country Status (2)

Country Link
CN (1) CN112541338A (en)
WO (1) WO2022121171A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883730A (en) * 2021-03-25 2021-06-01 平安国际智慧城市科技股份有限公司 Similar text matching method and device, electronic equipment and storage medium
CN113158683A (en) * 2021-04-15 2021-07-23 平安国际智慧城市科技股份有限公司 Important item reminding method and device, electronic equipment and computer storage medium
CN113486266A (en) * 2021-06-29 2021-10-08 平安银行股份有限公司 Page label adding method, device, equipment and storage medium
WO2022121171A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Similar text matching method and apparatus, and electronic device and computer storage medium
CN115934880A (en) * 2022-10-31 2023-04-07 永道工程咨询有限公司 Construction of project cost document database and search method of project cost document

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186775B (en) * 2022-09-13 2022-12-16 北京远鉴信息技术有限公司 Method and device for detecting matching degree of image description characters and electronic equipment
CN115545001B (en) * 2022-11-29 2023-04-07 支付宝(杭州)信息技术有限公司 Text matching method and device
CN115879901B (en) * 2023-02-22 2023-07-28 陕西湘秦衡兴科技集团股份有限公司 Intelligent personnel self-service platform
CN116932767B (en) * 2023-09-18 2023-12-12 江西农业大学 Text classification method, system, storage medium and computer based on knowledge graph
CN117371435B (en) * 2023-10-09 2024-04-05 北京睿企信息科技有限公司 Data processing system for acquiring hot words with fluctuation of heat

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840321B (en) * 2017-11-29 2022-02-01 腾讯科技(深圳)有限公司 Text recommendation method and device and electronic equipment
CN109165291B (en) * 2018-06-29 2021-07-09 厦门快商通信息技术有限公司 Text matching method and electronic equipment
CN111639502A (en) * 2020-05-26 2020-09-08 深圳壹账通智能科技有限公司 Text semantic matching method and device, computer equipment and storage medium
CN111898643B (en) * 2020-07-01 2024-02-23 上海依图信息技术有限公司 Semantic matching method and device
CN112541338A (en) * 2020-12-10 2021-03-23 平安科技(深圳)有限公司 Similar text matching method and device, electronic equipment and computer storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121171A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Similar text matching method and apparatus, and electronic device and computer storage medium
CN112883730A (en) * 2021-03-25 2021-06-01 平安国际智慧城市科技股份有限公司 Similar text matching method and device, electronic equipment and storage medium
CN113158683A (en) * 2021-04-15 2021-07-23 平安国际智慧城市科技股份有限公司 Important item reminding method and device, electronic equipment and computer storage medium
CN113486266A (en) * 2021-06-29 2021-10-08 平安银行股份有限公司 Page label adding method, device, equipment and storage medium
CN113486266B (en) * 2021-06-29 2024-05-21 平安银行股份有限公司 Page label adding method, device, equipment and storage medium
CN115934880A (en) * 2022-10-31 2023-04-07 永道工程咨询有限公司 Construction of project cost document database and search method of project cost document

Also Published As

Publication number Publication date
WO2022121171A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
CN112541338A (en) Similar text matching method and device, electronic equipment and computer storage medium
CN112667800A (en) Keyword generation method and device, electronic equipment and computer storage medium
WO2022160449A1 (en) Text classification method and apparatus, electronic device, and storage medium
CN111460797B (en) Keyword extraction method and device, electronic equipment and readable storage medium
CN113157927B (en) Text classification method, apparatus, electronic device and readable storage medium
CN113378970B (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN112507230B (en) Webpage recommendation method and device based on browser, electronic equipment and storage medium
CN114979120B (en) Data uploading method, device, equipment and storage medium
CN113886708A (en) Product recommendation method, device, equipment and storage medium based on user information
CN115238670B (en) Information text extraction method, device, equipment and storage medium
CN112667775A (en) Keyword prompt-based retrieval method and device, electronic equipment and storage medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN113344125B (en) Long text matching recognition method and device, electronic equipment and storage medium
CN111475600B (en) Data management method, device and computer readable storage medium
CN113806492A (en) Record generation method, device and equipment based on semantic recognition and storage medium
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN112632264A (en) Intelligent question and answer method and device, electronic equipment and storage medium
CN115409041B (en) Unstructured data extraction method, device, equipment and storage medium
CN115438048A (en) Table searching method, device, equipment and storage medium
CN112733537A (en) Text duplicate removal method and device, electronic equipment and computer readable storage medium
CN113434413A (en) Data testing method, device and equipment based on data difference and storage medium
CN112506931A (en) Data query method and device, electronic equipment and storage medium
CN111414452A (en) Search word matching method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination