WO2022121171A1 - Similar text matching method and apparatus, electronic device, and computer storage medium - Google Patents
- Publication number: WO2022121171A1
- Application number: PCT/CN2021/083714
- Authority: WO, WIPO (PCT)
- Prior art keywords
- standard
- text
- semantic representation
- target
- word
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- The present application relates to the technical field of speech semantics, and in particular to a similar text matching method, apparatus, electronic device, and computer-readable storage medium.
- The inventor realizes that current similar text matching methods are mostly keyword-based: keywords are extracted from each text, the keywords of different texts are compared to obtain their degree of coincidence, and the similarity between the texts is judged from that degree of coincidence.
- However, because the wording of different texts is inconsistent, this method often cannot accurately match texts similar to the target text. How to improve the matching accuracy of similar texts has therefore become an urgent problem to be solved.
- A similar text matching method, including:
- determining that the standard text corresponding to a standard semantic representation whose matching probability is greater than a preset probability threshold is a text similar to the target text.
- A similar text matching device, including:
- a feature word extraction module configured to obtain standard text and extract feature words from the standard text to obtain standard feature words;
- a standard representation building module configured to construct a standard semantic representation corresponding to the standard feature word;
- a key-value pair table generating module configured to generate a standard key-value pair table according to the standard feature word and the standard semantic representation;
- a target representation building module configured to obtain the target text, perform feature word extraction on the target text to obtain the target feature word, and construct the target semantic representation corresponding to the target feature word;
- a similarity calculation module configured to calculate the similarity between the target feature word and the standard feature word in the standard key-value pair table, and to determine that the standard semantic representation corresponding to a standard feature word whose similarity is greater than the preset similarity threshold is a to-be-matched semantic representation;
- a representation matching module configured to perform representation matching between the target semantic representation and the to-be-matched semantic representation to obtain a matching probability between the target semantic representation and the standard semantic representation;
- a text screening module configured to determine that the standard text corresponding to a standard semantic representation whose matching probability is greater than a preset probability threshold is a text similar to the target text.
- An electronic device, comprising:
- a processor that executes the instructions stored in the memory to implement the following steps:
- determining that the standard text corresponding to a standard semantic representation whose matching probability is greater than a preset probability threshold is a text similar to the target text.
- A computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the following steps:
- determining that the standard text corresponding to a standard semantic representation whose matching probability is greater than a preset probability threshold is a text similar to the target text.
- the present application can solve the problem of low matching accuracy of similar texts.
- FIG. 1 is a schematic flowchart of a similar text matching method provided by an embodiment of the present application.
- FIG. 2 is a functional block diagram of a similar text matching apparatus provided by an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of an electronic device for implementing the similar text matching method provided by an embodiment of the present application.
- FIG. 4 is an example diagram of a standard key-value pair table in an embodiment of the present application.
- the embodiment of the present application provides a similar text matching method.
- The method may be executed by, without limitation, at least one electronic device that can be configured to execute the method provided by the embodiments of the present application, such as a server or a terminal.
- In other words, the similar text matching method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
- the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
- the similar text matching method includes:
- The standard text is any text, for example news text, a paragraph of a novel, or the text of a paper.
- The standard text can be obtained from a blockchain node that stores standard text by using a Python statement with a data capture function; the high data throughput of the blockchain node can improve the efficiency of acquiring the standard text.
- Performing feature word extraction on the standard text to obtain standard feature words includes:
- screening the plurality of text word segmentations according to the word segmentation index to obtain the standard feature words.
- Performing word segmentation processing on the standard text to obtain multiple text segmentations includes:
- the preset stop thesaurus and the preset standard thesaurus are both thesauri containing multiple word segmentations.
- The preset stop thesaurus stores the word segmentations of multiple stop words, for example, “Sur” and “Ruci”.
- The preset standard thesaurus contains multiple non-stop word segmentations, for example, “eat” and “sleep”.
- By performing word segmentation processing, the embodiment of the present application can divide a relatively long standard text into multiple word segmentations.
- Analyzing and processing the word segmentations is more efficient and accurate than processing the standard text directly.
- the word segmentation index refers to an index that can reflect the importance of word segmentation, for example, a frequency index indicating the frequency of occurrence of word segmentation, a weight index indicating the weight of word segmentation, and the like.
- Calculating the word segmentation index of each word segmentation in the plurality of text segmentations by using the index algorithm includes:
- using the following index algorithm to calculate the word segmentation index of each word segmentation in the plurality of text segmentations:
- TF_i is the frequency with which word segmentation i occurs in the multiple text segmentations;
- IDF_i is the inverse frequency of word segmentation i in the multiple text segmentations.
- The multiple text segmentations are screened by comparing their word segmentation indexes: the text segmentations whose word segmentation index is greater than a preset index threshold are selected as standard feature words.
- Extracting feature words from the standard text reduces the amount of data in subsequent matching and helps improve the matching efficiency of similar texts.
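The extraction steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation. The whitespace tokenizer, the stop-word set, the reference corpus used for the inverse frequency, the smoothing in the IDF term, and the threshold are all assumptions. The index formula is not reproduced in this text, so the sketch assumes the index is the product TF_i × IDF_i.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the preset stop thesaurus is not given in the source.
STOP_WORDS = {"the", "a", "is", "of", "and", "in"}


def segment(text):
    """Split text into lowercase word segments and drop stop words.

    Whitespace splitting stands in for thesaurus-based word segmentation.
    """
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf_scores(segments, corpus_segments):
    """Return {segment: TF * IDF}, assuming the index is TF_i * IDF_i."""
    counts = Counter(segments)
    total = len(segments)
    n_docs = len(corpus_segments)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        doc_freq = sum(1 for doc in corpus_segments if word in doc)
        # Smoothed inverse frequency (an assumed variant).
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[word] = tf * idf
    return scores


def extract_feature_words(text, corpus, threshold):
    """Keep the segments whose index is greater than the preset threshold."""
    segments = segment(text)
    corpus_segments = [set(segment(doc)) for doc in corpus]
    scores = tf_idf_scores(segments, corpus_segments)
    return {w for w, s in scores.items() if s > threshold}
```

With a small reference corpus, frequent but distinctive segments score highest, and raising the threshold shrinks the set of feature words carried into later matching.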
- Constructing the standard semantic representation corresponding to the standard feature word includes:
- using the text within a preset length range before and after the standard feature word as the standard semantic representation corresponding to that standard feature word.
- By constructing the standard semantic representation, the embodiment of the present application makes concrete the text that was abstracted into the standard feature word, which enriches the semantics of the standard feature word and helps improve the accuracy of similar text matching.
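A context window of this kind might look like the sketch below. Measuring the preset length in characters and including the feature word itself in the window are assumptions; the function name and the `window` parameter are hypothetical.

```python
def build_semantic_representations(text, feature_word, window=10):
    """Return the text within `window` characters before and after each
    occurrence of `feature_word` (the feature word itself is included)."""
    representations = []
    start = text.find(feature_word)
    while start != -1:
        left = max(0, start - window)
        right = min(len(text), start + len(feature_word) + window)
        representations.append(text[left:right])
        # Continue searching after the current occurrence.
        start = text.find(feature_word, start + 1)
    return representations
```

A feature word that occurs several times yields several windows; how those are merged into a single representation is left open here.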
- Generating the standard key-value pair table according to the standard feature word and the standard semantic representation includes:
- using the multiple standard feature words respectively as primary keys in the standard key-value pair table; and
- using the standard semantic representation corresponding to each standard feature word as the value of the corresponding primary key, to obtain the standard key-value pair table.
- FIG. 4 is an example diagram of a standard key-value pair table in the embodiment of the present application.
- In the table, different standard feature words are primary keys, and the standard semantic representation corresponding to a standard feature word can be uniquely found according to that standard feature word.
- Storing standard feature words and standard semantic representations as key-value pairs in the standard key-value pair table improves the efficiency of subsequent similar text matching.
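As a rough illustration, the table can be modeled as a dictionary keyed by feature word. The function name and the duplicate-key check are assumptions added to reflect the uniqueness of primary keys; the source does not specify a storage format.

```python
def build_key_value_table(feature_words, representations):
    """Map each standard feature word (primary key) to its standard
    semantic representation (value)."""
    if len(feature_words) != len(representations):
        raise ValueError("each feature word needs exactly one representation")
    table = {}
    for word, representation in zip(feature_words, representations):
        if word in table:
            # Primary keys must be unique so that lookups are unambiguous.
            raise ValueError(f"duplicate primary key: {word!r}")
        table[word] = representation
    return table
```

Looking up `table[word]` then returns the one representation stored for that feature word in constant expected time.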
- The target text is any text for which similar texts need to be matched; it is analyzed to determine whether a standard text is similar to it.
- The target text can be uploaded by the user.
- The steps of extracting feature words from the target text to obtain target feature words are the same as the steps of extracting feature words from the standard text in step S1 to obtain standard feature words, and are not repeated here.
- The step of constructing the target semantic representation corresponding to the target feature word is the same as the step of constructing the standard semantic representation corresponding to the standard feature word in step S2, and is not repeated here.
- Calculating the similarity between the target feature word and the standard feature word in the standard key-value pair table includes:
- R is the target feature word
- S is the standard feature word
- Pearson is the similarity operation
- Sim is the similarity between the target feature word and the standard feature word in the standard key-value pair table.
- the embodiment of the present application determines that the standard semantic representation corresponding to the standard feature word whose similarity is greater than the preset similarity threshold is the semantic representation to be matched.
- For example, if the similarity between target feature word A and standard feature word B is 40, the similarity between A and standard feature word C is 50, the similarity between A and standard feature word D is 60, and the preset similarity threshold is 55, then only the standard semantic representation corresponding to standard feature word D is determined to be a to-be-matched semantic representation.
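The similarity step can be sketched as below. The source names Pearson as the similarity operation but does not reproduce the formula or say how a feature word becomes a numeric vector, so the sketch assumes each feature word has already been encoded as a vector and that Sim is the Pearson correlation of the two vectors. The scale of the example scores (40, 50, 60) is also unspecified, so the filter simply takes precomputed scores.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def select_candidates(similarities, table, threshold):
    """Return the representations whose feature-word similarity exceeds the threshold.

    `similarities` maps a standard feature word to its similarity score;
    `table` is the standard key-value pair table.
    """
    return {word: table[word] for word, sim in similarities.items() if sim > threshold}
```

Applied to the example scores with a threshold of 55, only the entry for standard feature word D is kept, so only its representation goes on to representation matching.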
- Performing the representation matching between the target semantic representation and the to-be-matched semantic representation to obtain a matching probability between the target semantic representation and the standard semantic representation includes:
- performing a probability operation on the first representation vector and the second representation vector by using a pre-trained matching model to obtain the matching probability between the target semantic representation and the standard semantic representation.
- Performing word vector transformation on the target semantic representation to obtain the first representation vector includes:
- the byte vector set includes a byte vector for each byte in the target semantic representation;
- the byte vectors corresponding to the bytes in the target semantic representation are spliced together to obtain the first representation vector.
- For example, byte 1, byte 2, and byte 3 exist in the target semantic representation, where the byte vector corresponding to byte 1 is byte vector a, the byte vector corresponding to byte 2 is byte vector b, and the byte vector corresponding to byte 3 is byte vector c; splicing the byte vectors corresponding to each byte then yields the first representation vector abc.
- the steps of converting the standard semantic representation to word vectors to obtain the second representation vector are the same as the steps of converting the target semantic representation to word vectors to obtain the first representation vector, which will not be repeated here.
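The splicing step can be illustrated as follows. The byte-vector table here is a deterministic placeholder invented for the example; in practice the byte vectors would come from a trained embedding, which the source does not describe. Treating "bytes" as the UTF-8 bytes of the representation is likewise an assumption.

```python
def byte_vector_table(dim=4):
    """Hypothetical byte vector set: one fixed `dim`-dimensional vector per byte value."""
    return {b: [((b * (k + 1)) % 256) / 255.0 for k in range(dim)] for b in range(256)}


def to_representation_vector(representation, table):
    """Look up the vector of each UTF-8 byte and splice the vectors in order."""
    vector = []
    for b in representation.encode("utf-8"):
        vector.extend(table[b])
    return vector
```

For a representation of n bytes and byte vectors of dimension d, the spliced representation vector has n × d components, mirroring the "abc" example above.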
- The embodiment of the present application inputs the first representation vector and the second representation vector into a pre-trained matching model, and uses the matching model to calculate the matching probability between the first representation vector and the second representation vector.
- The matching model adopts a multi-hop model.
- Multi-hop models include, but are not limited to, the CogQA model and the AnsweringTasks model.
- Using the multi-hop model as the matching model to perform the probability operation on the first representation vector and the second representation vector can improve the efficiency of calculating the matching probability and helps improve the accuracy of the calculated matching probability.
- If the matching probability is less than or equal to the preset probability threshold, it is determined that the standard text corresponding to the standard semantic representation is not a text similar to the target text; if the matching probability is greater than the probability threshold, it is determined that the standard text corresponding to the standard semantic representation is a text similar to the target text.
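The final screening can be sketched with the matching model passed in as a callable. The `cosine_probability` function below is only a stand-in so that the example runs; it is not a multi-hop model and is not meant to approximate CogQA. All names are hypothetical.

```python
import math


def cosine_probability(u, v):
    """Stand-in for the pre-trained matching model (NOT a multi-hop model):
    cosine similarity of zero-padded vectors, rescaled to [0, 1]."""
    n = max(len(u), len(v))
    u = list(u) + [0.0] * (n - len(u))
    v = list(v) + [0.0] * (n - len(v))
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 0.0
    return (dot / norm + 1.0) / 2.0


def screen_similar_texts(target_vector, candidates, match_model, threshold):
    """Return the ids of standard texts whose matching probability exceeds the threshold.

    `candidates` maps a standard-text id to its second representation vector.
    """
    similar = []
    for text_id, standard_vector in candidates.items():
        if match_model(target_vector, standard_vector) > threshold:
            similar.append(text_id)
    return similar
```

Note that a probability exactly equal to the threshold is rejected, matching the "less than or equal to" rule stated above.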
- the similar text matching method proposed in this application can solve the problem of low matching accuracy of similar texts.
- As shown in FIG. 2, a functional block diagram of a similar text matching apparatus is provided by an embodiment of the present application.
- the similar text matching apparatus 100 described in this application can be installed in an electronic device.
- The similar text matching apparatus 100 may include a feature word extraction module 101, a standard representation building module 102, a key-value pair table generating module 103, a target representation building module 104, a similarity calculation module 105, a representation matching module 106, and a text screening module 107.
- the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of the electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
- The functions of each module/unit are as follows:
- the feature word extraction module 101 is used for obtaining standard text, and extracting feature words from the standard text to obtain standard feature words.
- The standard text is any text, for example news text, a paragraph of a novel, or the text of a paper.
- The standard text can be obtained from a blockchain node that stores standard text by using a Python statement with a data capture function; the high data throughput of the blockchain node can improve the efficiency of acquiring the standard text.
- The feature word extraction module 101 is specifically used for:
- screening the plurality of text word segmentations according to the word segmentation index to obtain the standard feature words.
- Performing word segmentation processing on the standard text to obtain multiple text segmentations includes:
- the preset stop thesaurus and the preset standard thesaurus are both thesauri containing multiple word segmentations.
- The preset stop thesaurus stores the word segmentations of multiple stop words, for example, “Sur” and “Ruci”.
- The preset standard thesaurus contains multiple non-stop word segmentations, for example, “eat” and “sleep”.
- By performing word segmentation processing, the embodiment of the present application can divide a relatively long standard text into multiple word segmentations.
- Analyzing and processing the word segmentations is more efficient and accurate than processing the standard text directly.
- the word segmentation index refers to an index that can reflect the importance of word segmentation, for example, a frequency index indicating the frequency of occurrence of word segmentation, a weight index indicating the weight of word segmentation, and the like.
- Calculating the word segmentation index of each word segmentation in the plurality of text segmentations by using the index algorithm includes:
- using the following index algorithm to calculate the word segmentation index of each word segmentation in the plurality of text segmentations:
- TF_i is the frequency with which word segmentation i occurs in the multiple text segmentations;
- IDF_i is the inverse frequency of word segmentation i in the multiple text segmentations.
- The embodiment of the present application screens the multiple text segmentations by comparing their word segmentation indexes: the text segmentations whose word segmentation index is greater than a preset index threshold are selected as standard feature words.
- Extracting feature words from the standard text reduces the amount of data in subsequent matching and helps improve the matching efficiency of similar texts.
- the standard representation building module 102 is configured to construct a standard semantic representation corresponding to the standard feature word.
- The standard representation building module 102 is specifically used for:
- using the text within a preset length range before and after the standard feature word as the standard semantic representation corresponding to that standard feature word.
- By constructing the standard semantic representation, the embodiment of the present application makes concrete the text that was abstracted into the standard feature word, which enriches the semantics of the standard feature word and helps improve the accuracy of similar text matching.
- the key-value pair table generating module 103 is configured to generate a standard key-value pair table according to the standard feature word and the standard semantic representation.
- the key-value pair table generating module 103 is specifically used for:
- using the multiple standard feature words respectively as primary keys in the standard key-value pair table; and
- using the standard semantic representation corresponding to each standard feature word as the value of the corresponding primary key, to obtain the standard key-value pair table.
- FIG. 4 is an example diagram of a standard key-value pair table in the embodiment of the present application.
- In the table, different standard feature words are primary keys, and the standard semantic representation corresponding to a standard feature word can be uniquely found according to that standard feature word.
- Storing standard feature words and standard semantic representations as key-value pairs in the standard key-value pair table improves the efficiency of subsequent similar text matching.
- the target representation building module 104 is configured to acquire target text, extract feature words from the target text, obtain target feature words, and construct target semantic representations corresponding to the target feature words.
- The target text is any text for which similar texts need to be matched; it is analyzed to determine whether a standard text is similar to it.
- The target text can be uploaded by the user.
- The step of extracting feature words from the target text to obtain target feature words is consistent with the step in which the feature word extraction module 101 extracts feature words from the standard text to obtain standard feature words, and is not repeated here.
- The step of constructing the target semantic representation corresponding to the target feature word is consistent with the step in which the standard representation building module 102 constructs the standard semantic representation corresponding to the standard feature word, and is not repeated here.
- The similarity calculation module 105 is used to calculate the similarity between the target feature word and the standard feature word in the standard key-value pair table, and to determine that the standard semantic representation corresponding to a standard feature word whose similarity is greater than a preset similarity threshold is the to-be-matched semantic representation.
- the similarity calculation module 105 is specifically used for:
- R is the target feature word
- S is the standard feature word
- Pearson is the similarity operation
- Sim is the similarity between the target feature word and the standard feature word in the standard key-value pair table.
- the embodiment of the present application determines that the standard semantic representation corresponding to the standard feature word whose similarity is greater than the preset similarity threshold is the semantic representation to be matched.
- For example, if the similarity between target feature word A and standard feature word B is 40, the similarity between A and standard feature word C is 50, the similarity between A and standard feature word D is 60, and the preset similarity threshold is 55, then only the standard semantic representation corresponding to standard feature word D is determined to be a to-be-matched semantic representation.
- the representation matching module 106 is configured to perform representation matching between the target semantic representation and the to-be-matched semantic representation to obtain a matching probability between the target semantic representation and the standard semantic representation.
- The representation matching module 106 is specifically used for:
- performing a probability operation on the first representation vector and the second representation vector by using a pre-trained matching model to obtain the matching probability between the target semantic representation and the standard semantic representation.
- Performing word vector transformation on the target semantic representation to obtain the first representation vector includes:
- the byte vector set includes a byte vector for each byte in the target semantic representation;
- the byte vectors corresponding to the bytes in the target semantic representation are spliced together to obtain the first representation vector.
- For example, byte 1, byte 2, and byte 3 exist in the target semantic representation, where the byte vector corresponding to byte 1 is byte vector a, the byte vector corresponding to byte 2 is byte vector b, and the byte vector corresponding to byte 3 is byte vector c; splicing the byte vectors corresponding to each byte then yields the first representation vector abc.
- the steps of converting the standard semantic representation to word vectors to obtain the second representation vector are the same as the steps of converting the target semantic representation to word vectors to obtain the first representation vector, which will not be repeated here.
- The embodiment of the present application inputs the first representation vector and the second representation vector into a pre-trained matching model, and uses the matching model to calculate the matching probability between the first representation vector and the second representation vector.
- The matching model adopts a multi-hop model.
- Multi-hop models include, but are not limited to, the CogQA model and the AnsweringTasks model.
- Using the multi-hop model as the matching model to perform the probability operation on the first representation vector and the second representation vector can improve the efficiency of calculating the matching probability and helps improve the accuracy of the calculated matching probability.
- the text screening module 107 is configured to determine that the standard text corresponding to the standard semantic representation whose matching probability is greater than a preset probability threshold is a similar text to the target text.
- If the matching probability is less than or equal to the preset probability threshold, it is determined that the standard text corresponding to the standard semantic representation is not a text similar to the target text; if the matching probability is greater than the probability threshold, it is determined that the standard text corresponding to the standard semantic representation is a text similar to the target text.
- the similar text matching device proposed in the present application can solve the problem of low matching accuracy of similar texts.
- As shown in FIG. 3, a schematic structural diagram of an electronic device for implementing the similar text matching method is provided by an embodiment of the present application.
- the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a similar text matching program 12.
- The memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, CD, etc.
- In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1.
- In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) equipped on the electronic device 1.
- the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
- the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the similar text matching program 12, etc., but also can be used to temporarily store data that has been output or will be output.
- In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple integrated circuits packaged with the same function or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.
- The processor 10 is the control core (Control Unit) of the electronic device. It connects the components of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device 1 and processes data by running or executing programs or modules stored in the memory 11 (such as the similar text matching program) and calling data stored in the memory 11.
- The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
- the bus can be divided into address bus, data bus, control bus and so on.
- the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
- FIG. 3 only shows an electronic device with components. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than those shown in the figure, a combination of certain components, or a different arrangement of components.
- The electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
- the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
- the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
- the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
- the electronic device 1 may further include a user interface, which may be a display (Display) or an input unit (e.g., a keyboard (Keyboard)). Optionally, the user interface may also be a standard wired interface or a wireless interface.
- the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, and the like.
- the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
- the similar text matching program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run by the processor 10, can implement:
- determining that the standard text corresponding to a standard semantic representation whose matching probability is greater than a preset probability threshold is a text similar to the target text.
- the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
- the computer-readable storage medium may be volatile or non-volatile.
- the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM, Read-Only Memory).
- the present application also provides a computer-readable storage medium.
- the computer-readable storage medium can be either volatile or non-volatile.
- the readable storage medium stores a computer program which, when executed by a processor of an electronic device, can implement:
- determining that the standard text corresponding to a standard semantic representation whose matching probability is greater than a preset probability threshold is a text similar to the target text.
- modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
- the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- Blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
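
The idea of data blocks linked by cryptographic methods can be illustrated with a minimal hash chain. This is only a sketch under stated assumptions: the function names (`block_hash`, `append_block`, `is_valid`), the use of SHA-256, and the block layout are illustrative choices, not details taken from the application, and consensus and networking are omitted entirely.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder "previous hash" for the first block


def block_hash(index, payload, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a block whose hash covers the hash of the preceding block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    block = {
        "index": index,
        "payload": payload,  # e.g. a batch of standard texts (hypothetical use)
        "prev_hash": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash and check each link to the preceding block."""
    prev_hash = GENESIS_PREV
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["payload"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


chain = []
append_block(chain, ["standard text A"])
append_block(chain, ["standard text B"])
print(is_valid(chain))
```

Because each hash covers the previous block's hash, changing the payload of an earlier block makes `is_valid` fail for the whole chain, which is the anti-counterfeiting property the description alludes to.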
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a similar text matching method, comprising the steps of: acquiring a standard text, performing feature word extraction on the acquired standard text, and constructing a standard semantic representation according to the extraction result; generating a standard key-value pair table according to a standard feature word and the standard semantic representation (S3); performing feature word extraction on an acquired target text, and constructing a target semantic representation; calculating the similarity between a target feature word and the standard feature word, and screening, according to the similarity, a semantic representation to be matched; performing representation matching on the semantic representation to be matched and the standard semantic representation, so as to obtain a matching probability; and determining the standard text corresponding to the standard semantic representation whose matching probability is greater than a preset probability threshold as a text similar to the target text (S7). In addition, the present invention also relates to blockchain technology: the standard text may be stored in a node of a blockchain. The present invention further relates to a similar text matching apparatus, an electronic device, and a computer-readable storage medium. By means of the present application, the problem of relatively low accuracy in similar text matching can be solved.
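
The flow summarized in the abstract can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: whitespace tokenization with a small stop-word list stands in for feature word extraction, a bag-of-words count vector stands in for the semantic representation, Jaccard overlap stands in for the feature-word similarity, and cosine similarity stands in for the matching probability. The function names and both thresholds are hypothetical. The sketch also follows one reading of the screening step, in which standard entries are screened by feature-word similarity and the surviving candidates are then scored against the target representation.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the application does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "for", "and", "in", "how"}


def feature_words(text):
    """Stand-in feature word extraction: lowercase tokens minus stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def representation(words):
    """Stand-in semantic representation: a bag-of-words count vector."""
    return Counter(words)


def build_table(standard_texts):
    """Key-value table: feature-word set -> (standard text, representation)."""
    table = {}
    for text in standard_texts:
        words = feature_words(text)
        table[frozenset(words)] = (text, representation(words))
    return table


def jaccard(a, b):
    """Overlap between two feature-word sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two count vectors, in [0, 1] for non-negative counts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def match(target_text, table, sim_threshold=0.2, prob_threshold=0.5):
    """Screen candidates by feature-word similarity, then keep high-scoring ones."""
    words = feature_words(target_text)
    target_rep = representation(words)
    target_set = frozenset(words)
    results = []
    for key, (text, rep) in table.items():
        if jaccard(target_set, key) < sim_threshold:
            continue  # screened out before the more expensive matching step
        score = cosine(target_rep, rep)
        if score > prob_threshold:
            results.append((text, score))
    return sorted(results, key=lambda r: r[1], reverse=True)


table = build_table(
    ["how to cook braised pork", "price of pork today", "weather forecast for tomorrow"]
)
print(match("cook braised pork ribs", table))
```

With these toy inputs, only "how to cook braised pork" passes both the screening step and the threshold; the other two standard texts share too few feature words with the target and are never scored. Keying the table by feature-word set means two standard texts with identical feature words would collide, so a fuller version would map each key to a list of entries.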
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011435054.2A CN112541338A (zh) | 2020-12-10 | 2020-12-10 | Similar text matching method and apparatus, electronic device, and computer storage medium |
CN202011435054.2 | 2020-12-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022121171A1 (fr) | 2022-06-16 |
Family
ID=75019869
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/083714 WO2022121171A1 (fr) | 2020-12-10 | 2021-03-30 | Similar text matching method and apparatus, and electronic device and computer storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112541338A (fr) |
WO (1) | WO2022121171A1 (fr) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
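CN115186775A (zh) * | 2022-09-13 | 2022-10-14 | 北京远鉴信息技术有限公司 | Method and apparatus for detecting the matching degree of image description text, and electronic device |
CN115545001A (zh) * | 2022-11-29 | 2022-12-30 | 支付宝(杭州)信息技术有限公司 | Text matching method and apparatus |
CN115879901A (zh) * | 2023-02-22 | 2023-03-31 | 陕西湘秦衡兴科技集团股份有限公司 | Intelligent personnel self-service platform |
CN116932767A (zh) * | 2023-09-18 | 2023-10-24 | 江西农业大学 | Knowledge-graph-based text classification method, system, storage medium, and computer |
CN117371435A (zh) * | 2023-10-09 | 2024-01-09 | 北京睿企信息科技有限公司 | Data processing system for acquiring hot words whose popularity fluctuates |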
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
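CN112541338A (zh) * | 2020-12-10 | 2021-03-23 | 平安科技(深圳)有限公司 | Similar text matching method and apparatus, electronic device, and computer storage medium |
CN112883730B (zh) * | 2021-03-25 | 2023-01-17 | 平安国际智慧城市科技股份有限公司 | Similar text matching method and apparatus, electronic device, and storage medium |
CN113158683A (zh) * | 2021-04-15 | 2021-07-23 | 平安国际智慧城市科技股份有限公司 | Important matter reminder method and apparatus, electronic device, and computer storage medium |
CN113486266B (zh) * | 2021-06-29 | 2024-05-21 | 平安银行股份有限公司 | Page tag adding method, apparatus, device, and storage medium |
CN115934880A (zh) * | 2022-10-31 | 2023-04-07 | 永道工程咨询有限公司 | Method for constructing an engineering cost document database and retrieving engineering cost documents |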
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165291A (zh) * | 2018-06-29 | 2019-01-08 | 厦门快商通信息技术有限公司 | Text matching method and electronic device |
US20200242304A1 (en) * | 2017-11-29 | 2020-07-30 | Tencent Technology (Shenzhen) Company Limited | Text recommendation method and apparatus, and electronic device |
CN111639502A (zh) * | 2020-05-26 | 2020-09-08 | 深圳壹账通智能科技有限公司 | Text semantic matching method and apparatus, computer device, and storage medium |
CN111898643A (zh) * | 2020-07-01 | 2020-11-06 | 上海依图信息技术有限公司 | Semantic matching method and apparatus |
CN112541338A (zh) * | 2020-12-10 | 2021-03-23 | 平安科技(深圳)有限公司 | Similar text matching method and apparatus, electronic device, and computer storage medium |
-
2020
- 2020-12-10 CN CN202011435054.2A patent/CN112541338A/zh active Pending
-
2021
- 2021-03-30 WO PCT/CN2021/083714 patent/WO2022121171A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200242304A1 (en) * | 2017-11-29 | 2020-07-30 | Tencent Technology (Shenzhen) Company Limited | Text recommendation method and apparatus, and electronic device |
CN109165291A (zh) * | 2018-06-29 | 2019-01-08 | 厦门快商通信息技术有限公司 | Text matching method and electronic device |
CN111639502A (zh) * | 2020-05-26 | 2020-09-08 | 深圳壹账通智能科技有限公司 | Text semantic matching method and apparatus, computer device, and storage medium |
CN111898643A (zh) * | 2020-07-01 | 2020-11-06 | 上海依图信息技术有限公司 | Semantic matching method and apparatus |
CN112541338A (zh) * | 2020-12-10 | 2021-03-23 | 平安科技(深圳)有限公司 | Similar text matching method and apparatus, electronic device, and computer storage medium |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
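CN115186775A (zh) * | 2022-09-13 | 2022-10-14 | 北京远鉴信息技术有限公司 | Method and apparatus for detecting the matching degree of image description text, and electronic device |
CN115186775B (zh) * | 2022-09-13 | 2022-12-16 | 北京远鉴信息技术有限公司 | Method and apparatus for detecting the matching degree of image description text, and electronic device |
CN115545001A (zh) * | 2022-11-29 | 2022-12-30 | 支付宝(杭州)信息技术有限公司 | Text matching method and apparatus |
CN115545001B (zh) * | 2022-11-29 | 2023-04-07 | 支付宝(杭州)信息技术有限公司 | Text matching method and apparatus |
CN115879901A (zh) * | 2023-02-22 | 2023-03-31 | 陕西湘秦衡兴科技集团股份有限公司 | Intelligent personnel self-service platform |
CN115879901B (zh) * | 2023-02-22 | 2023-07-28 | 陕西湘秦衡兴科技集团股份有限公司 | Intelligent personnel self-service platform |
CN116932767A (zh) * | 2023-09-18 | 2023-10-24 | 江西农业大学 | Knowledge-graph-based text classification method, system, storage medium, and computer |
CN116932767B (zh) * | 2023-09-18 | 2023-12-12 | 江西农业大学 | Knowledge-graph-based text classification method, system, storage medium, and computer |
CN117371435A (zh) * | 2023-10-09 | 2024-01-09 | 北京睿企信息科技有限公司 | Data processing system for acquiring hot words whose popularity fluctuates |
CN117371435B (zh) * | 2023-10-09 | 2024-04-05 | 北京睿企信息科技有限公司 | Data processing system for acquiring hot words whose popularity fluctuates |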
Also Published As
Publication number | Publication date |
---|---|
CN112541338A (zh) | 2021-03-23 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2022121171A1 (fr) | Similar text matching method and apparatus, and electronic device and computer storage medium | |
WO2022134759A1 (fr) | Keyword generation method and apparatus, and electronic device and computer storage medium | |
WO2022142593A1 (fr) | Text classification method and apparatus, electronic device, and readable storage medium | |
WO2022160449A1 (fr) | Text classification method and apparatus, electronic device, and storage medium | |
WO2019174132A1 (fr) | Data processing method, server, and computer storage medium | |
WO2020108063A1 (fr) | Feature word determination method, apparatus, and server | |
CN110532347B (zh) | Log data processing method, apparatus, device, and storage medium | |
WO2022116435A1 (fr) | Title generation method and apparatus, electronic device, and storage medium | |
WO2022160454A1 (fr) | Medical literature retrieval method and apparatus, electronic device, and storage medium | |
WO2022222943A1 (fr) | Department recommendation method and apparatus, electronic device, and storage medium | |
WO2022222300A1 (fr) | Open relation extraction method and apparatus, electronic device, and storage medium | |
WO2022142020A1 (fr) | Information push method and apparatus, electronic device, and computer-readable storage medium | |
WO2022134355A1 (fr) | Keyword-prompt-based search method and apparatus, electronic device, and storage medium | |
US9213759B2 (en) | System, apparatus, and method for executing a query including boolean and conditional expressions | |
WO2022142106A1 (fr) | Text analysis method and apparatus, electronic device, and readable storage medium | |
WO2022121172A1 (fr) | Text error correction method and apparatus, electronic device, and computer-readable storage medium | |
WO2022179122A1 (fr) | Data storage method and apparatus using big data, and electronic device and storage medium | |
CN113434542B (zh) | Data relationship identification method and apparatus, electronic device, and storage medium | |
CN113722600A (zh) | Data query method, apparatus, device, and product applied to big data | |
CN113282854A (zh) | Data request response method and apparatus, electronic device, and storage medium | |
WO2022141860A1 (fr) | Text deduplication method and apparatus, electronic device, and computer-readable storage medium | |
CN113434413B (zh) | Data-difference-based data testing method, apparatus, device, and storage medium | |
WO2022141867A1 (fr) | Speech recognition method and apparatus, electronic device, and readable storage medium | |
WO2022141838A1 (fr) | Model confidence analysis method and apparatus, electronic device, and computer storage medium | |
WO2022134345A1 (fr) | File access method, apparatus, device, and readable storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21901887; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21901887; Country of ref document: EP; Kind code of ref document: A1 |