CN113204953A - Text matching method and device based on semantic recognition and device readable storage medium

Text matching method and device based on semantic recognition and device readable storage medium

Info

Publication number
CN113204953A
CN113204953A
Authority
CN
China
Prior art keywords
similarity, text, modules, module, structural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110587203.5A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Honghuoyi Intelligent Technology Co ltd
Original Assignee
Wuhan Honghuoyi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Honghuoyi Intelligent Technology Co ltd filed Critical Wuhan Honghuoyi Intelligent Technology Co ltd
Priority to CN202110587203.5A priority Critical patent/CN113204953A/en
Publication of CN113204953A publication Critical patent/CN113204953A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/194 - Calculation of difference between files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing


Abstract

An embodiment of the invention provides a text matching method based on semantic recognition, an electronic device and a computer-readable storage medium, and relates to the technical field of computer applications. The text matching method comprises the following steps: obtaining structural models of a text to be matched and a target text, wherein each structural model comprises a plurality of structural modules and each structural module comprises a keyword and its corresponding sentence set; for each structural module of the structural model of the text to be matched, identifying a corresponding structural module from the structural model of the target text so as to construct module groups; and generating a matching result of the text to be matched according to the similarity between the modules in each module group. By building the structural model of a text from keywords and their corresponding sentence sets and matching texts on that basis, the method can make full use of the structural information of the text when performing the matching task, and can provide higher-quality matching results for long texts.

Description

Text matching method and device based on semantic recognition and device readable storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a text matching method, an electronic device, and a computer-readable storage medium.
Background
With the rapid growth of the mobile internet in recent years, a large number of applications and self-media platforms built on content distribution services have emerged, and they play an increasingly important role in people's daily life. Common content platforms such as Toutiao, WeChat Official Accounts and Tiantian Kuaibao provide vast user groups with timely, massive and diverse digital rich-media content at any time, and have profoundly changed the ways and channels through which people obtain information. These platforms not only serve users' search queries with relevant content, but also actively recommend their own content to users, so as to better satisfy users' potential needs, keep users engaged and increase the frequency with which the corresponding software is used, ultimately raising the software's daily active users.
Most of the content provided by these platforms uses text as its main carrier, and users efficiently acquire information and opinions by reading it. When a user finishes reading an article, the software platform often actively recommends other articles on the same or similar topics according to the user's interests and habits. Judging whether the main contents of two articles concern the same or similar subjects, that is, judging the relationship between a pair of articles, can be summarized as a text semantic matching task in the field of natural language processing.
With the development of text semantic matching technology, the sequence length of the texts to be matched has gradually shifted from short texts to long texts. Although a great deal of work on short-text matching tasks has achieved excellent results by designing better models of the similarity between two sequences, as the text length changes markedly, the original short-text matching algorithms cannot obtain satisfactory results when long texts are fed in directly.
Disclosure of Invention
An object of embodiments of the present invention is to provide a text matching method, an electronic device, and a computer-readable storage medium, so as to solve the above problems in the prior art. The specific technical scheme is as follows:
In one aspect of the present invention, a text matching method is provided. Specifically, the method comprises the following steps: obtaining structural models of a text to be matched and a target text, wherein each structural model comprises a plurality of structural modules and each structural module comprises a keyword and its corresponding sentence set; for each structural module of the structural model of the text to be matched, identifying a corresponding structural module from the structural model of the target text so as to construct module groups; and generating a matching result of the text to be matched according to the similarity between the modules in each module group.
In another aspect of the embodiments of the invention, an electronic device is also provided. Specifically, the electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory is used for storing a computer program; and the processor is used for implementing the above text matching method when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium. In particular, the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the text matching method described above.
According to the text matching method, the electronic device and the computer-readable storage medium provided by the embodiments of the invention, a structural model of a text is built from keywords and their corresponding sentence sets and used to match texts; the structural information of the text can therefore be fully exploited when performing the matching task, and higher-quality matching results can be provided for long texts.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the description below are some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a text matching method according to embodiment 1 of the present invention;
fig. 2 is a flowchart of a text matching method according to embodiment 2 of the present invention;
fig. 3 is a flowchart of a text matching method according to embodiment 3 of the present invention;
FIG. 4 illustrates one embodiment of the process S240 illustrated in FIG. 2;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Various aspects of the invention are described in detail below with reference to the figures and the detailed description. Well-known processes, program modules, elements and their interconnections, links, communications or operations, among others, are not shown or described in detail herein in various embodiments of the invention.
Also, the described features, architectures, or functions may be combined in any manner in one or more embodiments.
Furthermore, it should be understood by those skilled in the art that the following embodiments are illustrative only and are not intended to limit the scope of the present invention. Those of skill would further appreciate that the program modules, elements, or steps of the various embodiments described herein and illustrated in the figures may be combined and designed in a wide variety of different configurations.
Technical terms not specifically described in the present specification should be construed in the broadest sense in the art unless otherwise specifically indicated.
Some of the flows described in this specification, in the claims and in the above figures contain operations that appear in a particular order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein, or in parallel. Labels such as S10 and S11 merely distinguish different operations; the labels themselves do not imply any order of execution. In addition, the flows may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should also be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules and the like; they neither imply a sequential order nor require that the "first" and "second" items be of different types.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
[ embodiment 1 ]
Fig. 1 is a flowchart of a text matching method according to embodiment 1 of the method of the present invention. Referring to fig. 1, in the present embodiment, the method includes:
s110: and acquiring a structural model in the text to be matched and the target text.
Wherein the structural model comprises a plurality of structural modules; the structural module includes: keywords and their corresponding sentence sets.
S120: and aiming at each structural module of the structural model of the text to be matched, identifying a corresponding structural module from the structural model of the target text respectively so as to construct a model block group.
For example, for each structural module in the structural model of the text to be matched, the module similarity between that module and each structural module in the structural model of the target text is calculated. The two modules with the highest similarity are combined into a module group. Then, from the remaining modules, the two with the highest similarity are selected to form the next module group, and this process is repeated until every module has been assigned to a module group (all modules are grouped).
Wherein the module similarity refers to the similarity between two or more modules.
Of course, in other implementations of this embodiment, the modules may be grouped in other ways; for example, modules with the same module number in different structural models may be assigned to the same group (the modules being numbered according to their weights).
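The greedy pairing described in the example above can be sketched as follows. This is a minimal illustration only: the module layout (a list of dicts) and the module_similarity callback are assumptions for the sketch, not details fixed by the patent.

```python
from typing import Callable, Dict, List, Tuple

def group_modules(
    modules_a: List[Dict],
    modules_b: List[Dict],
    module_similarity: Callable[[Dict, Dict], float],
) -> List[Tuple[int, int]]:
    """Greedily pair modules of the text to be matched (modules_a) with
    modules of the target text (modules_b): repeatedly take the remaining
    pair with the highest module similarity until one side is exhausted."""
    remaining_a = set(range(len(modules_a)))
    remaining_b = set(range(len(modules_b)))
    groups: List[Tuple[int, int]] = []
    while remaining_a and remaining_b:
        i, j = max(
            ((a, b) for a in remaining_a for b in remaining_b),
            key=lambda pair: module_similarity(modules_a[pair[0]], modules_b[pair[1]]),
        )
        groups.append((i, j))
        remaining_a.discard(i)
        remaining_b.discard(j)
    return groups
```

In practice the pairwise similarities would be computed once and cached rather than re-evaluated inside max on every iteration; the quadratic pair scan is kept here for clarity.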
S130: and generating a matching result of the text to be matched according to the similarity between the modules in each module group.
In the embodiment, the structure model of the text is constructed based on the keywords and the corresponding sentence sets to match the texts, so that the matching task can be executed by fully utilizing the structure information of the text, and a text matching result with higher quality can be provided for a long text.
[ embodiment 2 ]
Fig. 2 is a flowchart of a text matching method according to embodiment 2 of the method of the present invention. Referring to fig. 2, in the present embodiment, the method includes:
s210: and acquiring a structural model in the text to be matched and the target text.
Wherein the structural model comprises a plurality of structural modules; the structural module includes: keywords and their corresponding sentence sets.
S220: and respectively calculating the module similarity (similarity between the modules) of each structure module in the structure model of the text to be matched and each structure module in the structure model of the target text.
S230: and identifying corresponding structural modules from the structural models of the target texts aiming at the structural modules of the structural models of the texts to be matched based on the module similarity so as to construct a model block group.
For example, the two modules with the highest similarity are grouped into a module group. Then, from the remaining modules, two modules with the highest similarity are selected to form a module group, and this process is repeated until each module is assigned to a certain module group (all modules are completely grouped).
S240: and generating a matching result of the text to be matched according to the module similarity of each module group (the similarity between the modules in the module group).
In the present embodiment, the module similarity is calculated by: calculating the similarity of the keywords between the modules as a first sub-similarity; calculating the similarity of the sentence subsets between the modules as a second sub-similarity; and generating the similarity between the modules based on the weighted average of the first sub-similarity and the second sub-similarity.
For example, if the similarity between the module i and the module j is calculated, the similarity between the keyword of the module i and the keyword of the module j is calculated as a first sub-similarity between the module i and the module j, the similarity between the sentence set of the module i and the sentence set of the module j is calculated as a second sub-similarity between the module i and the module j, and the similarity between the module i and the module j is generated based on a weighted average value of the first sub-similarity and the second sub-similarity.
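A minimal sketch of this weighted-average module similarity follows; the module layout and the two sub-similarity callbacks are illustrative assumptions, since the patent does not fix a particular data structure.

```python
from typing import Callable, Dict, Iterable

def module_similarity(
    module_i: Dict,
    module_j: Dict,
    keyword_similarity: Callable[[str, str], float],
    sentence_set_similarity: Callable[[Iterable[str], Iterable[str]], float],
    w_first: float = 0.5,
    w_second: float = 0.5,
) -> float:
    """Weighted average of the first sub-similarity (between keywords) and
    the second sub-similarity (between sentence sets)."""
    first_sub = keyword_similarity(module_i["keyword"], module_j["keyword"])
    second_sub = sentence_set_similarity(module_i["sentences"], module_j["sentences"])
    return (w_first * first_sub + w_second * second_sub) / (w_first + w_second)
```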
According to the embodiment, the similarity between the modules is generated based on the weighted average of the first sub-similarity between the keywords and the second sub-similarity between the sentence sets, so that the similarity calculation result between the modules is more accurate.
[ embodiment 3 ]
Fig. 3 is a flowchart of a text matching method according to embodiment 3 of the method of the present invention. Referring to fig. 3, in the present embodiment, the method includes:
s310: and respectively constructing a structure model of the text to be matched and the target text.
Wherein the structural model comprises a plurality of structural modules; the structural module includes: keywords and their corresponding sentence sets.
S320: and acquiring a structural model in the text to be matched and the target text.
For example, the structural model in the text to be matched and the target text is read or received from a designated storage terminal.
S330: and aiming at each structural module of the structural model of the text to be matched, identifying a corresponding structural module from the structural model of the target text respectively so as to construct a model block group.
In some embodiments of the present embodiment, first, for each structure module in the structure model of the text to be matched, the module similarity between the structure module and each structure module in the structure model of the target text is calculated respectively; secondly, based on the module similarity, aiming at each structure module of the structure model of the text to be matched, identifying a corresponding structure module from the structure model of the target text respectively so as to construct a model block group. For example, the two modules with the highest similarity are grouped into a module group. Then, from the remaining modules, two modules with the highest similarity are selected to form a module group, and this process is repeated until each module is assigned to a certain module group (all modules are completely grouped).
Wherein, the module similarity refers to the similarity between modules.
S340: and generating a matching result of the text to be matched according to the similarity between the modules in each module group.
In this embodiment, the structural model of a text is constructed as follows: extracting keywords from the text; assigning sentences to keywords based on the relevance between the sentences and the keywords in the text, wherein the one or more sentences assigned to the same keyword form the sentence set corresponding to that keyword; defining each keyword and its corresponding sentence set as a structural module; and integrating all the structural modules to construct the structural model.
For example, given an article D, the named entities and keywords in the article, collectively referred to as keywords, are extracted with the TextRank algorithm. The sentences in article D are then divided up by calculating their relevance to the extracted keywords, so that each keyword v corresponds to a sentence set s(v). When calculating the relevance between a sentence and a keyword, for example, the sentence and the keyword are vectorized with the TF-IDF (term frequency-inverse document frequency) algorithm, the relevance is represented by the cosine similarity of the two vectors, and the sentence is finally assigned to the keyword with the highest relevance. Each keyword in article D and its corresponding sentence set are defined as a structural module, and all the structural modules together constitute the structural model of article D. For the same text, each sentence belongs to exactly one keyword, each keyword belongs to exactly one structural module, and the sentences belonging to the same module form the sentence set of the keyword in that module. A minimal sketch of this construction is given below.
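The following sketch uses scikit-learn for the TF-IDF vectorization and assumes the keywords have already been extracted (for example by TextRank); the function and variable names are illustrative, not the patent's own.

```python
from typing import Dict, List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_structural_model(sentences: List[str], keywords: List[str]) -> Dict[str, List[str]]:
    """Assign every sentence to the single keyword it is most relevant to
    (TF-IDF vectors + cosine similarity) and return one structural module
    per keyword as {keyword: sentence set}."""
    vectorizer = TfidfVectorizer()
    # Fit keywords and sentences together so they share one vocabulary.
    matrix = vectorizer.fit_transform(keywords + sentences)
    keyword_vectors = matrix[: len(keywords)]
    sentence_vectors = matrix[len(keywords):]
    relevance = cosine_similarity(sentence_vectors, keyword_vectors)  # shape: (n_sentences, n_keywords)

    model: Dict[str, List[str]] = {kw: [] for kw in keywords}
    for sentence, row in zip(sentences, relevance):
        model[keywords[row.argmax()]].append(sentence)  # one sentence belongs to one keyword
    return model
```

Here model[v] plays the role of the sentence set s(v) in the description above.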
In this embodiment, the association between keywords and sentences is established on the basis of similarity, so the semantics of the keywords in the text can be determined more accurately and the influence of linguistic ambiguity on the matching result is reduced.
[ embodiment 4 ]
The text matching method provided by this embodiment includes all the contents in embodiment 2 or embodiment 3, and is not described herein again.
In this embodiment, before the similarity between the modules is generated based on the weighted average of the first sub-similarity and the second sub-similarity, the method further comprises: for each of the two modules whose module similarity is being calculated, calculating the relevance between the keyword it contains and its sentence set; comparing the average of these relevances (the relevance between keyword and sentence set) of the two modules with a set threshold; and if the average is less than or equal to the set threshold, setting the weight of the first sub-similarity to be smaller than the weight of the second sub-similarity.
For example, when the similarity between module i and module j is calculated, the relevance a between the keyword and the sentence set in module i and the relevance b between the keyword and the sentence set in module j are calculated; the average of a and b is compared with the set threshold, and if it is less than or equal to the set threshold, the weight of the first sub-similarity is set to be smaller than the weight of the second sub-similarity.
[ embodiment 5 ]
The text matching method provided by this embodiment includes all the contents in embodiment 4, and is not described herein again. In the present embodiment, if the average of the above-described degrees of correlation (degrees of correlation between keywords and sentence sets) of the two modules performing the module similarity calculation is greater than a set threshold, the weight of the first sub-similarity is set to be equal to the weight of the second sub-similarity.
For example, if the average of a and b above is greater than the set threshold, the weight of the first sub-similarity is set to be equal to the weight of the second sub-similarity.
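A sketch of the weight selection in embodiments 4 and 5 follows; the relevance callback, the threshold value and the concrete weights (0.3/0.7 and 0.5/0.5) are assumptions for illustration, since the patent only states the ordering of the weights.

```python
from typing import Callable, Dict, Iterable, Tuple

def choose_sub_weights(
    module_i: Dict,
    module_j: Dict,
    keyword_sentence_relevance: Callable[[str, Iterable[str]], float],
    threshold: float = 0.5,
) -> Tuple[float, float]:
    """Return (weight of first sub-similarity, weight of second sub-similarity).
    When the keywords are, on average, only weakly related to their own
    sentence sets, the keyword similarity is trusted less."""
    a = keyword_sentence_relevance(module_i["keyword"], module_i["sentences"])
    b = keyword_sentence_relevance(module_j["keyword"], module_j["sentences"])
    if (a + b) / 2 <= threshold:
        return 0.3, 0.7   # first sub-similarity weighted less than the second
    return 0.5, 0.5       # equal weights otherwise
```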
[ embodiment 6 ]
The text matching method provided in this embodiment includes all the contents of any one of embodiment 2 to embodiment 5 (each of embodiment 2 to embodiment 5 is modified separately), and is not described again here.
In this embodiment, during the module similarity calculation, the keyword similarity between modules (the first sub-similarity) and the sentence-set similarity between modules (the second sub-similarity) are each calculated in several ways. They are calculated with the TF-IDF algorithm to obtain the TF-IDF cosine similarity between modules; with the TF (term frequency) algorithm to obtain the TF cosine similarity between modules; with the BM25 algorithm, which is generally used for search relevance scoring (in brief: perform morpheme analysis on the query text to generate a keyword sequence, compute the relevance score of each keyword qi with respect to each search result d, and take a weighted sum of these scores as the relevance score of the query and d), to obtain the BM25 cosine similarity between modules; with the Jaccard algorithm, which evaluates the similarity of two sets as the size of their intersection divided by the size of their union, to obtain the Jaccard similarity between modules; and with the Ochiai algorithm, which evaluates the similarity of two sets as the size of their intersection divided by the geometric mean of the sizes of the two sets, to obtain the Ochiai similarity between modules.
Accordingly, the modules are grouped based on the average of the TF-IDF cosine similarity, TF cosine similarity, BM25 cosine similarity, Jaccard similarity and Ochiai similarity between modules. For example, the two modules with the highest average similarity are combined into a module group; then, from the remaining modules, the two with the highest average are selected to form the next module group, and this process is repeated until every module has been assigned to a module group (all modules are grouped).
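Minimal token-level sketches of four of the five measures follow (TF cosine, TF-IDF cosine, Jaccard and Ochiai); BM25 is omitted for brevity, and the IDF table is assumed to be precomputed over the corpus. These are illustrative implementations, not the patent's own code.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tf_cosine(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine similarity of raw term-frequency vectors."""
    return _cosine(Counter(tokens_a), Counter(tokens_b))

def tfidf_cosine(tokens_a: List[str], tokens_b: List[str], idf: Dict[str, float]) -> float:
    """Cosine similarity of TF-IDF weighted vectors (idf precomputed)."""
    def weigh(tokens: List[str]) -> Counter:
        return Counter({t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()})
    return _cosine(weigh(tokens_a), weigh(tokens_b))

def jaccard(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union of the token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def ochiai(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Size of the intersection divided by the geometric mean of the two set sizes."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0
```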
[ embodiment 7 ]
The text matching method provided by this embodiment includes all the contents in embodiment 6, and is not described herein again. As shown in fig. 4, in the present embodiment, the process S240 is implemented by:
s241: and generating the similarity score of the text to be matched based on the similarity between the modules in each module group.
S242: and generating a matching result of the text to be matched according to the similarity score of the text to be matched.
In this embodiment, a similarity score of the text to be matched is generated based on the similarity between the modules in each module group according to the following formula:
score = Σ_{i=1}^{n} w_i (c_t t_i + c_s s_i + c_m m_i + c_k k_i + c_j j_i)
wherein score is the similarity score; n is the number of module groups; w_i is the weight parameter of the i-th module group; t_i is the TF-IDF cosine similarity between the modules in the i-th module group; s_i is the TF cosine similarity between the modules in the i-th module group; m_i is the BM25 cosine similarity between the modules in the i-th module group; k_i is the Jaccard similarity between the modules in the i-th module group; j_i is the Ochiai similarity between the modules in the i-th module group; and c_t, c_s, c_m, c_k and c_j are the weighting parameters of the TF-IDF cosine similarity, the TF cosine similarity, the BM25 cosine similarity, the Jaccard similarity and the Ochiai similarity, respectively.
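A sketch of the scoring step follows, assuming the score is the weighted sum reconstructed above; each entry of groups carries the weight and the five precomputed similarities of one module group, under an assumed dictionary layout.

```python
from typing import Dict, List

def similarity_score(groups: List[Dict[str, float]],
                     c_t: float, c_s: float, c_m: float,
                     c_k: float, c_j: float) -> float:
    """score = sum_i w_i * (c_t*t_i + c_s*s_i + c_m*m_i + c_k*k_i + c_j*j_i)."""
    return sum(
        g["w"] * (c_t * g["t"] + c_s * g["s"] + c_m * g["m"]
                  + c_k * g["k"] + c_j * g["j"])
        for g in groups
    )
```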
[ embodiment 8 ]
The text matching method provided by this embodiment includes all the contents in embodiment 7, and is not described herein again.
In this embodiment, before generating the similarity score of the text to be matched based on the similarity between the modules in each module group, the method further includes: and determining the weight parameter of the module group based on the average value of the number of elements of the sentence sets in the module group.
For example, for a module group i that includes module A and module B, the average number of elements in the sentence sets of module group i is (N_A + N_B)/2, where N_A and N_B are the numbers of sentences in the sentence sets of module A and module B, respectively.
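A sketch of this weight determination follows; normalising the per-group averages so that they sum to one is an added assumption, since the embodiment only states that the weight is based on the average number of sentences in the group's sentence sets.

```python
from typing import Dict, List, Tuple

def group_weights(groups: List[Tuple[Dict, Dict]]) -> List[float]:
    """Weight of each module group = average sentence count of its two
    modules (here normalised across all groups)."""
    raw = [(len(a["sentences"]) + len(b["sentences"])) / 2 for a, b in groups]
    total = sum(raw)
    return [r / total if total else 0.0 for r in raw]
```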
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, comprising a processor 101, a communication interface 102, a memory 103 and a communication bus 104, where the processor 101, the communication interface 102 and the memory 103 communicate with each other through the communication bus 104,
a memory 103 for storing a computer program;
the processor 101 is configured to implement the text matching method according to any one of embodiments 1 to 8 when executing the program stored in the memory 103.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP) and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment provided by the present invention, a computer-readable storage medium is further provided, in which instructions are stored, and when the instructions are executed, the text matching method described in any one of the above embodiments 1 to 8 can be implemented.
In another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the text matching method described in any of embodiments 1 to 8 above.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of text matching, the method comprising:
obtaining a structural model of a text to be matched and a target text, wherein the structural model comprises a plurality of structural modules, and the structural modules comprise: keywords and corresponding sentence sets;
aiming at each structural module of the structural model of the text to be matched, identifying a corresponding structural module from the structural model of the target text respectively so as to construct module groups;
and generating a matching result of the text to be matched according to the similarity between the modules in each module group.
2. The method according to claim 1, before generating the matching result of the text to be matched according to the similarity between the modules in each module group, the method further comprises:
calculating the similarity between the modules in each module group by the following processes:
calculating the similarity of the keywords between the modules as a first sub-similarity;
calculating the similarity of the sentence subsets between the modules as a second sub-similarity;
and generating the similarity between the modules based on the weighted average of the first sub-similarity and the second sub-similarity.
3. The method of claim 1, wherein prior to obtaining the structural models of the text to be matched and the target text, the method further comprises:
respectively constructing the structural models of the text to be matched and the target text by the following processes:
extracting key words from the text;
dividing sentences based on the correlation degree between the sentences and the keywords in the text, wherein one or more sentences divided into the same keyword form a sentence set corresponding to the keyword;
defining each keyword and a sentence set corresponding to the keyword as a structure module;
integrating all the structural modules to construct the structural model.
4. The method of claim 2, wherein before generating the similarity between modules based on the weighted average of the first sub-similarity and the second sub-similarity, the method further comprises:
for each module in the module group, calculating the relevance between the keyword it contains and its sentence set;
comparing the average value of the relevances of the module group with a set threshold;
and if the average value is less than or equal to the set threshold, setting the weight of the first sub-similarity to be smaller than the weight of the second sub-similarity.
5. The method of claim 4, further comprising:
and if the average value is greater than the set threshold, setting the weight of the first sub-similarity to be equal to the weight of the second sub-similarity.
6. The method according to claim 2, wherein the first sub-similarity and the second sub-similarity are calculated sequentially through a word frequency-inverse document frequency TF-IDF algorithm, a word frequency TF algorithm, a BM25 algorithm, a Jaccard algorithm, and an Ochiai algorithm, respectively, to obtain TF-IDF cosine similarity, TF cosine similarity, BM25 cosine similarity, Jaccard similarity, and Ochiai similarity between modules.
7. The method of claim 6, wherein generating the matching result of the text to be matched according to the similarity between the modules in each module group comprises:
generating a similarity score of the text to be matched based on the similarity between the modules in each module group according to the following formula:
score = Σ_{i=1}^{n} w_i (c_t t_i + c_s s_i + c_m m_i + c_k k_i + c_j j_i)
generating a matching result of the text to be matched according to the similarity score of the text to be matched;
wherein score is the similarity score; n is the number of module groups; w_i is the weight parameter of the i-th module group; t_i is the TF-IDF cosine similarity between the modules in the i-th module group; s_i is the TF cosine similarity between the modules in the i-th module group; m_i is the BM25 cosine similarity between the modules in the i-th module group; k_i is the Jaccard similarity between the modules in the i-th module group; j_i is the Ochiai similarity between the modules in the i-th module group; and c_t, c_s, c_m, c_k and c_j are the weighting parameters of the TF-IDF cosine similarity, the TF cosine similarity, the BM25 cosine similarity, the Jaccard similarity and the Ochiai similarity, respectively.
8. The method of claim 7, wherein before generating the similarity score for the text to be matched based on the similarity between the modules within each of the module groups, the method further comprises:
and determining the weight parameter of the module group based on the average value of the number of elements of the sentence sets in the module group.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 8 when executing a program stored in the memory.
10. A computer storage medium storing one or more computer instructions which, when executed, are capable of implementing the method of any one of claims 1 to 8.
CN202110587203.5A 2021-05-27 2021-05-27 Text matching method and device based on semantic recognition and device readable storage medium Withdrawn CN113204953A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110587203.5A CN113204953A (en) 2021-05-27 2021-05-27 Text matching method and device based on semantic recognition and device readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110587203.5A CN113204953A (en) 2021-05-27 2021-05-27 Text matching method and device based on semantic recognition and device readable storage medium

Publications (1)

Publication Number Publication Date
CN113204953A true CN113204953A (en) 2021-08-03

Family

ID=77023269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110587203.5A Withdrawn CN113204953A (en) 2021-05-27 2021-05-27 Text matching method and device based on semantic recognition and device readable storage medium

Country Status (1)

Country Link
CN (1) CN113204953A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN115935195A (en) * 2022-11-08 2023-04-07 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
CN115935195B (en) * 2022-11-08 2023-08-08 华院计算技术(上海)股份有限公司 Text matching method and device, computer readable storage medium and terminal
WO2024098636A1 (en) * 2022-11-08 2024-05-16 华院计算技术(上海)股份有限公司 Text matching method and apparatus, computer-readable storage medium, and terminal
CN116127942A (en) * 2023-02-17 2023-05-16 北京思前软件有限公司 Text comparison method, device, equipment and storage medium
CN116127942B (en) * 2023-02-17 2024-02-13 北京思前软件有限公司 Text comparison method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20210803