CN114692594A - Text similarity recognition method and device, electronic equipment and readable storage medium

Info

Publication number: CN114692594A
Application number: CN202210401749.1A
Authority: CN
Inventors: 王哲
Original assignee: Shanghai Himalaya Technology Co ltd
Current assignee: Shanghai Himalaya Technology Co ltd
Priority date: 2022-04-18
Filing date: 2022-04-18
Publication date: 2022-07-01

Abstract

The invention provides a text similarity recognition method and device, electronic equipment and a readable storage medium. The text similarity recognition method comprises the following steps: acquiring a first text and a second text, and determining the common substrings corresponding to the first text and the second text; determining, according to the word orders of the first text and the second text, the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters other than the common substrings in the first text and the second text; and determining the character string similarity between the first text and the second text according to the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters. The invention not only takes the word order of the texts into account, but also uses the degree to which the common substrings and the remaining single characters contribute to the character string similarity of the first text and the second text, thereby improving the accuracy of the character string similarity so that the similarity finally obtained meets the expectation of the user.

Description

Text similarity recognition method and device, electronic equipment and readable storage medium

Technical Field

The invention relates to the technical field of search, in particular to a text similarity identification method and device, electronic equipment and a readable storage medium.

Background

Text similarity analysis is a key research problem in both NLP and search, and is widely applied in scenarios such as text mining, text clustering, search recall and search ranking. Character string similarity plays an important role in text similarity analysis.

The mature existing methods for calculating character string similarity are the edit distance algorithm (Levenshtein distance) and the Jaccard similarity coefficient. The Jaccard similarity coefficient ignores word order and is therefore only suitable for text patterns that are insensitive to word order. The edit distance algorithm is based on single-character insertion, deletion and modification operations at the tail of a character string and only considers the absolute difference between the character strings, so the error between its result and the real similarity is large.

Therefore, how to provide a text similarity recognition method that is compatible with various text patterns and yields highly accurate similarity results is a problem to be solved.

Disclosure of Invention

An objective of the present invention is to provide a method, an apparatus, an electronic device and a readable storage medium for similarity recognition, which are compatible with various text modes and improve the accuracy of similarity of character strings.

In a first aspect, the present invention provides a text similarity recognition method, where the method includes: acquiring a first text and a second text, and determining the common substrings corresponding to the first text and the second text; determining, according to the word orders of the first text and the second text, the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters other than the common substrings in the first text and the second text; and determining the character string similarity between the first text and the second text according to the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters.

In a second aspect, the present invention provides a text similarity recognition apparatus, including: an acquisition module, configured to acquire a first text and a second text and determine the common substrings corresponding to the first text and the second text; a determining module, configured to determine, according to the word orders of the first text and the second text, the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters other than the common substrings in the first text and the second text; and a recognition module, configured to determine the character string similarity between the first text and the second text according to the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters.

In a third aspect, the invention provides an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being capable of executing the computer program to implement the method of the first aspect.

In a fourth aspect, the invention provides a readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.

The invention provides a similarity recognition method and device, electronic equipment and a readable storage medium, wherein the method comprises the following steps: acquiring a first text and a second text, and determining the common substrings corresponding to the first text and the second text; determining, according to the word orders of the first text and the second text, the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters other than the common substrings in the first text and the second text; and determining the character string similarity between the first text and the second text according to the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters. By obtaining the common substrings corresponding to the first text and the second text and combining them with the word orders of the two texts, the method and the device comprehensively consider the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters other than the common substrings. The character string similarity finally obtained therefore not only takes the word order of the texts into account, but also uses the degree to which the common substrings and the remaining single characters contribute to the character string similarity of the first text and the second text, which improves the accuracy of the character string similarity so that it meets the expectation of the user.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a scene diagram provided in an embodiment of the present invention;

Fig. 2 is a block diagram of an electronic device according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a text similarity query method according to an embodiment of the present invention;

Fig. 4 is a schematic flowchart of step S302 according to an embodiment of the present invention;

Fig. 5 is a diagram illustrating an edit sequence of a common substring according to an embodiment of the present invention;

Fig. 6 is a diagram illustrating an edit sequence of single characters according to the present invention;

Fig. 7 is a diagram illustrating a relationship between the number of blocking areas and a blocking value according to an embodiment of the present invention;

Fig. 8 is a flowchart illustrating another text similarity recognition method according to an embodiment of the present invention;

Fig. 9 is a flowchart illustrating another text similarity recognition method according to an embodiment of the present invention;

Fig. 10 is a functional block diagram of a text similarity recognition apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.

In the description of the present invention, it should be noted that if the terms "upper", "lower", "inside", "outside", etc. indicate an orientation or a positional relationship based on that shown in the drawings or that the product of the present invention is used as it is, this is only for convenience of description and simplification of the description, and it does not indicate or imply that the device or the element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.

Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.

Referring to fig. 1, fig. 1 is a diagram of a scenario provided by an embodiment of the present invention, which may be, but is not limited to: text mining, text clustering, search recalls, search ordering, and the like.

Referring to Fig. 1, the application environment involves a terminal 110 and a server 120, which are connected through a network. A user may access a search engine through the terminal 110, and the server 120 may be the server on which the search engine resides. The terminal 110 or the server 120 may obtain the query content "query_1" input by the user and search for query_1. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.

The query_1 input by the user represents the question that the user wants to retrieve, and is a summarized and refined keyword related to a specific question. After a search is initiated, the search engine recalls documents according to the query_1 input by the user and returns a search result page, which includes document lists for queries similar to query_1, such as "query_a", "query_b", "query_c" and "query_d". The retrieved documents are returned to the user as a document set, and the user can click the documents of interest to consult them, so as to meet the query requirement.

In the above scenarios, text similarity analysis is a key research problem. Text similarity analysis techniques can be divided into two broad categories. The first category is similarity calculation based on supervised, self-supervised or semi-supervised learning, in which similarity is measured by the cosine similarity of vectors encoded by BERT; it can take both the literal similarity and the semantic similarity of character strings into account. The second category is unsupervised similarity calculation, whose common methods are the Jaccard similarity coefficient of character strings and the edit distance of character strings (distance and similarity are inversely related).

Although the first category can take both character string similarity and semantic similarity into account, it is not practical in some scenarios. In a search scenario, for example, the goal is to determine the correlation between a query and the title of the recalled content. In principle the query and the title have a relatively high character string similarity, but because the query is usually short and the title is usually long, the character string similarity calculated by the first category is often low and does not match the real situation. Furthermore, a search platform may place no high requirement on semantic similarity, while its users have a strong demand for an accurate measure of character string similarity. Studying character string similarity is therefore of great significance in text similarity analysis.

The mature existing methods for calculating character string similarity include the edit distance algorithm (Levenshtein distance) and the Jaccard similarity coefficient.

The Jaccard similarity coefficient is calculated from the character intersection of two texts and ignores word order, so it is only applicable to text patterns that are insensitive to word order. For example, it handles 'retail wholesale' and 'wholesale retail' well. For texts that are sensitive to word order, however, such as 'one nine three eight years' and 'one eight three nine years', the Jaccard similarity is 100%, which is far from the true similarity between the two texts.
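The order-insensitivity described above can be reproduced with a minimal sketch. The function name `jaccard_similarity` and the digit strings standing in for the year examples are illustrative assumptions, not part of the patent text.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Character-level Jaccard coefficient: |A intersect B| / |A union B|."""
    chars_a, chars_b = set(text_a), set(text_b)
    if not chars_a and not chars_b:
        return 1.0  # convention chosen here for two empty texts (assumption)
    return len(chars_a & chars_b) / len(chars_a | chars_b)


# "1938" and "1839" contain the same characters in a different order,
# so the coefficient cannot tell them apart.
print(jaccard_similarity("1938", "1839"))  # 1.0
```

Because both texts reduce to the same character set, any reordering of the characters leaves the score unchanged.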

The edit distance algorithm is based on single-character insertion, deletion and modification operations at the tail of a character string and only considers the absolute difference between the character strings, so the error between its result and the real similarity is large. For example, the two texts "Shanghai magic city" and "magic city Shanghai" have a high real similarity, but the edit distance calculated by the edit distance algorithm is large, so the resulting character string similarity is very small and deviates greatly from the real similarity.
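As an illustration of this effect, the sketch below uses the textbook dynamic-programming Levenshtein distance. The letters "AB" and "CD" are hypothetical stand-ins for the characters of the two swapped words, and normalizing by the longer length is an assumption made here; the text only states that distance and similarity are inversely related.

```python
def levenshtein(source: str, target: str) -> int:
    """Standard single-character edit distance (insert, delete, substitute)."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # delete s_char
                               current[j - 1] + 1,       # insert t_char
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]


def edit_similarity(source: str, target: str) -> float:
    """Similarity as 1 - distance / longer length (normalization is an assumption)."""
    longest = max(len(source), len(target))
    return 1.0 if longest == 0 else 1.0 - levenshtein(source, target) / longest


# Swapping two two-character words costs four single-character edits.
print(levenshtein("ABCD", "CDAB"))      # 4
print(edit_similarity("ABCD", "CDAB"))  # 0.0
```

Under this normalization, two texts that differ only by swapping their two halves score as completely dissimilar.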

It should be noted that the "true similarity" referred to in the embodiments of the present invention is a similarity that satisfies the reasonable expectation of the user, where the reasonable expectation is a result summarized from extensive experience. For example, the user expects a high similarity for "Shanghai magic city" and "magic city Shanghai", and a low similarity for "one nine three eight years" and "one eight three nine years".

To overcome the above shortcomings in character string similarity calculation, the embodiments of the present invention provide a text similarity recognition method that is compatible with various text patterns and yields highly accurate similarity results.

First, an execution device of the text similarity recognition method provided by the embodiment of the present invention is introduced.

Referring to Fig. 2, Fig. 2 is a block diagram of an electronic device according to an embodiment of the present invention. The electronic device may be the server 120 in Fig. 1. The electronic device 200 may include a memory 201, a processor 202 and a communication interface 203, which are electrically connected to each other, directly or indirectly, to enable transmission or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.

The memory 201 may be used to store software programs and modules, such as instructions/modules of the text similarity recognition apparatus 400 provided in the embodiment of the present invention, which may be stored in the memory 201 in the form of software or firmware (firmware) or be fixed in an Operating System (OS) of the electronic device 200, and the processor 202 executes the software programs and modules stored in the memory 201, so as to execute various functional applications and data processing. The communication interface 203 may be used for communication of signaling or data with other node devices.

The memory 201 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The processor 202 may be an integrated circuit chip having signal processing capabilities. The processor 202 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

It will be appreciated that the configuration shown in fig. 2 is merely illustrative and that electronic device 200 may include more or fewer components than shown in fig. 2 or may have a different configuration than shown in fig. 2. The components shown in fig. 2 may be implemented in hardware, software, or a combination thereof.

Referring to Fig. 3, Fig. 3 is a schematic flowchart of a text similarity query method according to an embodiment of the present invention. The text similarity (or similarity) referred to in the present invention means character string similarity only; semantic similarity is outside the scope of the embodiments of the present invention. The method may use the electronic device shown in Fig. 2 as the execution body, and includes:

S301, acquiring the first text and the second text, and determining the common substrings corresponding to the first text and the second text.

In this embodiment, the common substring represents the same character component in the first text and the second text, and the text length of the same character component is greater than or equal to 2.

For example, if the first text is "Shanghai is the magic city, and it was called the ten-mile arena in the old times" and the second text is "the magic city Shanghai, which was called the ten-mile arena in the old times", then the common substrings corresponding to the first text and the second text are: "Shanghai", "magic city", "old times" and "ten-mile arena". For another example, if the first text and the second text are "boxhorse fresh" and "boxhorse fresh", respectively, the common substring is "boxhorse".

In an alternative embodiment, the first text and the second text may be cut by using N-Gram to obtain corresponding sentence components (Gram), and then the corresponding grams of the first text and the second text are compared in component consistency, so that the common substring may be obtained.

For example, the first text is "cell gate" and the second text is "gate east". Cutting the first text "cell gate" with 3-Gram yields grams such as "large", "small gate" and "large gate"; cutting the second text "gate east" with 3-Gram yields the grams "gate", "large", "door east", "door" and "east". After the component consistency comparison, the common substring is "gate".
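The N-Gram cutting and component consistency comparison can be sketched as follows. This is one possible reading, not the patent's own implementation: the helper names are hypothetical, grams shorter than two characters are skipped because a common substring has length of at least 2, and discarding grams contained in a longer shared gram is an added assumption. The letters in the usage example stand in for the characters of the two texts.

```python
def ngrams(text: str, max_n: int, min_n: int = 2) -> set[str]:
    """All contiguous grams of length min_n..max_n in the text."""
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}


def common_substrings(first: str, second: str, max_n: int = 3) -> list[str]:
    """Grams shared by both texts, listed in their order in the first text."""
    shared = ngrams(first, max_n) & ngrams(second, max_n)
    # Assumption: keep only grams not contained in a longer shared gram.
    maximal = [g for g in shared if not any(g != h and g in h for h in shared)]
    return sorted(maximal, key=first.index)


# "XYGD" and "GDE" share only the two-character component "GD".
print(common_substrings("XYGD", "GDE"))  # ['GD']
```

A shared component longer than `max_n` would be reported as several overlapping grams, so `max_n` needs to be chosen with the expected substring lengths in mind.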

S302, according to the word sequences of the first text and the second text, the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters except the common substrings in the first text and the second text are respectively determined.

In the embodiment of the present invention, the word order represents the relative positional relationship between the characters or character strings in a text, and it has a direct influence on the semantics of the text. For example, the text "one nine eight three years" and the text "one nine three eight years" contain the same characters, but their semantics differ greatly.

In this embodiment, the similarity contribution ratio of the common substrings characterizes how much the common substrings themselves, together with the word order difference information of the common substrings between the first text and the second text, contribute to the character string similarity of the two texts. Similarly, the similarity contribution ratio of the single characters characterizes how much the single characters themselves, together with their word order difference information between the first text and the second text, contribute to the character string similarity of the two texts.

It can be understood that if the actual string similarity of two texts is high, they should contain more identical character components, and the contribution ratio of the single character except the identical character component to the string similarity should be far smaller than that of the identical character component.

That is, for two texts, if the contribution ratio of the same character component is high, the string similarity of the two texts should be theoretically high, and conversely, if the two texts do not contain the same character component or are few, the string similarity of the two texts is correspondingly low.

By calculating the contribution ratios of the common substrings and of the characters other than the common substrings to the character string similarity, the character string similarity finally obtained conforms to the actual situation.

S303, determining the character string similarity between the first text and the second text according to the similarity contribution proportion of the common substrings and the similarity contribution proportion of the single characters.

In this embodiment, the similarity contribution ratio of the common substring and the similarity contribution ratio of the single character are added to obtain the string similarity corresponding to the first text and the second text.

In the text similarity recognition method provided by the embodiment of the invention, the common substrings corresponding to the first text and the second text are obtained and combined with the word orders of the two texts, so that the similarity contribution ratio of the common substrings and the similarity contribution ratio of the single characters other than the common substrings are considered together. The character string similarity finally obtained therefore not only takes the word order of the texts into account, but also uses the contribution ratios of the common substrings and the remaining single characters to the character string similarity of the first text and the second text, which improves the accuracy of the character string similarity so that it meets the expectation of the user.

The text similarity recognition method can be applied to various scenes such as text mining, text clustering, search recalling, search sequencing and the like, and the method is described in detail below by taking the search sequencing scene as an example.

The server 120 obtains a statement to be queried (query) sent from the terminal 110, that is, the first text, and retrieves from a query library a target rewrite statement (targetquery) corresponding to the query, that is, the second text, where the targetquery and the query satisfy a preset similarity threshold. In general, one query may correspond to multiple targetquerys, in which case each of them may serve as the second text.

For each targetquery, the server 120 cuts the query and the targetquery using N-Gram to obtain their respective character components.

The server 120 may perform a component consistency comparison on the character components corresponding to the query and the targetquery, and determine the identical character components as the common substrings corresponding to the query and the targetquery.

Then, the server 120 obtains the word order information of the common substrings in the query and in the targetquery, respectively, and calculates the similarity contribution ratio corresponding to the common substrings. It then obtains the word order information of the single characters other than the common substrings and calculates the similarity contribution ratio corresponding to the single characters.

As a first example, the query is "Shanghai is the magic city, and it was called the ten-mile arena in the old times" and the targetquery is "the magic city Shanghai, which was called the ten-mile arena in the old times", so "Shanghai", "magic city", "old times" and "ten-mile arena" are the common substrings. If these common substrings are replaced with A, B, C and D, respectively, the query may be abbreviated as "A is B and C is D" and the targetquery may be abbreviated as "C is BA which is called D". The word order of the common substrings in the query is then "A, B, C, D", and the word order of the common substrings in the targetquery is "C, D, B, A". The word order information of the single characters in the query and in the targetquery can be obtained in the same way.

Finally, the server 120 sums the similarity contribution ratio corresponding to the common substrings and the similarity contribution ratio corresponding to the single characters to obtain the character string similarity between the statement to be queried and each target rewrite statement. Based on the character string similarity of each target rewrite statement, the server 120 can determine an arrangement policy for the documents corresponding to all target rewrite statements and send it to the terminal 110, which displays the documents corresponding to the target rewrite statements in sequence according to the arrangement policy. A target rewrite statement with a greater character string similarity to the query is placed higher in the ranking result.

The following embodiment of the present invention will describe in detail how to calculate the similarity contribution ratio of the common substring and the similarity contribution ratio of the single characters in the first text and the second text except the common substring.

Referring to fig. 4, fig. 4 is a schematic flowchart of step S302 according to an embodiment of the present invention, and as shown in fig. 4, step S302 may include the following sub-steps:

Substep S302-1: according to the first word order of the common substrings in the first text and the second word order of the common substrings in the second text, determining the minimum number of word order adjustments required to adjust the first word order of the common substrings to the second word order.

In the embodiment of the present invention, the first word order represents the relative positional relationship of the common substrings in the first text, and similarly, the second word order represents the relative positional relationship of the common substrings in the second text. The minimum number of word order adjustments is the minimum number of editing operations required to adjust the relative positional relationship of the common substrings in the first text to their relative positional relationship in the second text; it may also be referred to as the word order adjustment cost. An editing operation may be an addition, a deletion or a modification.

For example, continuing with the first example, if the first word order is "A, B, C, D" and the second word order is "C, D, B, A", then the minimum number of moves or swaps required to convert ABCD to CDBA is the minimum number of word order adjustments.

In an alternative embodiment, the minimum number of word order adjustments of the common substrings may be determined by the following steps:

a1, obtaining a first character string corresponding to the first text and a second character string corresponding to the second text, wherein the first character string includes the common substrings and the first word order, and the second character string includes the common substrings and the second word order.

In this embodiment, the first character string is composed of the editing characters corresponding to the common substrings, arranged according to the relative positional relationship of the common substrings in the first text, and the second character string is composed of the editing characters corresponding to the common substrings, arranged according to their relative positional relationship in the second text. Replacing the common substrings with editing characters makes it convenient to obtain the editing sequence quickly in subsequent steps and improves editing efficiency.

In an alternative embodiment, the first character string and the second character string are obtained by first replacing each common substring with a special character, and then combining the special characters according to the relative positional relationship between the common substrings. For example, in the first example the first character string is ABCD and the second character string is CDBA, where A, B, C and D are the editing characters corresponding to the respective common substrings.

The subsequent word order adjustment is therefore more efficient, and because a shared sentence component never has to be split, editing steps are saved and the large error caused by the single-character basis of the existing edit distance algorithm is avoided.

For example, assume that the first text and the second text are "boxhorse fresh" and "fresh boxhorse". Because the existing edit distance algorithm can only insert or delete at the tail in each step, at least 4 operations are required to convert "boxhorse fresh" into "fresh boxhorse". With the method of the embodiment of the present invention, the first text is converted into the first character string EF and the second text into the second character string FE, so only one conversion is required.
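The replacement of common substrings by editing characters in step a1 can be sketched as below. The function name `to_edit_strings` is hypothetical, the labels are simply assigned alphabetically (so the example produces A and B rather than E and F), and the sketch assumes each common substring occurs exactly once in each text and that there are at most 26 of them.

```python
import string


def to_edit_strings(first: str, second: str, common: list[str]) -> tuple[str, str]:
    """Encode the order of the common substrings in each text as a short string."""
    # Hypothetical labelling: the i-th common substring gets the i-th capital letter.
    symbols = dict(zip(common, string.ascii_uppercase))

    def encode(text: str) -> str:
        # Arrange the common substrings by the position where they occur.
        ordered = sorted(common, key=text.index)
        return "".join(symbols[s] for s in ordered)

    return encode(first), encode(second)


print(to_edit_strings("boxhorse fresh", "fresh boxhorse", ["boxhorse", "fresh"]))
# ('AB', 'BA')
```

Each common substring is thereby treated as a single unit in later editing steps, regardless of how many characters it contains.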

a2, performing editing conversion on the common substrings in the first character string until the converted first character string is consistent with the second character string, and obtaining an editing sequence of the common substrings;

The editing sequence is an optimal editing sequence obtained by a dynamic programming algorithm that edits and converts each character in the character string using different editing modes. It records the plurality of editing modes corresponding to the common substrings in the order in which they are executed. The editing modes comprise an insertion mode (ins), a replacement mode (sub), and a deletion mode (del).

For example, continuing with the first example described above, if the first string and the second string are ABCD and CDBA, respectively, then the edit sequence for converting "ABCD" into "CDBA" can be referred to as shown in fig. 5, where fig. 5 is a schematic diagram of the edit sequence of the common substring provided by the embodiment of the present invention.

As shown in fig. 5, the process of converting "ABCD" into "CDBA" is described by an editing sequence (editSequence), in which, for example, the entry "sub: A -> C" characterizes one editing mode, namely that A is replaced by C.
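How such an editing sequence might be obtained by dynamic programming can be sketched with the classic Levenshtein table followed by a backtrace. This is only one possible realization: the function name `edit_sequence`, the tuple layout `(mode, source_char, target_char)`, and the extra mode name `keep` for unchanged characters are assumptions made for illustration. When several optimal sequences exist, the one returned here need not be the one shown in fig. 5.

```python
def edit_sequence(source, target):
    """Return one minimum-cost list of (mode, source_char, target_char)."""
    n, m = len(source), len(target)
    # dist[i][j] is the edit distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # keep or replacement
            )

    # Walk back from the bottom-right corner to recover the steps.
    steps = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and source[i - 1] == target[j - 1]
        if same and dist[i][j] == dist[i - 1][j - 1]:
            steps.append(("keep", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            steps.append(("sub", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            steps.append(("del", source[i - 1], None))
            i -= 1
        else:
            steps.append(("ins", None, target[j - 1]))
            j -= 1
    steps.reverse()
    return steps
```

For the strings EF and FE the sketch yields two replacements, `("sub", "E", "F")` followed by `("sub", "F", "E")`.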

a3, counting the editing mode pairs with symmetrical relation from the editing sequence, and setting the number of the editing mode pairs as the minimum word order adjusting times.

In the embodiment of the invention, when the same common substring, or a pair of common substrings, undergoes two opposite editing conversions one after the other, the two editing modes corresponding to these conversions form an editing mode pair. That is, after two characters are replaced once, they must be replaced back again; after a character is inserted, it must be deleted; and after a character is deleted, it must be inserted.

For example, with continued reference to fig. 5, "sub: A -> C" and "sub: C -> A" may form an editing mode pair having a symmetric relationship, representing that A is replaced by C and C is then replaced by A. Likewise, "ins: D" and "del: D" may form an editing mode pair having a symmetric relationship, that is, D is inserted and D is deleted.

It can be understood that, when the characters of a character string are randomly reordered to form a new character string, the editing sequence from the original character string to the new character string exhibits this symmetry rule; if the symmetry rule were not satisfied, the premise that the new character string is obtained by reordering the original character string would be violated. Therefore, the minimum word order adjustment times can be obtained by counting the total number of occurrences of this symmetry, that is, the number of editing mode pairs having a symmetric relationship.

It should be noted that in some scenarios, the replacement editing mode may not appear symmetrically in the editing sequence, but the replacement editing mode may be equivalently split into a deletion editing mode and an insertion editing mode having a symmetrical relationship once, and then can be regarded as appearing symmetrically.
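Counting the symmetric editing mode pairs can be sketched as below. The source does not spell out the pairing procedure, so this is only one possible reading: reverse replacements are paired first, any remaining replacement is split into one deletion and one insertion, and insertions and deletions of the same editing character are then paired. The function name `count_symmetric_pairs` and the step format `(mode, source_char, target_char)` are hypothetical.

```python
from collections import Counter


def count_symmetric_pairs(steps):
    """Count editing mode pairs with a symmetric relationship."""
    subs = Counter((s, d) for mode, s, d in steps if mode == "sub")
    inserts = Counter(d for mode, s, d in steps if mode == "ins")
    deletes = Counter(s for mode, s, d in steps if mode == "del")

    pairs = 0
    # Pair each replacement a -> b with a reverse replacement b -> a.
    for (a, b), count in list(subs.items()):
        if a < b:
            matched = min(count, subs[(b, a)])
            pairs += matched
            subs[(a, b)] -= matched
            subs[(b, a)] -= matched

    # Split every unpaired replacement into one deletion and one insertion.
    for (a, b), count in subs.items():
        deletes[a] += count
        inserts[b] += count

    # Pair the insertion of a character with the deletion of the same one.
    pairs += sum(min(inserts[ch], deletes[ch]) for ch in inserts)
    return pairs
```

Under this reading, the sequence "replace E by F, replace F by E" gives 1, which matches the single conversion stated for EF and FE, and "insert D, delete D" also gives 1.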

To this end, the minimum number of word order adjustments for word order adjustment of the common substring in the first text and the second text can be obtained through the above steps a1 to a3, and then the contribution value of the common substring can be obtained by combining the length information of the common substring and the first text.

The processing of the single character is described below, with continued reference to substep S302-2.

Sub-step S302-2: determining a character conversion blocking value between the first single character and the second single character, and the number of same single characters, according to the third word order of the first single character in the first text and the fourth word order of the second single character in the second text.

In the embodiment of the present invention, for convenience of description, each single character in the first text is referred to as a first single character, each single character in the second text is referred to as a second single character, the third language order represents the relative position relationship between the first single characters except the common sub-character string in the first text, and similarly, the fourth language order represents the relative position relationship between the second single characters except the common sub-character string in the second text.

The above-mentioned same single character is a single character that appears in both texts and requires no editing conversion; put simply, it is a single character with the same character and the same word order. For example, for the text "round head of kitten" and the text "beard of kitten is long", a character whose word order is the same in the two texts requires no editing conversion and may be a same single character, whereas "true" has the same character but a different word order and is therefore not a same single character.

It can be understood that, for two texts, if no word order adjustment is needed between two single characters, that is, the two single characters are located at close positions in the two texts, the single character contributes positively to the character string similarity; otherwise, it makes no contribution. The positive contribution of the single characters can therefore be determined by counting the number of same single characters.

In an alternative embodiment, the sub-step S302-2 may be performed as follows:

b1, obtaining a third character string corresponding to the first text and a fourth character string corresponding to the second text; wherein the third character string comprises the first single character and the third word order, and the fourth character string comprises the second single character and the fourth word order.

To facilitate the subsequent editing conversion of single characters, before the third character string and the fourth character string are obtained, the punctuation symbols in the first text and the second text may be converted into a symbol in a first preset format, for example by replacing each punctuation symbol with "_". Similarly, to prevent the common substrings from affecting the single character conversion result, the common substrings in the first text and the second text may be converted into a symbol in a second preset format, for example by replacing each common substring with "#". The third character string and the fourth character string are obtained after this preprocessing.

It can be understood that the third character string is the first text after preprocessing the common substring and the punctuation type character in the first text, and similarly, the fourth character string is the second text after preprocessing the common substring and the punctuation type character in the second text.
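The preprocessing described above can be sketched as follows. The function name `to_single_char_string` is hypothetical, and the use of Unicode general categories starting with "P" to detect punctuation is an assumption rather than the rule used by the embodiment. The sketch also assumes that the common substrings themselves contain no punctuation.

```python
import unicodedata


def to_single_char_string(text, common_substrings):
    """Replace punctuation with "_" and each common substring with "#"."""
    # Punctuation is replaced first, because "#" is itself a punctuation
    # character and would otherwise be turned into "_" afterwards.
    cleaned = "".join(
        "_" if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Longest substrings first, so a shorter substring cannot split a
    # longer one that contains it.
    for sub in sorted(common_substrings, key=len, reverse=True):
        cleaned = cleaned.replace(sub, "#")
    return cleaned
```

For instance, `to_single_char_string("hello, big world!", ["world"])` returns `"hello_ big #_"`.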

b2, carrying out editing conversion on the first single character in the third character string until the converted third character string is consistent with the fourth character string, and obtaining an editing sequence; the editing sequence is used for maintaining a plurality of editing modes for converting the first single character into the second single character.

For example, continuing with the first example, the first text is "shanghai is majors, which is called a ten-miles arena when old," the second text is "majors shanghai, which is called a ten-miles arena when old," the common substrings are all replaced with a special character "#", then the third string is "# which is called a #", and the fourth string is "# which is called a # #of #".

Referring to fig. 6, fig. 6 is a schematic diagram of a single character editing sequence provided by the present invention, obtained by taking the two texts of the first example as an example. It can be seen that the third character string can be converted into the fourth character string through the plurality of editing modes shown in fig. 6.

b3, determining the character conversion obstruction value and the number of the same single characters according to the type of the editing mode between the first single character and the second single character from the editing sequence.

In an alternative embodiment, the step b3 can be performed as follows:

b3-1, counting the total number of the blocked areas corresponding to the non-empty editing modes from the editing sequence.

Wherein the non-empty edit mode represents any one of the following edit modes: a replacement mode, a deletion mode, and an insertion mode.

Continuing with fig. 6, the replacement mode, the deletion mode, and the insertion mode can be regarded as obstacles, and the more obstacles occur in succession, the larger the value of the obstacle is, and there are 4 obstacles before the first character and after the last character in fig. 6.

It should be noted that the blocking area and the blocking value are both a modeling tool for determining the character string similarity calculation model according to the embodiment of the present invention, and have no other meaning.

b3-2, obtaining a statistical sequence according to the total number of the blocking areas, and calculating a character conversion blocking value based on the corresponding relation between the preset total number of the blocking areas and the blocking value.

In the embodiment of the present invention, the statistical sequence is obtained based on the total number of blocking regions in step b3-1, and the sequence value in the statistical sequence is an integer value between 1 and the total number of blocking regions (including 1 and the total number of blocking regions). For example, if there are 4 blocking regions in fig. 6, the statistical sequence is [1,2,3,4], and each sequence value in the statistical sequence can be regarded as a number of blocking regions.

In an alternative implementation, the correspondence between the number of blocking areas and the blocking value may be as shown in fig. 7, which is a schematic diagram of this correspondence according to an embodiment of the present invention. In fig. 7, cnt represents the number of blocking areas and value represents the blocking value: when cnt is less than or equal to 2, the corresponding blocking value is 0.5; when cnt is greater than 2 and less than or equal to 4, the corresponding blocking value is 1; when cnt is greater than 4 and less than or equal to 6, the corresponding blocking value is 1.5; and when cnt is greater than 6, the corresponding blocking value is 2.

According to the corresponding relationship shown in fig. 7, the character conversion blocking value may be obtained by statistically summing the number of each blocking area in the statistical sequence and the blocking value corresponding to each blocking area, and in an alternative embodiment, the character conversion blocking value may be obtained according to the following relation:

wherein obsstructValue represents the character conversion blocking value; spanList is a temporary table recording the blocking value corresponding to each number of blocking areas, for example, spanList[0] records the blocking value 0.5 corresponding to a number of blocking areas of 1, and so on; and i represents the i-th sequence value in the statistical sequence.
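Steps b3-1 and b3-2 can be sketched as below. Several points are assumptions: a blocking area is taken to be a maximal run of consecutive non-empty editing modes, the blocking value for a count of 1 or 2 is taken as 0.5 in line with the spanList[0] example, and the character conversion blocking value is taken to be the plain sum of the blocking values over the statistical sequence. The names `blocking_value`, `count_blocking_areas`, and `obstruct_value` are hypothetical.

```python
from itertools import groupby

NON_EMPTY_MODES = {"sub", "del", "ins"}


def blocking_value(cnt):
    """Blocking value for a given number of blocking areas (tiers of fig. 7)."""
    if cnt <= 2:
        return 0.5  # assumed from the spanList[0] example
    if cnt <= 4:
        return 1.0
    if cnt <= 6:
        return 1.5
    return 2.0


def count_blocking_areas(modes):
    """Count maximal runs of consecutive non-empty editing modes."""
    runs = groupby(modes, key=lambda mode: mode in NON_EMPTY_MODES)
    return sum(1 for is_blocked, _ in runs if is_blocked)


def obstruct_value(modes):
    """Sum the blocking values over the statistical sequence [1..total]."""
    total = count_blocking_areas(modes)
    statistical_sequence = range(1, total + 1)
    return sum(blocking_value(cnt) for cnt in statistical_sequence)
```

With 4 blocking areas the statistical sequence is [1, 2, 3, 4], and under these assumptions the sketch gives 0.5 + 0.5 + 1 + 1 = 3.0.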

b3-3, taking, from the editing sequence, the number of editing modes other than the target editing modes as the number of same single characters, wherein the target editing modes are the non-empty editing modes and the editing modes involving a common substring.

It should be noted that a single character contributes positively only if it appears in the editing sequence without being deleted or modified, which means that the matched single characters must be located close to each other in the original texts; otherwise there is no contribution. For this reason, the number of same single characters is counted from the editing sequence rather than from the original texts. In fig. 6, the number of same single characters (singleCharNum) is 0.
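Step b3-3 can be sketched as a simple filter over the editing sequence. The step format `(mode, source_char, target_char)`, the mode name `keep` for an unchanged character, and the function name `count_same_single_chars` are assumptions for illustration; the placeholder "#" follows the preprocessing example given earlier.

```python
def count_same_single_chars(steps, placeholder="#"):
    """Count steps that are neither non-empty nor tied to a common substring."""
    return sum(
        1
        for mode, source_char, _ in steps
        # Skip replacement, deletion and insertion steps, and skip steps
        # that merely carry over the common substring placeholder.
        if mode == "keep" and source_char != placeholder
    )
```

For a sequence that keeps "#", keeps "a", replaces "b" by "c", and keeps "d", the sketch returns 2.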

Sub-step S302-3: determining the similarity contribution ratio of the common substring based on the minimum word order adjustment times, the total number of characters of the common substring, and the length information of the first text, and determining the similarity contribution ratio of the single characters based on the character conversion blocking value, the number of same single characters, and the length information of the first text.

In the embodiment of the present invention, the similarity contribution ratio of the common substring, the minimum word order adjustment frequency, and the length information of the common substring and the first text may satisfy the following relation:

wherein similarity(gram) represents the similarity contribution ratio of the common substring; gramContrib represents the total number of characters of the common substring; queryLength represents the length information of the first text; minimumOrderAdjustmCost represents the minimum word order adjustment times; and p represents a constant coefficient.

Similarly, the similarity contribution ratio of a single character can be obtained according to the following relationship:

wherein, similarity (gram) represents the similarity contribution proportion of single characters; singleCharNum represents the number of the same single characters; the queryLength represents length information of the first text; the obsstructValue represents a character conversion barrier value; p represents a constant coefficient.

Then the similarity of the character strings corresponding to the first text and the second text, which can be finally obtained through the above two relations, can be expressed as:

In an alternative embodiment, after the similarity between the two texts is obtained, it may be further interpreted. Referring to fig. 8, fig. 8 is a schematic flow diagram of another text similarity recognition method provided in an embodiment of the present invention, and the method may further include:

S304, obtaining a plurality of preset similarity measurement intervals and a similarity label corresponding to each similarity measurement interval.

S305, matching the character string similarity with a plurality of similarity measurement intervals, and configuring the similarity labels corresponding to the matched similarity measurement intervals to the first text and the second text.

In the embodiment of the present invention, the similarity measurement intervals and the similarity labels may be defined according to actual requirements; in the definition process, the range of the character string similarity calculated in the embodiment of the present invention may be divided into intervals in line with user expectations. In an optional implementation, the multiple similarity measurement intervals may be 0 to 0.2, 0.2 to 0.35, 0.35 to 0.5, 0.5 to 0.65, 0.65 to 0.8, and 0.8 to 1, and the similarity labels corresponding to the similarity measurement intervals are: very dissimilar, somewhat similar, very similar, almost synonymous.

For ease of understanding, some of the string similarities between the first text and the second text, and the matching similarity measure intervals, similarity labels are listed below:

A Ginger dentition (small teeth): 0.4166666666666667, 0.35-0.5, somewhat similar;

sword emperor independent bottle > Sword way independent bottle: 0.5833333333333334, 0.5-0.65, similar;

money goes which > money goes where: 0.8666666666666667, 0.8-1, almost synonymous;

Shanghai is the magic city, which is called the Shilei arena in the old age > the magic city called the Shilei arena in the old age: 0.52991452991453, 0.5-0.65, similar.
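Steps S304 and S305 amount to an interval lookup, which can be sketched as below. The interval bounds and the labels "somewhat similar", "similar", and "almost synonymous" follow the listed examples. The assignment of "very dissimilar" and "very similar" to their intervals is inferred from their order, and the label "dissimilar" for 0.2 to 0.35 is a hypothetical placeholder, because the source does not state it. Treating each interval as closed on the left is also an assumption.

```python
from bisect import bisect_right

# Inner boundaries of the six similarity measurement intervals.
UPPER_BOUNDS = [0.2, 0.35, 0.5, 0.65, 0.8]

LABELS = [
    "very dissimilar",    # 0 to 0.2 (position inferred from list order)
    "dissimilar",         # 0.2 to 0.35 (hypothetical placeholder label)
    "somewhat similar",   # 0.35 to 0.5
    "similar",            # 0.5 to 0.65
    "very similar",       # 0.65 to 0.8 (position inferred from list order)
    "almost synonymous",  # 0.8 to 1
]


def similarity_label(score):
    """Return the label of the interval that contains the score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    # bisect_right counts how many boundaries are <= score, which is
    # exactly the index of the matching interval.
    return LABELS[bisect_right(UPPER_BOUNDS, score)]
```

For the scores listed above, `similarity_label(0.4166666666666667)` returns "somewhat similar" and `similarity_label(0.8666666666666667)` returns "almost synonymous".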

In an optional implementation, when there are a plurality of second texts, an arrangement order of the second texts may further be determined based on the obtained character string similarities; this order may serve as a reference for how the second texts are displayed on the display interface of a terminal. Referring to fig. 9, fig. 9 is a schematic flow diagram of another text similarity recognition method provided in an embodiment of the present invention, and the method may further include:

S306, when there are a plurality of second texts, respectively obtaining the character string similarity between each second text and the first text;

and S307, determining a similarity arrangement strategy of each second text based on the character string similarity between each second text and the first text.
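Steps S306 and S307 can be sketched as scoring every second text against the first text and sorting by score. Sorting in descending order is only one possible arrangement strategy and is an assumption here. The function name `rank_second_texts` is hypothetical, and the `similarity` argument stands for any function that returns the character string similarity of two texts.

```python
def rank_second_texts(first_text, second_texts, similarity):
    """Return (second_text, score) pairs, most similar first."""
    scored = [(text, similarity(first_text, text)) for text in second_texts]
    # The sort is stable, so texts with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

The ranked list can then be used directly as the display order on a terminal interface, or truncated to the top few entries.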

Based on the same inventive concept, fig. 10 is a functional block diagram of a text similarity recognition apparatus according to an embodiment of the present invention. Referring to fig. 10, the text similarity recognition apparatus 400 includes: an obtaining module 410, a determining module 420, and an identifying module 430.

An obtaining module 410, configured to obtain a first text and a second text, and determine a common substring corresponding to the first text and the second text;

a determining module 420, configured to determine, according to word orders of the first text and the second text, a similarity contribution ratio of the common substring and a similarity contribution ratio of a single character in the first text and the second text except the common substring respectively;

and the identifying module 430 is configured to determine the similarity of the character strings between the first text and the second text according to the similarity contribution ratio of the common sub-character strings and the similarity contribution ratio of the single character.

It is to be appreciated that the obtaining module 410, the determining module 420, and the identifying module 430 can cooperatively perform the various steps of fig. 3 to achieve the corresponding technical effect.

In an optional embodiment, the determining module 420 is specifically configured to: determining the minimum word order adjusting times of the public substrings when the first word order of the public substrings is adjusted into the second word order according to the first word order of the public substrings in the first text and the second word order of the public substrings in the second text; determining a character conversion barrier value between the first single character and the second single character and the number of the same single characters according to a third language sequence of the first single character in the first text and a fourth language sequence of the second single character in the second text; and determining the similarity contribution ratio of the common substring based on the minimum word order adjustment times and the length information of the common substring and the first text, and determining the similarity contribution ratio of the single characters based on the character conversion barrier value, the number of the same single characters and the length information of the first text.

In an optional embodiment, the determining module 420 is specifically configured to: obtain a first character string corresponding to the first text and a second character string corresponding to the second text, wherein the first character string comprises the common substring and the first word order, and the second character string comprises the common substring and the second word order; perform editing conversion on the common substrings in the first character string until the converted first character string is consistent with the second character string, and obtain an editing sequence of the common substrings, wherein the editing sequence is used for maintaining a plurality of editing modes corresponding to the common substrings; and count the editing mode pairs having a symmetric relationship from the editing sequence, and take the number of the editing mode pairs as the minimum word order adjustment times.

In an optional embodiment, the determining module 420 is specifically configured to: obtain a third character string corresponding to the first text and a fourth character string corresponding to the second text, wherein the third character string comprises the first single character and the third word order, and the fourth character string comprises the second single character and the fourth word order; perform editing conversion on the first single character in the third character string until the converted third character string is consistent with the fourth character string, and obtain an editing sequence, wherein the editing sequence is used for maintaining a plurality of editing modes for converting the first single character into the second single character; and determine, from the editing sequence, the character conversion blocking value and the number of same single characters according to the type of the editing mode between the first single character and the second single character.

In an optional embodiment, the determining module 420 is specifically configured to: counting the total number of the blocking areas corresponding to the non-empty editing modes from the editing sequence; wherein the non-empty edit mode represents any one of the following edit modes: a replacement mode, a deletion mode, and an insertion mode; obtaining a statistical sequence according to the total number of the blocking areas, and calculating a character conversion blocking value based on the corresponding relation between the preset total number of the blocking areas and the blocking value; and taking the number of the editing modes except for a target editing mode as the same single character number from the editing sequence, wherein the target editing mode is a non-empty editing mode and an editing mode with a common substring.

In an optional embodiment, the text similarity recognition apparatus 400 further includes a configuration module, and the obtaining module 410 is further configured to obtain a plurality of preset similarity measurement intervals and a similarity label corresponding to each similarity measurement interval; the configuration module is configured to match the character string similarity with the similarity measurement intervals and to configure the similarity label corresponding to the matched similarity measurement interval to the first text and the second text.

In an optional embodiment, the obtaining module 410 is further configured to, when a plurality of second texts exist, respectively obtain a character string similarity between each second text and the first text; the determining module 420 is further configured to determine a similarity ranking policy for each second text based on the character string similarity between each second text and the first text.

An embodiment of the present invention further provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the text similarity recognition method according to any one of the foregoing embodiments. The computer readable storage medium may be, but is not limited to, various media that can store program codes, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a PROM, an EPROM, an EEPROM, a magnetic disk, or an optical disk.

It should be understood that the disclosed apparatus and method may be embodied in other forms. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Claims

1. A text similarity recognition method is characterized by comprising the following steps:

acquiring a first text and a second text, and determining a common substring corresponding to the first text and the second text;

according to the word orders of the first text and the second text, respectively determining the similarity contribution ratio of the common substring and the similarity contribution ratio of single characters other than the common substring in the first text and the second text;

and determining the character string similarity between the first text and the second text according to the similarity contribution ratio of the common substring and the similarity contribution ratio of the single characters.

2. The text similarity recognition method according to claim 1, wherein respectively determining, according to the word orders of the first text and the second text, the similarity contribution ratio of the common substring and the similarity contribution ratio of the single characters other than the common substring in the first text and the second text comprises:

determining the minimum word order adjustment times of the common substring when the first word order of the common substring is adjusted into the second word order, according to the first word order of the common substring in the first text and the second word order of the common substring in the second text;

determining a character conversion barrier value between a first single character and a second single character and the number of the same single characters according to a third language sequence of the first single character in the first text and a fourth language sequence of the second single character in the second text;

and determining the similarity contribution ratio of the common substring based on the minimum word order adjustment times and the length information of the common substring and the first text, and determining the similarity contribution ratio of the single characters based on the character conversion barrier value, the number of the same single characters and the length information of the first text.

3. The text similarity recognition method according to claim 2,

determining the minimum number of word order adjustments of the common substring when the first word order of the common substring is adjusted to the second word order according to a first word order of the common substring in the first text and a second word order of the common substring in the second text, respectively, including:

obtaining a first character string corresponding to the first text and a second character string corresponding to the second text; wherein the first string comprises the common substring and the first language order; the second string comprises the common substring and the second language order;

performing editing conversion on the common substrings in the first character string until the converted first character string is consistent with the second character string, and obtaining an editing sequence of the common substrings; wherein the editing sequence is used for maintaining a plurality of editing modes corresponding to the common substring;

and counting the editing mode pairs with the symmetrical relation from the editing sequence, and taking the number of the editing mode pairs as the minimum word order adjusting times.

4. The text similarity recognition method according to claim 2, wherein determining, according to the third word order of the first single character in the first text and the fourth word order of the second single character in the second text, the character conversion barrier value between the first single character and the second single character and the number of the same single characters comprises:

obtaining a third character string corresponding to the first text and a fourth character string corresponding to the second text, wherein the third character string comprises the first single character and the third word order, and the fourth character string comprises the second single character and the fourth word order;

performing edit conversions on the first single character in the third character string until the converted third character string is identical to the fourth character string, and obtaining an edit sequence, wherein the edit sequence records a plurality of edit modes for converting the first single character into the second single character;

and determining the character conversion barrier value and the number of the same single characters from the edit sequence according to the type of the edit mode between the first single character and the second single character.

5. The text similarity recognition method according to claim 4, wherein determining the character conversion barrier value and the number of the same single characters from the edit sequence according to the type of the edit mode between the first single character and the second single character comprises:

counting, from the edit sequence, the total number of blocking areas corresponding to non-empty edit modes, wherein a non-empty edit mode is any one of the following edit modes: a replacement mode, a deletion mode, and an insertion mode;

obtaining a statistical sequence according to the total number of blocking areas, and calculating the character conversion barrier value based on a preset correspondence between the total number of blocking areas and the barrier value;

and taking, from the edit sequence, the number of edit modes other than target edit modes as the number of the same single characters, wherein the target edit modes are the non-empty edit modes and the edit modes corresponding to the common substring.
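
The counting steps of claim 5 can be sketched as below, under stated assumptions. A blocking area is taken here to be a maximal run of consecutive non-empty edit modes, and the preset correspondence is a made-up lookup table; the claim specifies neither, and the intermediate statistical sequence is not modeled. The names `PRESET_BARRIER`, `DEFAULT_BARRIER` and `barrier_and_same_count` are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

NON_EMPTY = {"replace", "delete", "insert"}

# Hypothetical preset correspondence: number of blocking areas -> barrier value.
PRESET_BARRIER: Dict[int, float] = {0: 0.0, 1: 0.1, 2: 0.2}
DEFAULT_BARRIER = 0.3  # assumed value for three or more blocking areas


def barrier_and_same_count(
    edit_sequence: List[Tuple[str, bool]]
) -> Tuple[float, int]:
    """Return (barrier value, number of same single characters).

    Each item is (edit mode, belongs_to_common_substring).
    """
    # Assumption: a blocking area is a maximal run of non-empty edit modes.
    flags = [mode in NON_EMPTY for mode, _ in edit_sequence]
    areas = sum(1 for is_non_empty, _ in groupby(flags) if is_non_empty)
    barrier = PRESET_BARRIER.get(areas, DEFAULT_BARRIER)
    # Same single characters: edit modes that are neither non-empty
    # nor attached to a common substring.
    same = sum(
        1
        for mode, in_common in edit_sequence
        if mode not in NON_EMPTY and not in_common
    )
    return barrier, same


seq = [
    ("equal", False),
    ("replace", False),
    ("replace", False),
    ("equal", True),   # part of a common substring, so not counted as "same"
    ("insert", False),
    ("equal", False),
]
print(barrier_and_same_count(seq))  # (0.2, 2)
```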

6. The text similarity recognition method according to claim 1, further comprising:

obtaining a plurality of preset similarity measurement intervals and a similarity label corresponding to each similarity measurement interval;

and matching the character string similarity against the similarity measurement intervals, and assigning the similarity label corresponding to the matched similarity measurement interval to the first text and the second text.
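
A minimal sketch of the interval matching in claim 6 follows. The interval boundaries and label names are invented for illustration, and `similarity_label` is a hypothetical helper; the claim only requires that preset intervals and their labels exist.

```python
from bisect import bisect_right

# Hypothetical boundaries; they split [0, 1] into
# [0, 0.3), [0.3, 0.7) and [0.7, 1].
BOUNDS = [0.3, 0.7]
LABELS = ["dissimilar", "partially similar", "highly similar"]


def similarity_label(similarity: float) -> str:
    """Return the label of the interval that contains the similarity."""
    return LABELS[bisect_right(BOUNDS, similarity)]


print(similarity_label(0.85))  # highly similar
print(similarity_label(0.3))   # partially similar
```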

7. The text similarity recognition method according to claim 1, further comprising:

when a plurality of second texts exist, obtaining the character string similarity between each second text and the first text respectively;

and determining a similarity ranking policy for the second texts based on the character string similarity between each second text and the first text.
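
One simple reading of the ranking policy in claim 7 is to sort the second texts by descending character string similarity, as sketched below. The function `jaccard_chars` is only a stand-in similarity measure so that the example runs; it is not the similarity defined in claim 1, and both function names are hypothetical.

```python
from typing import Callable, List, Tuple


def jaccard_chars(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of the two character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def rank_second_texts(
    first_text: str,
    second_texts: List[str],
    similarity: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every second text against the first text, best match first."""
    scored = [(t, similarity(first_text, t)) for t in second_texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_second_texts("abc", ["abd", "xyz", "abc"], jaccard_chars)
print([text for text, _ in ranked])  # ['abc', 'abd', 'xyz']
```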

8. A text similarity recognition apparatus, comprising:

an acquisition module, configured to acquire a first text and a second text and to determine a common substring corresponding to the first text and the second text;

a determining module, configured to determine, according to the word orders of the first text and the second text, the similarity contribution proportion of the common substring and the similarity contribution proportion of the single characters in the first text and the second text other than the common substring, respectively;

and a recognition module, configured to determine the character string similarity between the first text and the second text according to the similarity contribution proportion of the common substring and the similarity contribution proportion of the single characters.

9. An electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being operable to execute the computer program to implement the method of any one of claims 1 to 7.

10. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.