CN113988045B - Text similarity determining method, text processing method, corresponding device and equipment - Google Patents


Info

Publication number: CN113988045B
Application number: CN202111620649.XA
Authority: CN (China)
Prior art keywords: participle, text, similarity, determining, weight
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113988045A
Inventors: 许斯军, 田正中, 李小可, 张俊鹏
Current assignee: Zhejiang Koubei Network Technology Co Ltd
Original assignee: Zhejiang Koubei Network Technology Co Ltd
Application filed by Zhejiang Koubei Network Technology Co Ltd
Priority to application CN202111620649.XA
Publication of application CN113988045A; application granted and published as CN113988045B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The embodiment of the application provides a text similarity determining method, a text processing method, and corresponding devices and equipment. The method includes: for each participle in a first participle set, obtaining the weight of the participle and determining the distance from the participle to the closest participle in a second participle set; for each participle in the second participle set, obtaining the weight of the participle and determining the distance from the participle to the closest participle in the first participle set; and determining the similarity between a first text and a second text according to the weight and corresponding distance of each participle in the first participle set and the weight and corresponding distance of each participle in the second participle set. By determining the similarity between the two texts from the angle of the first participle set as well as the angle of the second participle set, and by combining the weight representing the importance of each participle, the accuracy of the similarity determination result can be remarkably improved.

Description

Text similarity determining method, text processing method, corresponding device and equipment
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text similarity determining method, a text processing method, and corresponding devices and apparatuses.
Background
Text similarity calculation is a common Natural Language Processing (NLP) means, aims to determine the degree of correlation between different texts, and has a very wide application prospect in the fields of data mining, data classification, information retrieval, information filtering, machine translation and the like.
In the prior art, how to calculate the similarity of various texts is a hot research topic in the industry. Although many different text similarity determination methods exist, the effect of each still leaves room for improvement.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, especially the drawback that existing text similarity determination methods are not accurate enough.
In a first aspect, the present application provides a text similarity determining method, including:
acquiring a first text and a second text of which the similarity is to be determined;
performing word segmentation processing on the first text and the second text respectively to obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text;
for each participle in the first participle set, acquiring the weight of the participle, and determining the distance from the participle to the closest participle in the second participle set;
for each participle in the second participle set, acquiring the weight of the participle, and determining the distance from the participle to the closest participle in the first participle set;
and determining the similarity of the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set.
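The steps of the first aspect can be sketched as below. This is a minimal illustrative sketch, not the patent's implementation: `segment`, `word_weight`, and `nearest_distance` are hypothetical helpers standing in for the word segmentation, weight acquisition, and nearest-participle distance steps described later in the document.

```python
def text_similarity(text_a, text_b, segment, word_weight, nearest_distance):
    """Weighted average of nearest-participle distances over both sets.

    Smaller returned distance means the two texts are more similar.
    All three helper callables are assumptions, not from the patent.
    """
    set_a = segment(text_a)   # first participle set
    set_b = segment(text_b)   # second participle set
    # For each participle: its weight, and its distance to the closest
    # participle in the *other* set (both directions are combined).
    pairs = [(word_weight(w), nearest_distance(w, set_b)) for w in set_a]
    pairs += [(word_weight(w), nearest_distance(w, set_a)) for w in set_b]
    total_weight = sum(w for w, _ in pairs)
    return sum(w * d for w, d in pairs) / total_weight if total_weight else 0.0
```

A toy usage: with uniform weights and a 0/1 exact-match distance, identical texts yield distance 0 and fully disjoint texts yield distance 1.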
In an optional implementation manner, determining similarity between the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set includes:
summing products of weights corresponding to the participles in the two participle sets and corresponding distances to obtain a first summation result;
summing the weights corresponding to the participles in the two participle sets to obtain a second summation result;
and dividing the first summation result by the second summation result to obtain a final distance for representing the similarity of the two texts.
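The first-summation, second-summation, and division steps amount to a weighted average of the per-participle distances; a minimal sketch (variable names are illustrative, not from the patent):

```python
def final_distance(weights, distances):
    """weights[i] and distances[i] belong to the i-th participle,
    taken over the participles of both sets together."""
    first = sum(w * d for w, d in zip(weights, distances))  # first summation result
    second = sum(weights)                                   # second summation result
    # Dividing the first result by the second yields the final distance
    # characterising the similarity of the two texts (smaller = more similar).
    return first / second
```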
In an optional implementation manner, summing products of weights and corresponding distances corresponding to respective participles in two participle sets to obtain a first summation result, including:
respectively extracting participles with weights larger than a first threshold value from the two participle sets to obtain corresponding participle sub-sets of the two participle sets;
if the two participle subsets do not intersect, determining, for each participle in the two participle sets, whether the participle is similar to the closest participle in the other participle set, wherein two participles being similar means that the distance between the two participles is smaller than a second threshold;
for a participle determined to be similar to the closest participle in the other participle set, subtracting the weight of the participle from a preset numerical value to obtain the inversion weight of the participle;
for a participle determined not to be similar to the closest participle in the other participle set, taking the weight of the participle as its inversion weight;
and summing products of the inversion weights and the corresponding distances corresponding to the participles in the two participle sets to obtain a first summation result.
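The inversion-weight variant above might be sketched as follows. The thresholds `t1` and `t2` and the value `preset` are invented placeholders for the first threshold, second threshold, and preset numerical value; when the high-weight subsets do intersect, this sketch falls back to the plain weights, which is an assumption, since the claim text only specifies the disjoint case.

```python
def first_summation(set_a, set_b, t1=0.5, t2=0.2, preset=1.0):
    """set_a / set_b map each participle to a tuple
    (weight, distance to the closest participle in the other set)."""
    # Extract the participles whose weight exceeds the first threshold.
    sub_a = {w for w, (wt, _) in set_a.items() if wt > t1}
    sub_b = {w for w, (wt, _) in set_b.items() if wt > t1}
    disjoint = not (sub_a & sub_b)
    total = 0.0
    for participles in (set_a, set_b):
        for wt, d in participles.values():
            if disjoint and d < t2:
                # Similar to its nearest counterpart: inversion weight.
                wt = preset - wt
            total += wt * d
    return total
```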
In an optional implementation manner, for each participle in the first participle set and the second participle set, determining a distance from the participle to a nearest participle in another participle set, including:
determining whether a similar meaning word and/or an equivalent word of the participle exists in the other participle set according to a preset similar meaning word bank and/or equivalent word bank, wherein an equivalent word of the participle is a word that is interchangeable with the participle;
and if so, determining the distance from the participle to the nearest participle in the other participle set as a preset distance.
In an optional implementation manner, for each participle in the first participle set and the second participle set, determining a distance from the participle to a nearest participle in another participle set, including:
determining a word vector of the participle and a word vector of each participle in another participle set through a word2vec model trained in advance;
calculating a word vector distance between the word vector of the participle and the word vector of each participle in the other participle set, wherein the word vector distance comprises any one of cosine distance, Euclidean distance and Manhattan distance;
and determining the minimum value among the calculated word vector distances as the distance from the participle to the closest participle in the other participle set.
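Assuming the word vectors have already been produced (e.g. by a pre-trained word2vec model), the minimum-distance step with the cosine option might look like this sketch; the Euclidean and Manhattan options mentioned above would simply swap the distance function.

```python
from math import sqrt

def cosine_distance(a, b):
    # 1 minus cosine similarity of two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_distance(vec, other_vecs):
    """Distance from one participle's vector to the closest participle in
    the other set = the minimum over all pairwise word vector distances."""
    return min(cosine_distance(vec, v) for v in other_vecs)
```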
In an optional implementation manner, for each participle in the first participle set and the second participle set, obtaining a weight of the participle includes:
and inquiring in a preset word weight standard library to obtain the weight of the participle.
In an optional implementation manner, performing word segmentation processing on the first text and the second text respectively includes:
based on a preset standard word bank and an atom word bank, performing word segmentation processing on the two texts respectively;
wherein the atomic phrases included in the atomic word bank are complete phrases into which other words cannot be inserted.
In an optional implementation manner, before performing the word segmentation processing on the first text and the second text, the method further includes:
and cleaning the data of the two texts by adopting a preset data cleaning algorithm.
In a second aspect, the present application provides a text processing method, including:
acquiring a question text;
according to the text similarity determining method shown in the first aspect, the similarity between the question text and at least one preset text is determined, and a target preset text with the highest similarity is obtained;
and distributing the question texts based on the target preset texts.
In an optional implementation manner, the preset text is a channel attribute description text;
based on the target preset text, the problem text is distributed and processed, and the method comprises the following steps:
acquiring a target channel to which the target channel attribute description text belongs;
the question text is assigned to the target channel.
In an optional implementation manner, the preset text is a responsible person responsibility description text and/or a responsible person historical problem text;
based on the target preset text, the problem text is distributed and processed, and the method comprises the following steps:
acquiring a target responsible person responsibility description text and/or a target responsible person to which a target responsible person historical problem text belongs;
the question text is assigned to the target responsible person.
In an optional implementation manner, when the preset texts are responsible person responsibility description texts and responsible person historical problem texts, determining the similarity between the question text and each of at least one preset text to obtain the target preset text with the highest similarity includes:
determining the similarity between the question text and at least one responsibility description text to obtain at least one corresponding first similarity determination result, and determining the similarity between the question text and at least one responsibility history question text to obtain at least one corresponding second similarity determination result;
determining at least one corresponding integrated similarity determination result based on at least one pair of the corresponding first similarity determination result and the second similarity determination result;
and determining the similarity determination result with the highest similarity among the at least one comprehensive similarity determination result to obtain the target responsible person responsibility description text and target responsible person historical problem text with the highest similarity.
In an optional implementation manner, for each pair of the first similarity determination result and the second similarity determination result, determining a corresponding integrated similarity determination result based on the pair of the first similarity determination result and the second similarity determination result, including:
acquiring a first weight of the first similarity determination result and a second weight of the second similarity determination result;
and based on the first weight and the second weight, carrying out weighted summation on the first similarity determination result and the second similarity determination result to obtain a corresponding comprehensive similarity determination result.
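The weighted summation of the two similarity determination results, and the selection of the highest-scoring candidate, can be sketched as below. The default weights 0.6/0.4 are arbitrary illustrative values, and the sketch assumes a higher score means higher similarity.

```python
def combined_result(first, second, w1=0.6, w2=0.4):
    # Weighted sum of the duty-description similarity (first) and the
    # historical-question similarity (second) for one responsible person.
    return w1 * first + w2 * second

def pick_target(candidates, w1=0.6, w2=0.4):
    """candidates: list of (first_similarity, second_similarity) pairs,
    one pair per responsible person. Returns the index of the candidate
    with the highest comprehensive similarity determination result."""
    scores = [combined_result(f, s, w1, w2) for f, s in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```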
In an alternative implementation, the question text is an acoustic question text, and after the question text is allocated to the target channel, the method further includes:
according to the text similarity determination method shown in the first aspect, the acoustic problem texts are clustered to generate formal problem texts.
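One simple way to cluster the acoustic question texts with a pairwise distance function (such as the first-aspect method) is a greedy single-pass grouping; both the strategy and the threshold here are illustrative assumptions, not the patent's clustering algorithm.

```python
def cluster_questions(texts, distance, threshold=0.3):
    """Each text joins the first cluster whose representative (the
    cluster's first member) is within `threshold` distance of it;
    otherwise it starts a new cluster. A formal question text could
    then be generated from each resulting cluster."""
    clusters = []
    for text in texts:
        for cluster in clusters:
            if distance(text, cluster[0]) < threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters
```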
In a third aspect, the present application provides a text similarity determination apparatus, including:
the text acquisition module is used for acquiring a first text and a second text of which the similarity is to be determined;
the word segmentation module is used for performing word segmentation processing on the first text and the second text respectively to obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text;
the first obtaining and determining module is used for obtaining the weight of each participle in the first participle set and determining the distance from the participle to the nearest participle in the second participle set;
the second obtaining and determining module is used for obtaining the weight of each participle in the second participle set and determining the distance from the participle to the nearest participle in the first participle set;
and the similarity determining module is used for determining the similarity of the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set.
In an optional implementation manner, when the similarity determining module is configured to determine the similarity between the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set, the similarity determining module is specifically configured to:
summing products of weights corresponding to the participles in the two participle sets and corresponding distances to obtain a first summation result;
summing the weights corresponding to the participles in the two participle sets to obtain a second summation result;
and dividing the first summation result by the second summation result to obtain a final distance for representing the similarity of the two texts.
In an optional implementation manner, when the similarity determination module is configured to sum products of weights and corresponding distances corresponding to respective participles in the two participle sets to obtain a first sum result, the similarity determination module is specifically configured to:
respectively extracting participles with weights larger than a first threshold value from the two participle sets to obtain corresponding participle sub-sets of the two participle sets;
if the two participle subsets do not intersect, determining, for each participle in the two participle sets, whether the participle is similar to the closest participle in the other participle set, wherein two participles being similar means that the distance between the two participles is smaller than a second threshold;
for a participle determined to be similar to the closest participle in the other participle set, subtracting the weight of the participle from a preset numerical value to obtain the inversion weight of the participle;
for a participle determined not to be similar to the closest participle in the other participle set, taking the weight of the participle as its inversion weight;
and summing products of the inversion weights and the corresponding distances corresponding to the participles in the two participle sets to obtain a first summation result.
In an optional implementation manner, for each participle in the first participle set and the second participle set, the first obtaining and determining module and the second obtaining and determining module, when being configured to determine a distance from the participle to a nearest participle in another participle set, are specifically configured to:
determining whether a similar meaning word and/or an equivalent word of the participle exists in the other participle set according to a preset similar meaning word bank and/or equivalent word bank, wherein an equivalent word of the participle is a word that is interchangeable with the participle;
and if so, determining the distance from the participle to the nearest participle in the other participle set as a preset distance.
In an optional implementation manner, for each participle in the first participle set and the second participle set, the first obtaining and determining module and the second obtaining and determining module, when being configured to determine a distance from the participle to a nearest participle in another participle set, are specifically configured to:
determining a word vector of the participle and a word vector of each participle in another participle set through a word2vec model trained in advance;
calculating a word vector distance between the word vector of the participle and the word vector of each participle in the other participle set, wherein the word vector distance comprises any one of cosine distance, Euclidean distance and Manhattan distance;
and determining the minimum value among the calculated word vector distances as the distance from the participle to the closest participle in the other participle set.
In an optional implementation manner, for each participle in the first participle set and the second participle set, the first obtaining and determining module and the second obtaining and determining module, when being configured to obtain a weight of the participle, are specifically configured to:
and inquiring in a preset word weight standard library to obtain the weight of the participle.
In an optional implementation manner, when the word segmentation module is configured to perform word segmentation processing on the first text and the second text, the word segmentation module is specifically configured to:
based on a preset standard word bank and an atom word bank, performing word segmentation processing on the two texts respectively;
wherein the atomic phrases included in the atomic word bank are complete phrases into which other words cannot be inserted.
In an optional implementation manner, the text similarity determining apparatus may further include a data cleaning module, and before the word segmentation module performs word segmentation on the first text and the second text respectively, the data cleaning module is configured to perform data cleaning on the two texts by using a preset data cleaning algorithm.
In a fourth aspect, the present application provides a text processing apparatus, comprising:
the acquisition module is used for acquiring a question text;
a determining module, configured to determine, according to the text similarity determining method shown in the first aspect, similarities between the question text and at least one preset text, respectively, and obtain a target preset text with a highest similarity;
and the distribution module is used for distributing and processing the problem text based on the target preset text.
In an optional implementation manner, the preset text is a channel attribute description text;
when allocating the question text based on the target preset text, the allocation module is specifically configured to:
acquiring a target channel to which the target channel attribute description text belongs;
the question text is assigned to the target channel.
In an optional implementation manner, the preset text is a responsible person responsibility description text and/or a responsible person historical problem text;
when allocating the question text based on the target preset text, the allocation module is specifically configured to:
acquiring a target responsible person responsibility description text and/or a target responsible person to which a target responsible person historical problem text belongs;
the question text is assigned to the target responsible person.
In an optional implementation manner, when the preset texts are responsible person responsibility description texts and responsible person historical problem texts, the determining module, when configured to determine the similarity between the question text and each of at least one preset text to obtain the target preset text with the highest similarity, is specifically configured to:
determining the similarity between the question text and at least one responsibility description text to obtain at least one corresponding first similarity determination result, and determining the similarity between the question text and at least one responsibility history question text to obtain at least one corresponding second similarity determination result;
determining at least one corresponding integrated similarity determination result based on at least one pair of the corresponding first similarity determination result and the second similarity determination result;
and determining the similarity determination result with the highest similarity among the at least one comprehensive similarity determination result to obtain the target responsible person responsibility description text and target responsible person historical problem text with the highest similarity.
In an optional implementation manner, the determining module, when configured to determine, for each pair of the first similarity determination result and the second similarity determination result, a corresponding integrated similarity determination result based on the pair of the first similarity determination result and the second similarity determination result, is specifically configured to:
acquiring a first weight of the first similarity determination result and a second weight of the second similarity determination result;
and based on the first weight and the second weight, carrying out weighted summation on the first similarity determination result and the second similarity determination result to obtain a corresponding comprehensive similarity determination result.
In an alternative implementation, the question text is an acoustic question text, the text processing apparatus may further include a generating module,
after the assignment module assigns the question text to the target channel, the generation module is configured to cluster the acoustic question text according to the text similarity determination method provided in any of the embodiments described above, and generate a formal question text.
In a fifth aspect, the present application provides an electronic device, comprising:
a processor and a memory, the memory storing at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method as set forth in the first aspect of the application.
In a sixth aspect, the present application further provides an electronic device, including:
a processor and a memory, the memory storing at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the method illustrated in the second aspect of the application.
In a seventh aspect, the present application provides a computer-readable storage medium storing at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the method shown in the first aspect of the present application.
In an eighth aspect, the present application further provides a computer-readable storage medium storing at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the method shown in the second aspect of the present application.
According to the text similarity determining method, the text processing method, and the corresponding devices and equipment provided by the embodiments of the application, the similarity between the two texts is determined from both the angle of the first participle set and the angle of the second participle set, and, by combining the weight representing the importance of each participle, the accuracy of the similarity determination result can be remarkably improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a text similarity determining method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text similarity determining apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
In order to make the objects, technical solutions and advantages of the present application more clear, the technical solutions of the present application are described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
The embodiment of the application provides a text similarity determining method, and as shown in fig. 1, the method includes:
step S101: and acquiring a first text and a second text of which the similarity is to be determined.
Optionally, the first text and the second text are short texts. A short text is a text containing relatively few words, for example, no more than two hundred words; a single sentence is also a short text. Illustratively, a question posed by a user typically contains few words and can be treated as a short text. The embodiments of the present application do not limit the field in which the text is applied or the language used.
In the embodiment of the application, determining the similarity of the two texts means determining the similarity of two short texts.
It is to be understood that "first" and "second" in the first text and the second text only indicate that two different texts with a similarity to be determined are distinguished, and are not to be construed as a limitation on the text content or the word count.
Step S102: and performing word segmentation processing on the first text and the second text respectively to obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text.
The word segmentation processing may also be referred to as word segmentation processing, and in practical applications, there are various word segmentation modes that can be adopted, and the embodiment of the present application is not limited herein.
In the embodiment of the present application, each obtained word segmentation set may include one or more words. For example, the first set of partial words may include M words { X1, X2, … …, Xm }, and the second set of partial words may include N words { Y1, Y2, … …, Yn }, where M and N are integers no less than 1.
It is to be understood that "first" and "second" in the first participle set and the second participle set only represent the distinction of different participle sets obtained for different texts and are not to be understood as a limitation on the number or content of words in the sets.
Step S103: and for each participle in the first participle set, acquiring the weight of the participle, and determining the distance from the participle to the nearest participle in the second participle set.
For example, for a participle X1 in the first participle set, the weight A1 corresponding to X1 is obtained, and the closest participle to X1 among the participles { Y1, Y2, … …, Yn } in the second participle set is determined; for example, if X1 is closest to Yn, the distance from X1 to Yn is taken as the distance from X1 to the second participle set, denoted as D11 (i.e., the distance corresponding to X1).
The same processing as that of X1 is carried out on other participles { X2, … …, Xm } in the first participle set, and the weight { A2, … …, Am } and the corresponding distance { D12, … …, D1m } of each participle are obtained.
Step S104: and for each participle in the second participle set, acquiring the weight of the participle, and determining the distance from the participle to the nearest participle in the first participle set.
For example, for a participle Y1 in the second participle set, the weight B1 corresponding to Y1 is obtained, and the distance from Y1 to the closest participle among the participles { X1, X2, ……, Xm } in the first participle set is determined. For example, if Y1 is closest to Xm in the first participle set, the distance from Y1 to Xm is taken as the distance from Y1 to the first participle set and is denoted D21 (i.e., the distance corresponding to Y1).
The same processing as that of Y1 is performed on the other participles { Y2, … …, Yn } in the second participle set, so as to obtain the weight { B2, … …, Bn } and the corresponding distance { D22, … …, D2n } corresponding to each participle.
Step S105: and determining the similarity of the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set.
In the above example, the similarity between the first text and the second text is determined according to the weight { a1, a2, … …, Am } and the corresponding distance { D11, D12, … …, D1m } of each participle { X1, X2, … …, Xm } in the first participle set, and the weight { B1, B2, … …, Bn } and the corresponding distance { D21, D22, … …, D2n } of each participle { Y1, Y2, … …, Yn } in the second participle set.
According to the text similarity determining method provided by the embodiment of the application, the similarity between the two texts is determined by combining the angle of the first participle set and the angle of the second participle set, and the accuracy of the similarity determining result can be remarkably improved by combining the weight representing the importance of each participle.
In the embodiment of the present application, a feasible implementation manner is provided for step S105, and specifically, the implementation manner may include:
step S1051: and summing products of the weights corresponding to the participles in the two participle sets and the corresponding distances to obtain a first summation result.
For example, the sum SUM1 of the products of the weight and the corresponding distance for each participle { X1, X2, ……, Xm } in the first participle set is:
SUM1 = A1*D11 + A2*D12 + …… + Am*D1m
The sum SUM2 of the products of the weight and the corresponding distance for each participle { Y1, Y2, ……, Yn } in the second participle set is:
SUM2 = B1*D21 + B2*D22 + …… + Bn*D2n
The sum of SUM1 and SUM2 is the first summation result.
Step S1052: and summing the weights corresponding to the participles in the two participle sets to obtain a second summation result.
In the above example, the weight sum WEIGHT_SUM1 for each participle { X1, X2, ……, Xm } in the first participle set is:
WEIGHT_SUM1 = A1 + A2 + …… + Am
The weight sum WEIGHT_SUM2 for each participle { Y1, Y2, ……, Yn } in the second participle set is:
WEIGHT_SUM2 = B1 + B2 + …… + Bn
The sum of WEIGHT_SUM1 and WEIGHT_SUM2 is the second summation result.
Step S1053: and dividing the first summation result by the second summation result to obtain a final distance for representing the similarity of the two texts.
That is, final distance = (SUM1 + SUM2)/(WEIGHT_SUM1 + WEIGHT_SUM2).
Text similarity = upper distance limit (this parameter may be set to a constant 100) - final distance.
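Under stated assumptions, steps S103 to S105 may be sketched in Python as follows. The word weights and the pairwise distance function below are illustrative stand-ins; the embodiments obtain weights from a preset word weight standard library and distances from a word2vec model.

```python
# Sketch of steps S103-S105: for each participle, take the distance to the
# closest participle in the other set, weight it, sum over both directions,
# and normalize by the total weight (steps S1051-S1053).

def nearest_distance(word, other_set, dist):
    """Distance from `word` to its closest participle in `other_set`."""
    return min(dist(word, other) for other in other_set)

def text_similarity(set1, set2, weights, dist, upper_limit=100.0):
    """Final similarity = upper distance limit - final distance."""
    sum1 = sum(weights[w] * nearest_distance(w, set2, dist) for w in set1)
    sum2 = sum(weights[w] * nearest_distance(w, set1, dist) for w in set2)
    weight_sum = sum(weights[w] for w in set1) + sum(weights[w] for w in set2)
    final_distance = (sum1 + sum2) / weight_sum
    return upper_limit - final_distance

# Toy distance: identical participles are at distance 0, all others at 1.
toy_dist = lambda a, b: 0.0 if a == b else 1.0
weights = {"refund": 0.8, "partial": 0.6, "order": 0.6}  # illustrative weights
sim = text_similarity({"partial", "refund"}, {"refund", "order"}, weights, toy_dist)
```

Two texts sharing their high-weight participle ("refund") score close to 100 even though their remaining words differ, because the shared key word contributes zero distance at high weight.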
According to the text similarity determining method provided by the embodiment of the application, the similarity between the two texts is determined by combining the angles of the first participle set and the second participle set, and the weights of all participles in the first text and the weights of all participles in the second text are combined at the same time, so that the accuracy of a similarity determining result can be obviously improved.
The following explains why the weight of each participle can significantly improve the accuracy of the similarity algorithm. Specifically, the embodiment of the present application provides a word weight standard, which assigns different weight values to words of different grades. As an example, the more complete the part-of-speech composition of a word, the higher its weight; or the more prominent the purpose of a word, the higher its weight. For example: super-draft card canceling automatic renewal (weight: 0.9), partial refund (weight: 0.8), new person red envelope (weight: 0.75), general noun (weight: 0.6), and the like. Further, the weight values set according to the word weight standard may be stored; for example, a preset word weight standard library may be established for storage. When the weight of each participle in the first participle set and the second participle set is obtained, it is directly looked up in the preset word weight standard library; the larger the weight, the more the participle is a key word in the text. Calculating the similarity between texts based on these weights strengthens the calculation proportion of key words and weakens that of non-key words, thereby improving the calculation accuracy.
The inventor of the application finds that, through analysis of a large amount of short text data, the short text generally has the following rule:
a. each text will typically have one or more (mostly one) central ideas.
b. Two texts are dissimilar if their central ideas are different.
Based on such findings, the embodiment of the present application provides a weight inversion strategy (rule), which may be executed with respect to step S1051. Specifically, step S1051 may include:
step SA: and respectively extracting the participles with the weight larger than a first threshold value from the two participle sets to obtain corresponding participle sub-sets of the two participle sets.
The method comprises the steps of extracting participles with weights larger than a first threshold value from a first participle set to obtain a first participle subset, and extracting participles with weights larger than the first threshold value from a second participle set to obtain a second participle subset.
The first threshold is a weight threshold, and a person skilled in the art can set the first threshold according to an actual situation, which is not limited herein in this embodiment of the present application. Illustratively, the first threshold may be 0.6.
Because participles whose weight is larger than a certain threshold are the words of higher importance in a set, they better reflect the central idea of the text. Therefore, in the embodiment of the present application, whether the central ideas of the first text and the second text are the same or similar is determined by judging whether the first participle subset and the second participle subset have an intersection.
The intersection of the first sub-set of words and the second sub-set of words means that at least one word in the first sub-set of words and at least one word in the second sub-set of words are the same or similar; conversely, the first sub-set of words and the second sub-set of words do not intersect with each other, which means that no words in the first sub-set of words and the second sub-set of words are the same or similar.
If the first sub-set of words and the second sub-set of words intersect, the above step S1051 is directly performed.
If the first sub-set of words and the second sub-set of words do not intersect, the following steps are performed.
Step SB: if the two participle subsets do not intersect, it is determined, for each participle in the two participle sets, whether the participle is similar to the closest participle in the other participle set, where two participles being similar means that the distance between them is smaller than a second threshold.
The second threshold is a distance threshold, and a person skilled in the art can set the second threshold according to an actual situation, which is not limited herein in the embodiment of the present application. Illustratively, the second threshold may be 0.05.
Illustratively, the first participle set comprises two words { W11, W12}, wherein the participle closest to W11 in the second participle set is W21, and the participle closest to W12 in the second participle set is W22.
In the embodiment of the present application, for the first participle set, it is determined whether W11 is similar to W21, and whether W12 is similar to W22, so as to perform the inverted-weight calculation.
Specifically, assuming that W11 and W21 are similar, the following step SC is performed; assuming that W12 and W22 are not similar, the following step SD is performed. The aim is to weaken similar words and strengthen dissimilar words, so that the texts as a whole are judged dissimilar. When the central ideas of the two texts are different, the determined similarity of the two texts is lower, which enhances the accuracy of the algorithm.
Similarly, for the second word set, the inversion weight calculation is also performed by the same method, which is not described herein again.
Step SC: and for the participle determined to be similar to the participle with the closest distance in the other participle set, subtracting the weight of the participle from a preset numerical value to be used as the inversion weight of the participle.
The preset value can be set by a person skilled in the art according to actual conditions, and is not limited herein. Illustratively, the predetermined value may be 1.
In the above example, if it is determined that W11 is similar to W21, assuming that the weight corresponding to W11 is A1 and the distance corresponding to W11 is D11 (i.e., the distance between W11 and W21), the inversion weight of W11 is: 1 - A1.
Step SD: and for determining the participle which is not similar to the participle which is closest to the participle in the other participle set, taking the weight of the participle as the inversion weight of the participle.
In the previous example, if it is determined that W12 and W22 are not similar, assuming that the weight corresponding to W12 is a2 and the distance corresponding to W12 (i.e., the distance between W12 and W22) is D12, the inverse weight of W12 is: A2.
step SE: and summing products of the inversion weights and the corresponding distances corresponding to the participles in the two participle sets to obtain a first summation result.
In the above example, SUM1 of the product of the inverse weight and the corresponding distance corresponding to each participle { W11, W12} in the first participle set is:
SUM1 = (1 - A1)*D11 + A2*D12
similarly, the SUM2 of the product accumulation of the inverse weight and the corresponding distance corresponding to each participle in the second participle set is also implemented by the same method, and is not described herein again.
The SUM of SUM1 and SUM2 is the first summation result.
For further calculation of the second summation result and the final distance, reference may be made to the above description, which is not repeated here.
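The weight inversion strategy of steps SA to SE may be sketched as follows. The threshold values, preset value, weight table, and distance function are all illustrative assumptions consistent with the examples above (first threshold 0.6, second threshold 0.05, preset value 1).

```python
# Sketch of the weight inversion strategy (steps SA-SE).

FIRST_THRESHOLD = 0.6    # weight threshold for extracting participle subsets (step SA)
SECOND_THRESHOLD = 0.05  # distance threshold for participle similarity (step SB)
PRESET_VALUE = 1.0       # value the weight is subtracted from (step SC)

def high_weight_subsets_intersect(set1, set2, weights, dist):
    """Step SA: do the high-weight participle subsets share a similar participle?"""
    sub1 = {w for w in set1 if weights[w] > FIRST_THRESHOLD}
    sub2 = {w for w in set2 if weights[w] > FIRST_THRESHOLD}
    return any(dist(a, b) < SECOND_THRESHOLD for a in sub1 for b in sub2)

def inverted_sum(word_set, other_set, weights, dist):
    """Steps SC-SE for one direction: weaken similar participles by inverting
    their weight, keep the weight of dissimilar participles, then accumulate
    weight * distance."""
    total = 0.0
    for w in word_set:
        d = min(dist(w, o) for o in other_set)  # nearest-participle distance
        if d < SECOND_THRESHOLD:                # similar: invert (step SC)
            total += (PRESET_VALUE - weights[w]) * d
        else:                                   # dissimilar: keep (step SD)
            total += weights[w] * d
    return total

toy_dist = lambda a, b: 0.0 if a == b else 1.0
weights = {"refund": 0.8, "order": 0.7}  # illustrative weights
```

When the two high-weight subsets do not intersect, `inverted_sum` replaces the plain weighted sum in step S1051 for both directions, so dissimilar key words dominate the final distance.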
According to the text similarity determining method provided by the embodiment of the application, the accuracy of the similarity determining result can be further improved through a weight reverse strategy.
The embodiment of the present application further provides a synonym policy (rule) and an equivalent policy (rule), which may be executed for step S103 and step S104.
For the similar meaning word strategy, the inventor of the application finds that many words differ by one word or are different in terms but actually have similar or identical meanings through a great deal of analysis. For example: [ not show for evaluation ], [ charge fee, charge fee ], contact user, not use red envelope ], and the like. In the embodiment of the present application, the distance between two synonyms is defined as a preset distance. In practical applications, a person skilled in the art can set the preset distance according to practical situations, which is not limited herein. As an example, the preset distance may be 0.0001 close to 0. It is understood that since text similarity = upper distance limit (parameter set to constant 100) -final distance, the closer the preset distance between two synonyms is to 0, the closer the similarity between two synonyms is to 100.
For the equivalent word strategy, an equivalent word of a participle refers to a word that can be substituted for the participle. One scenario is that a word appears as a different word due to an abbreviation or a name change, but is actually the same word. For example: [ superman, super member ], [ rider, knight ], and the like. In the embodiment of the present application, equivalent words may be judged by attempting the substitution when calculating the similarity of longer phrases and checking whether the results are equal. If two words are equivalent, the distance between them is set to a preset distance. In practical applications, a person skilled in the art can set the preset distance according to practical situations, which is not limited herein. As an example, the preset distance may be 0.0001, close to 0. It is understood that since text similarity = upper distance limit (the parameter set to a constant 100) - final distance, the closer the preset distance between two equivalent words is to 0, the closer the similarity between them is to 100.
In other embodiments, the preset distance between the similar meaning words and the preset distance between the equivalent words may be the same or different, that is, the preset similarity between the similar meaning words and the preset similarity between the equivalent words may be the same or different, and those skilled in the art may set the preset similarity according to actual situations, which is not limited herein.
In the embodiment of the application, the judgment of the similar meaning words can be realized by maintaining a preset similar meaning word library. Similarly, the judgment of the equivalent words can be realized by maintaining a preset equivalent word library.
Based on the similar meaning word strategy and the equivalent word strategy, aiming at each participle in the first participle set and the second participle set, the step of determining the distance from the participle to the nearest participle in the other participle set comprises the following steps:
determining whether the similar meaning words and/or the equivalent words of the participle exist in the other participle set or not according to a preset similar meaning word bank and/or an equivalent word bank; and if so, determining the distance from the participle to the nearest participle in the other participle set as a preset distance.
In practice, when determining the distance from one participle to the nearest participle in another participle set, only determining whether a near-synonym exists in the other participle set; or only judging whether equivalent words exist in another word segmentation set; and if the equivalent words or the similar words exist, directly acquiring the preset distance for calculation. Or whether an equivalent word exists in another word segmentation set or not may be judged first according to the priority order, if an equivalent word exists, the preset distance is directly obtained for calculation, if an equivalent word does not exist, whether an approximate word exists or not is judged, if an approximate word exists, the preset distance is directly obtained for calculation, and the specific judgment process is not limited in the embodiment of the present application.
In the embodiment of the application, after the distance from the participle to the nearest participle in the other participle set is determined as the preset distance, the distance from the participle to each participle in the other participle set does not need to be calculated according to the participle, so that the accuracy of the similarity determination result is improved, and meanwhile, the calculation resources are saved.
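A minimal sketch of this lookup, assuming the equivalent-word pair from the example above ([ rider, knight ]) and a hypothetical near-synonym pair; real systems would consult the maintained near-synonym and equivalent-word libraries.

```python
# Near-synonym / equivalent-word short-circuit for the nearest-participle
# distance. Lexicon contents are illustrative; equivalents are checked first,
# then near-synonyms, then the vector-based fallback distance.

PRESET_DISTANCE = 0.0001  # distance assigned to synonym/equivalent pairs

EQUIVALENTS = {frozenset(("rider", "knight"))}                 # from the example above
SYNONYMS = {frozenset(("charge fee", "collect fee"))}          # hypothetical pair

def nearest_distance(word, other_set, fallback_dist):
    for other in other_set:
        pair = frozenset((word, other))
        if pair in EQUIVALENTS or pair in SYNONYMS:
            return PRESET_DISTANCE  # no per-participle distance computation needed
    return min(fallback_dist(word, o) for o in other_set)

toy_dist = lambda a, b: 0.0 if a == b else 1.0
```

Returning early with the preset distance both improves accuracy (synonym pairs no longer depend on imperfect vector distances) and saves the pairwise distance computations, as described above.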
In the embodiment of the present application, a way of calculating a distance between two words is provided, which may be applied to calculate a distance between a word and a word closest to another word set when a word near-synonym and/or an equivalent does not exist in another word set, or may be applied to directly calculate a distance between each word and a word closest to another word set when a word near-synonym policy and/or an equivalent policy is not adopted.
Specifically, for each participle in the first participle set and the second participle set, determining the distance from the participle to the nearest participle in the other participle set, including:
determining a word vector of the participle and a word vector of each participle in another participle set through a word2vec model trained in advance; calculating a word vector distance between the word vector of the participle and the word vector of each participle in the other participle set; and determining the minimum value in the calculated distance of each word vector as the distance from the word to the word closest to the word in the other word segmentation set.
The common vector space distance comprises any one of a cosine distance, a Euclidean distance and a Manhattan distance. The embodiment of the application adopts the Euclidean distance.
In the embodiment of the application, the distance between the words is obtained by calculating the distance between the corresponding vectors of the words, namely the short text similarity calculation based on the word vectors is adopted in the scheme. Where a word vector means that each word maps to a real number representation. Word2Vec is used as a correlation model for generating Word vectors, and usually includes a neural network of shallow bilayers. Training is carried out on the basis of massive texts by utilizing the original word2vec, a trained special word2vec model is generated, and the model can take the space distance between word vectors into consideration.
An alternative distance calculation is described below, taking the cosine distance as an example. The cosine distance measures the distance between two word vectors by the cosine of their included angle. The cosine value between two vectors can be found using the Euclidean dot product formula:
a·b = ||a|| ||b|| cosθ
given two word vectors a and B, the remaining chordal distance θ is given by the dot product and the vector length, as follows:
Figure 172680DEST_PATH_IMAGE001
a hereini、BiRepresenting the components of vectors a and B, respectively. The distance given ranges from-1 to 1, -1 means that the two word vectors point in exactly the opposite direction, 1 means that the two word vectors point exactly the same, and 0 usually means that there is independence between the two word vectors.
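The dot-product formula translates directly into a few lines of Python. Real word vectors would come from the pre-trained word2vec model; the vectors used below are illustrative.

```python
# Cosine similarity of two word vectors: dot product divided by the product
# of the vector lengths (Euclidean norms).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Orthogonal vectors give 0, parallel vectors give 1, opposite vectors give -1.
```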
It should be noted that, in the prior art, any one of the cosine distance, the Euclidean distance, and the Manhattan distance may also be used to calculate similarity. For example, after each sentence is segmented, the word vectors of the words in the sentence are accumulated component-wise, and the resulting vector is used as the sentence vector; the cosine, Euclidean, or Manhattan distance is then calculated on the sentence vectors. However, cosine similarity and the like in the prior art do not take into account the spatial distance relationship between words. In the embodiment of the present application, the cosine, Euclidean, or Manhattan distance is calculated on word vectors obtained from the pre-trained word2vec model, so the spatial distance between words is considered, which facilitates mining the semantic and context information of the words and improves the calculation accuracy.
In the embodiment of the present application, a feasible implementation manner is provided for step S102, and specifically, the step of performing word segmentation processing on the first text and the second text respectively may include: and respectively performing word segmentation processing on the two texts based on a preset standard word bank and an atom word bank.
An atomic phrase in the atomic lexicon is a complete phrase into which no other words can be inserted, while new words can be spliced onto the head or tail of the atomic phrase to form a new phrase.
In the embodiment of the application, on the basis of the standard lexicon, the extended atomic lexicon formed by the atomic phrases is creatively extended and increased, corresponding weight is added to each atomic phrase, the lexicon is enriched and expanded, and the effective capture of key words can be realized in the word segmentation link, so that the accuracy of similarity calculation is improved.
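A toy sketch of segmentation over a combined standard and atomic lexicon, using greedy forward longest match so that an atomic phrase is captured as one unit. The lexicon contents are illustrative; a production system would use a full segmenter with a user dictionary.

```python
# Greedy forward longest-match segmentation: at each position, try the longest
# lexicon entry first, falling back to a single character/word.

def segment(text, lexicon):
    words, i = [], 0
    max_len = max(len(w) for w in lexicon)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical lexicon: "new person red envelope" is an atomic phrase, so the
# longest match keeps it whole instead of splitting it into shorter entries.
LEXICON = {"new person", "red envelope", "new person red envelope"}
```

Because the atomic phrase outranks its sub-phrases in match length, the key word survives segmentation intact, which is exactly the effective capture of key words described above.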
The embodiment of the present application further provides an optional implementation manner, before performing word segmentation processing on the first text and the second text, the implementation manner may further include: and cleaning the data of the two texts by adopting a preset data cleaning algorithm.
Optionally, determining stop words in the two texts by adopting a preset data cleaning algorithm; based on the stop word, data cleansing is performed on both texts.
The stop words are words unrelated to text semantics, including but not limited to advertising words, noisy words, junk words, and the like, such as "hello" and the like. Stop words may also include words containing variables, such as the word "red envelope S-ary" containing the variable S, and the like. The stop words can be expanded by the person skilled in the art according to the actual situation and are also included in the scope of the present invention.
In other embodiments, the data cleansing may further include, but is not limited to, processing means such as text disambiguation, text format modification, and the like, and a person skilled in the art may set the data cleansing according to actual situations, which is not limited herein in this embodiment of the present application.
In the embodiment of the application, the efficiency of text similarity calculation can be improved through data cleaning.
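A minimal sketch of the data-cleaning step, assuming a hypothetical stop-word list and a hypothetical regular expression for variable-bearing stop words such as "red envelope S-ary"; actual lists would be maintained by those skilled in the art.

```python
# Data cleaning before segmentation: strip variable-bearing stop phrases via
# regex, then drop plain stop words token by token.
import re

STOP_WORDS = {"hello", "please"}                            # hypothetical stop words
VARIABLE_PATTERNS = [re.compile(r"red envelope \S+-ary")]   # hypothetical variable pattern

def clean(text):
    for pattern in VARIABLE_PATTERNS:
        text = pattern.sub("", text)
    return " ".join(t for t in text.split() if t not in STOP_WORDS)
```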
An embodiment of the present application provides a text processing method, and as shown in fig. 2, the method includes:
step S201: and acquiring a question text.
The question text refers to the text corresponding to a problem to be solved in a targeted manner, obtained through means such as user complaints and feedback.
Step S202: according to the text similarity determining method provided by any one of the embodiments, the similarity between the question text and at least one preset text is determined, and the target preset text with the highest similarity is obtained.
Specifically, the similarity between the question text and at least one preset text is determined, and at least one corresponding similarity determination result is obtained, and it can be understood that each preset text corresponds to one similarity determination result. And determining the similarity determination result with the highest similarity in the at least one similarity determination result and the corresponding target preset text, so as to obtain the target preset text with the highest similarity.
Step S203: and distributing the question texts based on the target preset texts.
The embodiment of the application provides a possible implementation manner, the preset text is a channel attribute description text, and each channel corresponds to one channel attribute description text. The channels are different channels divided according to the functions of each department, and each question text can be associated with the corresponding channel.
Step S203 may specifically include: acquiring a target channel to which the target channel attribute description text belongs; the question text is assigned to the target channel.
Based on the embodiment of the application, the problem corresponding to the problem text can be automatically allocated to the corresponding channel.
Further, after assigning the question text to the target channel, the method may further include: according to the text similarity determining method provided by any one of the embodiments, the acoustic problem texts are clustered to generate formal problem texts.
After the acoustic question text is reviewed and matched to the responding channels, potential questions are mined from the acoustic question text, and the association is established as a formal question text.
The acoustic question text may be collected from massive acoustic data, such as work order data obtained by monitoring customer service, and information complained of or fed back by clients through a software interface.
It can be understood by those skilled in the art that the core of the clustering algorithm is to adopt a similarity measure to define a class cluster, specifically, input a plurality of acoustic question texts, perform the similarity measure by the text similarity determination method provided by any of the above embodiments, output all generated class clusters, and generate a corresponding formal question text based on each class cluster.
Optionally, text Clustering may be performed on the original sound problem text by using a DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and may also use other Clustering algorithms, which is not limited herein in this embodiment of the present application.
Based on the embodiment of the application, the automatic generation of the formal problem text based on the acoustic problem text can be realized, and the formal problem text is used for publishing, assigning, circulating and solving.
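A naive grouping sketch standing in for the clustering step (the embodiments may use DBSCAN or another algorithm): texts whose similarity to an existing cluster member exceeds a threshold join that cluster. The threshold and similarity function are illustrative.

```python
# Naive similarity-based grouping of acoustic question texts. Each text joins
# the first cluster containing a sufficiently similar member; otherwise it
# starts a new cluster. A formal question text would then be generated per cluster.

def cluster_texts(texts, similarity, threshold=90.0):
    clusters = []
    for text in texts:
        for cluster in clusters:
            if any(similarity(text, member) >= threshold for member in cluster):
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters

# Toy similarity: 100 for identical texts, 0 otherwise.
exact = lambda a, b: 100.0 if a == b else 0.0
```

Unlike DBSCAN, this sketch has no density requirement and does not merge clusters, but it shows where the text similarity determination method plugs in as the similarity measure.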
The embodiment of the application provides a possible implementation manner, the preset text is a responsibility description text and/or a historical problem text of a responsible person, after each problem is allocated to a channel, a corresponding problem responsible person is located in the channel, and each responsible person corresponds to one responsibility description text and/or one historical problem text of the responsible person.
Step S203 may specifically include: acquiring a target responsible person responsibility description text and/or a target responsible person to which a target responsible person historical problem text belongs; the question text is assigned to the target principal.
Specifically, when the preset text is a responsible person duty description text and a responsible person historical question text, determining the similarity between the question text and at least one preset text respectively to obtain a target preset text with the highest similarity, including:
determining the similarity between the question text and at least one responsibility description text to obtain at least one corresponding first similarity determination result, and determining the similarity between the question text and at least one responsibility history question text to obtain at least one corresponding second similarity determination result;
determining at least one corresponding integrated similarity determination result based on at least one pair of the corresponding first similarity determination result and the second similarity determination result;
and determining a similarity determination result with the highest similarity in at least one comprehensive similarity determination result to obtain a target responsible person duty description text with the highest similarity and a target responsible person historical problem text. It can be understood that the target responsible person responsibility description text and the target responsible person historical problem text correspond to the same target responsible person.
The question text used in this process may be an acoustic question text or a formal question text. If an acoustic question text is used, the formal question text finally released serves as the visible text to be accepted, and the question responsible person in the channel accepts it.
Specifically, for each respective pair of first and second similarity determination results, determining a corresponding integrated similarity determination result based on the pair of first and second similarity determination results, comprising:
acquiring a first weight of the first similarity determination result and a second weight of the second similarity determination result;
and based on the first weight and the second weight, carrying out weighted summation on the first similarity determination result and the second similarity determination result to obtain a corresponding comprehensive similarity determination result.
In practical applications, a person skilled in the art may set the values of the first weight and the second weight according to practical situations, which is not limited herein.
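The weighted summation of the two determination results can be sketched in one line; the weight values below are illustrative assumptions.

```python
# Comprehensive similarity: weighted sum of the duty-description similarity
# (first result) and the historical-question similarity (second result).

def comprehensive_similarity(duty_sim, history_sim, first_weight=0.4, second_weight=0.6):
    return first_weight * duty_sim + second_weight * history_sim

# The responsible person with the highest comprehensive result is assigned the question.
```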
Based on the embodiment of the application, the problem corresponding to the problem text can be automatically distributed to the corresponding responsible person in the problem channel.
In the following, a service platform capable of providing experience services is taken as an application scenario to illustrate the application of the technical solution provided by the embodiment of the present invention.
When the user enjoys experience service on the service platform, the user can consult customer service if any problem exists. The service platform can monitor the work order data of the customer service and collect the problem text Q corresponding to the user problem.
There are 3 channels in the service platform, channel 1, channel 2, and channel 3, each having different functions. By the text similarity determining method provided by the scheme of the application, the similarity of the question text Q and the channel attribute description texts of 3 channels is respectively calculated. The question text Q is found to have the highest similarity with the channel attribute description text of channel 2, and the question text Q is assigned to channel 2.
There are two problem responsible persons in channel 2, responsible person 1 and responsible person 2; each has different responsibilities and supervises different problems. According to the text similarity determining method provided by the scheme of the application, the similarity between the question text Q and each responsible person's duty description text and historical question text is calculated, yielding two similarity results per responsible person. The two results are weighted, summed, and compared; the final similarity result of responsible person 2 is found to be higher, and the question text Q is assigned to responsible person 2 for processing.
Through the technical scheme provided by the embodiment of the application, correct distribution of the problem texts can be realized, so that the problem solving efficiency is improved, and the user experience is improved.
An embodiment of the present application provides a text similarity determining apparatus, as shown in fig. 3, the text similarity determining apparatus 30 may include: a text acquisition module 301, a word segmentation module 302, a first acquisition and determination module 303, a second acquisition and determination module 304, and a similarity determination module 305, wherein,
the text acquisition module 301 is configured to acquire a first text and a second text with similarity to be determined;
the word segmentation module 302 is configured to perform word segmentation on the first text and the second text, respectively, to obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text;
the first obtaining and determining module 303 is configured to obtain, for each participle in the first participle set, a weight of the participle, and determine a distance from the participle to a closest participle in the second participle set;
the second obtaining and determining module 304 is configured to, for each participle in the second participle set, obtain a weight of the participle, and determine a distance from the participle to a closest participle in the first participle set;
the similarity determining module 305 is configured to determine similarity between the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set.
In an optional implementation manner, when the similarity determining module 305 is configured to determine the similarity between the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set, specifically configured to:
summing products of weights corresponding to the participles in the two participle sets and corresponding distances to obtain a first summation result;
summing the weights corresponding to the participles in the two participle sets to obtain a second summation result;
and dividing the first summation result and the second summation result to obtain a final distance for representing the similarity of the two texts.
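The division described above amounts to a weighted average of nearest-participle distances over both participle sets. The following is a minimal illustrative sketch, not the patent's actual implementation; the function name and data layout are assumptions, and the per-participle weights and distances are taken as already computed:

```python
def text_distance(weights_and_dists):
    """Weighted average distance over the participles of BOTH sets.

    weights_and_dists: list of (weight, distance) pairs, one pair per
    participle, where distance is that participle's distance to the
    closest participle in the other set.
    """
    first_sum = sum(w * d for w, d in weights_and_dists)   # sum of weight x distance
    second_sum = sum(w for w, _ in weights_and_dists)      # sum of weights
    return first_sum / second_sum if second_sum else 0.0   # final distance
```

A smaller final distance indicates a higher similarity between the two texts.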
In an alternative implementation manner, when configured to sum the products of the weights corresponding to the respective participles in the two participle sets and the corresponding distances to obtain the first summation result, the similarity determining module 305 is specifically configured to:
respectively extracting participles with weights larger than a first threshold value from the two participle sets to obtain corresponding participle sub-sets of the two participle sets;
if the two participle subsets do not intersect, determining, for each participle in the two participle sets, whether it is similar to the closest participle in the other participle set, wherein two participles being similar means that the distance between them is smaller than a second threshold value;
for each participle determined to be similar to the closest participle in the other participle set, subtracting the weight of the participle from a preset numerical value to obtain the inversion weight of the participle;
for each participle determined not to be similar to the closest participle in the other participle set, taking the weight of the participle as its inversion weight;
and summing the products of the inversion weights and the corresponding distances for the participles in the two participle sets to obtain the first summation result.
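A hedged sketch of this inversion-weight summation follows. The `weight` and `nearest_dist` mappings, the threshold values, and the preset value of 1.0 are all illustrative assumptions; the patent does not disclose concrete numbers:

```python
def first_summation(set_a, set_b, weight, nearest_dist,
                    first_threshold=0.5, second_threshold=0.3, preset_value=1.0):
    """weight: dict mapping participle -> weight.
    nearest_dist: dict mapping participle -> precomputed distance to its
    closest participle in the other set."""
    # Extract the high-weight subsets from each participle set.
    sub_a = {p for p in set_a if weight[p] > first_threshold}
    sub_b = {p for p in set_b if weight[p] > first_threshold}
    subsets_intersect = bool(sub_a & sub_b)

    def inversion_weight(p):
        if subsets_intersect:
            return weight[p]                     # subsets intersect: no inversion
        if nearest_dist[p] < second_threshold:   # a similar participle exists
            return preset_value - weight[p]      # inversion weight
        return weight[p]                         # no similar participle: keep weight

    return sum(inversion_weight(p) * nearest_dist[p]
               for p in list(set_a) + list(set_b))
```

For example, with disjoint high-weight subsets and distances below the second threshold, a high-weight participle contributes with a small inverted weight, damping the influence of important but unmatched words.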
In an optional implementation manner, when configured to determine, for each participle in the first participle set and the second participle set, the distance from the participle to the closest participle in the other participle set, the first obtaining and determining module 303 and the second obtaining and determining module 304 are specifically configured to:
determining whether a similar meaning word and/or an equivalent word of the participle exists in another participle set according to a preset similar meaning word bank and/or an equivalent word bank, wherein the equivalent word of the participle is a word which can be mutually replaced with the participle;
and if so, determining the distance from the participle to the nearest participle in the other participle set as a preset distance.
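A small sketch of this lookup is given below. The word-bank entries are hypothetical examples (the patent's actual synonym and equivalent word banks are not disclosed), and a `None` return signals that the caller should fall back to another distance measure:

```python
# Hypothetical word-bank entries, standing in for the preset banks.
SYNONYM_BANK = {frozenset(("refund", "reimburse"))}
EQUIVALENT_BANK = {frozenset(("app", "application"))}

def lookup_preset_distance(participle, other_set, preset_distance=0.0):
    """Return the preset distance if the other participle set contains a
    synonym or equivalent word of the participle; otherwise return None."""
    banks = SYNONYM_BANK | EQUIVALENT_BANK
    for other in other_set:
        if frozenset((participle, other)) in banks:
            return preset_distance
    return None
```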
In an optional implementation manner, when configured to determine, for each participle in the first participle set and the second participle set, the distance from the participle to the closest participle in the other participle set, the first obtaining and determining module 303 and the second obtaining and determining module 304 are specifically configured to:
determining a word vector of the participle and a word vector of each participle in another participle set through a word2vec model trained in advance;
calculating a word vector distance between the word vector of the participle and the word vector of each participle in the other participle set, wherein the word vector distance comprises any one of cosine distance, Euclidean distance and Manhattan distance;
and determining the minimum value among the calculated word vector distances as the distance from the participle to the closest participle in the other participle set.
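This minimum-distance step can be sketched as follows, assuming word vectors have already been produced by a trained word2vec model. The toy two-dimensional vectors stand in for real embeddings, and cosine distance is used here, although the patent equally allows Euclidean or Manhattan distance:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_to_closest(word_vec, other_set_vecs):
    """Distance from one participle's vector to the closest participle in
    the other set: the minimum over all pairwise word-vector distances."""
    return min(cosine_distance(word_vec, v) for v in other_set_vecs)
```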
In an optional implementation manner, when configured to obtain, for each participle in the first participle set and the second participle set, the weight of the participle, the first obtaining and determining module 303 and the second obtaining and determining module 304 are specifically configured to:
and inquiring in a preset word weight standard library to obtain the weight of the participle.
In an optional implementation manner, when the word segmentation module 302 is configured to perform word segmentation processing on the first text and the second text, it is specifically configured to:
based on a preset standard word bank and an atom word bank, performing word segmentation processing on the two texts respectively;
wherein the atomic phrases included in the atomic word stock are complete phrases into which no other words can be inserted.
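A greedy longest-match sketch of lexicon-based segmentation with an indivisible atomic phrase bank is shown below. The lexicons and English tokens are illustrative assumptions (the patent targets segmentation in general, and a production segmenter would handle word boundaries far more carefully):

```python
STANDARD_WORDS = {"refund", "order", "delivery"}   # hypothetical standard word bank
ATOMIC_PHRASES = {"cash on delivery"}              # indivisible atomic phrases

def segment(text):
    """Greedy longest-match segmentation: atomic phrases and standard words
    are tried longest-first, so an atomic phrase is never split; unknown
    words fall back to whitespace splitting."""
    tokens = []
    rest = text
    while rest:
        rest = rest.lstrip()
        if not rest:
            break
        for entry in sorted(ATOMIC_PHRASES | STANDARD_WORDS, key=len, reverse=True):
            if rest.startswith(entry):
                tokens.append(entry)
                rest = rest[len(entry):]
                break
        else:
            # Unknown word: take up to the next space (simplified boundary handling).
            word, _, rest = rest.partition(" ")
            tokens.append(word)
    return tokens
```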
In an optional implementation manner, the text similarity determining apparatus 30 may further include a data cleaning module, and before the word segmentation module 302 performs word segmentation processing on the first text and the second text, the data cleaning module is configured to perform data cleaning on the two texts by using a preset data cleaning algorithm.
It can be clearly understood by those skilled in the art that the implementation principle and the generated technical effect of the text similarity determining apparatus provided in the embodiment of the present application are the same as those of the foregoing method embodiment, and for convenience and brevity of description, corresponding contents in the foregoing method embodiment may be referred to where no part of the apparatus embodiment is mentioned, and are not repeated herein.
An embodiment of the present application further provides a text processing apparatus, as shown in fig. 4, the text processing apparatus 40 may include: an acquisition module 401, a determination module 402, and an assignment module 403, wherein,
the obtaining module 401 is configured to obtain a question text;
the determining module 402 is configured to determine similarities between the question text and at least one preset text respectively according to the text similarity determining method provided in any one of the embodiments, so as to obtain a target preset text with the highest similarity;
the allocating module 403 is configured to allocate the question text based on the target preset text.
In an optional implementation manner, the preset text is a channel attribute description text;
the allocating module 403, when configured to allocate the question text based on the target preset text, is specifically configured to:
acquiring a target channel to which the target channel attribute description text belongs;
the question text is assigned to the target channel.
In an optional implementation manner, the preset text is a responsible person responsibility description text and/or a responsible person historical problem text;
the allocating module 403, when configured to allocate the question text based on the target preset text, is specifically configured to:
acquiring a target responsible person responsibility description text and/or a target responsible person to which a target responsible person historical problem text belongs;
the question text is assigned to the target principal.
In an optional implementation manner, when the preset texts are a responsible person responsibility description text and a responsible person historical problem text, the determining module 402, when configured to determine the similarities between the question text and at least one preset text respectively to obtain the target preset text with the highest similarity, is specifically configured to:
determining the similarity between the question text and at least one responsibility description text to obtain at least one corresponding first similarity determination result, and determining the similarity between the question text and at least one responsibility history question text to obtain at least one corresponding second similarity determination result;
determining at least one corresponding integrated similarity determination result based on at least one pair of the corresponding first similarity determination result and the second similarity determination result;
and determining a similarity determination result with the highest similarity in at least one comprehensive similarity determination result to obtain a target responsible person duty description text with the highest similarity and a target responsible person historical problem text.
In an optional implementation manner, the determining module 402, when configured to determine, for each pair of the first similarity determination result and the second similarity determination result, a corresponding integrated similarity determination result based on the pair of the first similarity determination result and the second similarity determination result, is specifically configured to:
acquiring a first weight of the first similarity determination result and a second weight of the second similarity determination result;
and based on the first weight and the second weight, carrying out weighted summation on the first similarity determination result and the second similarity determination result to obtain a corresponding comprehensive similarity determination result.
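The weighted combination of the two per-person similarity results can be sketched as follows; the weight values here are placeholders, not values given by the patent:

```python
def combined_similarity(duty_similarity, history_similarity,
                        duty_weight=0.6, history_weight=0.4):
    """Weighted sum of the responsibility-description similarity result and
    the historical-question similarity result for one responsible person."""
    return duty_weight * duty_similarity + history_weight * history_similarity
```

The responsible person with the highest combined result receives the question text.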
In an alternative implementation, the question text is an acoustic question text, the text processing apparatus 40 may further include a generating module,
after the assignment module 403 assigns the question text to the target channel, the generation module is configured to cluster the acoustic question texts according to the text similarity determination method provided in any of the embodiments described above, and generate a formal question text.
It can be clearly understood by those skilled in the art that the implementation principle and the generated technical effect of the text processing apparatus provided in the embodiment of the present application are the same as those of the foregoing method embodiment, and for convenience and brevity of description, corresponding contents in the foregoing method embodiment may be referred to where no part of the apparatus embodiment is mentioned, and are not repeated herein.
The modules described in the embodiments of the present application may be implemented in software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
By way of example, the text similarity determination apparatus or the text processing apparatus provided in the embodiments of the present application may be a computer program (including program code) running in a computer device, for example, the text similarity determination apparatus or the text processing apparatus is a component or a module of an application program; the device can be used for executing the corresponding content of the user side in the method embodiment; or the device can be used for executing the corresponding content of the server side in the foregoing method embodiments.
In some embodiments, the text similarity determination apparatus or the text processing apparatus provided in the embodiments of the present application may be implemented by a combination of hardware and software. By way of example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to perform the methods provided in the embodiments of the present application; such a processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
Based on the same principle as the method shown in the embodiments of the present application, there is also provided in the embodiments of the present application an electronic device, which may include but is not limited to: a processor and a memory; a memory for storing a computer program; and the processor is used for executing the text similarity determination method or the text processing method shown in any embodiment of the application by calling the computer program.
In an alternative embodiment, an electronic device is provided, as shown in fig. 5, the electronic device 500 shown in fig. 5 comprising: a processor 501 and a memory 503. Wherein the processor 501 is coupled to the memory 503, such as via the bus 502. Optionally, the electronic device 500 may further include a transceiver 504, and the transceiver 504 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. It should be noted that the transceiver 504 is not limited to one in practical applications, and the structure of the electronic device 500 is not limited to the embodiment of the present application.
The processor 501 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 501 may also be a combination that implements computing functionality, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 502 may include a path that transfers information between the above components. The bus 502 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 502 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The memory 503 may be a ROM (Read-Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 503 is used for storing application program codes (computer programs) for executing the present application, and is controlled by the processor 501 for execution. The processor 501 is configured to execute application program code stored in the memory 503 to implement the content shown in the foregoing method embodiments.
The electronic device may also be a terminal device, and the electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the application scope of the embodiments of the present application.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
According to another aspect of the application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the corresponding content in the foregoing method embodiment.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be understood that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer readable storage medium provided by the embodiments of the present application may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (17)

1. A text similarity determination method is characterized by comprising the following steps:
acquiring a first text and a second text of which the similarity is to be determined;
performing word segmentation processing on the first text and the second text respectively to obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text;
for each participle in the first participle set, acquiring the weight of the participle, and determining the distance from the participle to the nearest participle in the second participle set;
for each participle in the second participle set, acquiring the weight of the participle, and determining the distance from the participle to the nearest participle in the first participle set;
determining the similarity between the first text and the second text according to the weight and the corresponding distance corresponding to each participle in the first participle set and the weight and the corresponding distance corresponding to each participle in the second participle set;
under the condition that the participle subsets corresponding to the two participle sets do not intersect, the weight corresponding to the first participle in the two participle sets adopts the difference between a preset numerical value and the weight of the first participle, the two participle subsets are obtained by respectively extracting participles with weights larger than a first threshold value from the two participle sets, and the first participle is a participle similar to the nearest participle in the other participle set.
2. The method for determining the similarity of texts according to claim 1, wherein the determining the similarity of the first text and the second text according to the corresponding weight and the corresponding distance of each participle in the first participle set and the corresponding weight and the corresponding distance of each participle in the second participle set comprises:
summing products of weights corresponding to the participles in the two participle sets and corresponding distances to obtain a first summation result;
summing the weights corresponding to the participles in the two participle sets to obtain a second summation result;
and dividing the first summation result and the second summation result to obtain a final distance for representing the similarity of the two texts.
3. The method for determining text similarity according to claim 2, wherein summing products of weights and corresponding distances corresponding to respective participles in the two participle sets to obtain a first summation result comprises:
if the two participle subsets do not intersect, determining whether each participle in the two participle sets is similar to a nearest participle in the other participle set or not, wherein the similarity of the two participles means that the distance between the two participles is smaller than a second threshold value;
for determining the participle similar to the participle closest to the participle in the other participle set, subtracting the weight of the participle from a preset numerical value to obtain the inverse weight of the participle;
for determining the participle which is not similar to the participle closest to the other participle set, taking the weight of the participle as the reversal weight of the participle;
and summing products of the inversion weights and the corresponding distances corresponding to the participles in the two participle sets to obtain the first summation result.
4. The text similarity determination method according to any one of claims 1 to 3, wherein determining, for each participle in the first participle set and the second participle set, a distance of the participle to a closest participle in another participle set comprises:
determining whether a similar meaning word and/or an equivalent word of the participle exists in the other participle set according to a preset similar meaning word bank and/or an equivalent word bank, wherein the equivalent word of the participle is a word which can be mutually replaced with the participle;
and if so, determining the distance from the participle to the nearest participle in the other participle set as a preset distance.
5. The text similarity determination method according to any one of claims 1 to 3, wherein determining, for each participle in the first participle set and the second participle set, a distance of the participle to a closest participle in another participle set comprises:
determining a word vector of the participle and a word vector of each participle in another participle set through a word2vec model trained in advance;
calculating a word vector distance between the word vector of the participle and the word vector of each participle in another participle set, wherein the word vector distance comprises any one of cosine distance, Euclidean distance and Manhattan distance;
and determining the minimum value among the calculated word vector distances as the distance from the participle to the closest participle in the other participle set.
6. The text similarity determination method according to any one of claims 1 to 3, wherein obtaining the weight of each participle in the first participle set and the second participle set comprises:
and inquiring in a preset word weight standard library to obtain the weight of the participle.
7. The text similarity determination method according to any one of claims 1 to 3, wherein performing word segmentation processing on the first text and the second text respectively comprises:
based on a preset standard word bank and an atom word bank, performing word segmentation processing on the two texts respectively;
wherein, the atomic phrases included in the atomic word stock are complete phrases which can not be inserted into other sentences.
8. A method of text processing, comprising:
acquiring a question text;
the text similarity determination method according to any one of claims 1 to 7, wherein the similarity between the question text and at least one preset text is determined to obtain a target preset text with the highest similarity;
and distributing the question text based on the target preset text.
9. The text processing method according to claim 8, wherein the predetermined text is a channel attribute description text;
the allocating the question text based on the target preset text comprises the following steps:
acquiring a target channel to which the target channel attribute description text belongs;
and allocating the question text to the target channel.
10. The text processing method according to claim 8, wherein the preset text is a responsible person responsibility description text and/or a responsible person historical problem text;
the allocating the question text based on the target preset text comprises the following steps:
acquiring a target responsible person responsibility description text and/or a target responsible person to which a target responsible person historical problem text belongs;
assigning the question text to the target principal.
11. The method of claim 10, wherein when the preset texts are a responsibility description text and a historical problem text of a principal, the determining the similarity between the problem text and at least one preset text respectively to obtain a target preset text with the highest similarity comprises:
determining the similarity between the question text and at least one responsibility description text to obtain at least one corresponding first similarity determination result, and determining the similarity between the question text and at least one responsibility history question text to obtain at least one corresponding second similarity determination result;
determining at least one corresponding integrated similarity determination result based on at least one pair of the corresponding first similarity determination result and the second similarity determination result;
and determining the similarity determination result with the highest similarity in the at least one comprehensive similarity determination result to obtain a target responsible person duty description text with the highest similarity and a target responsible person historical problem text.
12. The text processing method of claim 11, wherein determining, for each pair of the first similarity determination result and the second similarity determination result, a corresponding integrated similarity determination result based on the pair of the first similarity determination result and the second similarity determination result comprises:
acquiring a first weight of the first similarity determination result and a second weight of the second similarity determination result;
and based on the first weight and the second weight, carrying out weighted summation on the first similarity determination result and the second similarity determination result to obtain a corresponding comprehensive similarity determination result.
13. The method of claim 9, wherein the question text is an acoustic question text, and wherein after assigning the question text to the target channel, the method further comprises:
the text similarity determination method according to any one of claims 1 to 7, clustering the acoustic question texts to generate formal question texts.
14. A text similarity determination apparatus, comprising:
the text acquisition module is used for acquiring a first text and a second text of which the similarity is to be determined;
the word segmentation module is used for performing word segmentation processing on the first text and the second text respectively to obtain a first word segmentation set corresponding to the first text and a second word segmentation set corresponding to the second text;
the first obtaining and determining module is used for obtaining the weight of each participle in the first participle set and determining the distance from the participle to the nearest participle in the second participle set;
the second obtaining and determining module is used for obtaining the weight of each participle in the second participle set and determining the distance from the participle to the nearest participle in the first participle set;
a similarity determining module, configured to determine the similarity between the first text and the second text according to the weight and the corresponding distance of each participle in the first participle set and the weight and the corresponding distance of each participle in the second participle set;
wherein, in a case that the participle subsets corresponding to the two participle sets do not intersect, the weight corresponding to a first participle in the two participle sets is the difference between a preset numerical value and the weight of the first participle; the two participle subsets are obtained by extracting, from the two participle sets respectively, participles whose weights are greater than a first threshold; and the first participle is a participle similar to the nearest participle in the other participle set.
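A minimal sketch of the computation the claim-14 modules describe is given below, under assumptions the claim does not specify: toy two-dimensional word embeddings, Euclidean distance, and a 1/(1+d) mapping from accumulated weighted distance to a similarity score.

```python
import math

def nearest_dist(vec, others):
    """Distance from one participle's vector to its nearest
    participle vector in the other set."""
    return min(math.dist(vec, o) for o in others)

def text_similarity(words1, words2):
    """words1, words2: {participle: (weight, embedding_vector)} per text.
    Accumulates weight * nearest-participle distance in both directions,
    then maps the total distance to a score in (0, 1]; identical texts
    score 1.0. The mapping and embeddings are illustrative assumptions."""
    vecs1 = [vec for _, vec in words1.values()]
    vecs2 = [vec for _, vec in words2.values()]
    total = sum(w * nearest_dist(v, vecs2) for w, v in words1.values())
    total += sum(w * nearest_dist(v, vecs1) for w, v in words2.values())
    return 1.0 / (1.0 + total)

a = {"hot": (0.7, (1.0, 0.0)), "pot": (0.3, (0.0, 1.0))}
b = {"hotpot": (1.0, (0.5, 0.5))}
sim_ab = text_similarity(a, b)
```

Note this sketch does not implement the claim's special case for non-intersecting high-weight participle subsets (replacing a first participle's weight with a preset value minus that weight); that adjustment would be applied to the weights before the summation above.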
15. A text processing apparatus, comprising:
the acquisition module is used for acquiring a question text;
a determining module, configured to determine the similarity between the question text and each of at least one preset text according to the text similarity determination method of any one of claims 1 to 7, so as to obtain a target preset text with the highest similarity;
and the distribution module is used for distributing and processing the question text based on the target preset text.
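The allocation flow of claim 15 reduces to picking the preset text with the highest similarity to the question. In the sketch below, `toy_sim` is an illustrative shared-word ratio standing in for the claims-1-to-7 similarity method, which the patent defines elsewhere.

```python
def assign_target(question, presets, sim):
    """Return the preset text most similar to the question (claim 15 sketch)."""
    return max(presets, key=lambda preset: sim(question, preset))

def toy_sim(a, b):
    """Illustrative shared-word ratio; not the patent's method."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

target = assign_target("refund not received",
                       ["refund not received yet", "change delivery address"],
                       toy_sim)
print(target)  # refund not received yet
```

The question text would then be routed (e.g. to a service channel) based on the matched target preset text.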
16. An electronic device, characterized in that the electronic device comprises:
a processor and a memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method of any one of claims 1 to 7 or claims 8 to 13.
17. A computer-readable storage medium, characterized in that it stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method of any one of claims 1 to 7 or claims 8 to 13.
CN202111620649.XA 2021-12-28 2021-12-28 Text similarity determining method, text processing method, corresponding device and equipment Active CN113988045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111620649.XA CN113988045B (en) 2021-12-28 2021-12-28 Text similarity determining method, text processing method, corresponding device and equipment

Publications (2)

Publication Number Publication Date
CN113988045A CN113988045A (en) 2022-01-28
CN113988045B true CN113988045B (en) 2022-04-12

Family

ID=79734757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111620649.XA Active CN113988045B (en) 2021-12-28 2021-12-28 Text similarity determining method, text processing method, corresponding device and equipment

Country Status (1)

Country Link
CN (1) CN113988045B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662374B (en) * 2023-07-31 2023-10-20 天津市扬天环保科技有限公司 Information technology consultation service system based on correlation analysis

Citations (8)

Publication number Priority date Publication date Assignee Title
CN110555093A (en) * 2018-03-30 2019-12-10 华为技术有限公司 text matching method, device and equipment
CN111026840A (en) * 2019-11-26 2020-04-17 腾讯科技(深圳)有限公司 Text processing method, device, server and storage medium
CN111144109A (en) * 2019-12-27 2020-05-12 北京明略软件系统有限公司 Text similarity determination method and device
CN111782803A (en) * 2020-06-05 2020-10-16 京东数字科技控股有限公司 Work order processing method and device, electronic equipment and storage medium
CN112364620A (en) * 2020-11-06 2021-02-12 中国平安人寿保险股份有限公司 Text similarity judgment method and device and computer equipment
CN112733520A (en) * 2020-12-30 2021-04-30 望海康信(北京)科技股份公司 Text similarity calculation method and system, corresponding equipment and storage medium
WO2021169111A1 (en) * 2020-02-28 2021-09-02 平安国际智慧城市科技股份有限公司 Resume screening method and apparatus, computer device and storage medium
CN113761866A (en) * 2020-09-23 2021-12-07 西安京迅递供应链科技有限公司 Event processing method, device, server and medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US11734322B2 (en) * 2019-11-18 2023-08-22 Intuit, Inc. Enhanced intent matching using keyword-based word mover's distance
CN112926308B (en) * 2021-02-25 2024-01-12 北京百度网讯科技有限公司 Method, device, equipment, storage medium and program product for matching text
CN113011172B (en) * 2021-03-15 2023-08-22 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium

Non-Patent Citations (2)

Title
Text Similarity Measurement of Semantic Cognition Based on Word Vector Distance Decentralization With Clustering Analysis; Shenghan Zhou et al.; IEEE Access; 2019-08-01; pp. 107247-107258 *
A Review of Text Similarity Calculation Methods; Wang Chunliu et al.; Information Science (《情报科学》); 2019-03-31; pp. 158-168 *

Similar Documents

Publication Publication Date Title
US11023682B2 (en) Vector representation based on context
CN111914551B (en) Natural language processing method, device, electronic equipment and storage medium
CN111932386B (en) User account determining method and device, information pushing method and device, and electronic equipment
CN106469192B (en) Text relevance determining method and device
US11893491B2 (en) Compound model scaling for neural networks
US11238027B2 (en) Dynamic document reliability formulation
CN104036259A (en) Face similarity recognition method and system
CN113988045B (en) Text similarity determining method, text processing method, corresponding device and equipment
JP2023550194A (en) Model training methods, data enrichment methods, equipment, electronic equipment and storage media
CN113887213A (en) Event detection method and device based on multilayer graph attention network
CN106997340B (en) Word stock generation method and device and document classification method and device using word stock
CN113254620B (en) Response method, device and equipment based on graph neural network and storage medium
CN113821588A (en) Text processing method and device, electronic equipment and storage medium
CN111428486A (en) Article information data processing method, apparatus, medium, and electronic device
CN110968690B (en) Clustering division method and device for words, equipment and storage medium
CN112256841B (en) Text matching and countermeasure text recognition method, device and equipment
CN112989040B (en) Dialogue text labeling method and device, electronic equipment and storage medium
US11734602B2 (en) Methods and systems for automated feature generation utilizing formula semantification
CN114676677A (en) Information processing method, information processing apparatus, server, and storage medium
Nishino et al. The Cucconi statistic for Type-I censored data
US11586973B2 (en) Dynamic source reliability formulation
CN109285559B (en) Role transition point detection method and device, storage medium and electronic equipment
CN112287663A (en) Text parsing method, equipment, terminal and storage medium
CN111126617A (en) Method, device and equipment for selecting fusion model weight parameters
CN116383367B (en) Data processing method, device, equipment and medium for cold start stage of dialogue system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant