CN113535927A - Method, medium, device and computing equipment for acquiring similar texts - Google Patents

Method, medium, device and computing equipment for acquiring similar texts Download PDF

Info

Publication number
CN113535927A
CN113535927A
Authority
CN
China
Prior art keywords
text
standard
similar
mapping
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110871649.0A
Other languages
Chinese (zh)
Inventor
杨萌
冯旻伟
尹竞成
黄旭
阮良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Netease Zhiqi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Netease Zhiqi Technology Co Ltd filed Critical Hangzhou Netease Zhiqi Technology Co Ltd
Priority to CN202110871649.0A
Publication of CN113535927A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The disclosure provides a method, medium, device and computing equipment for acquiring similar texts. The text features of a standard text are determined based not only on the vector set mapped from the standard text itself, but also on the vector set mapped from the semantic role labeling results of the individual words in the standard text. The determined text features are then input into a similar text generation model to obtain at least one similar text.

Description

Method, medium, device and computing equipment for acquiring similar texts
Technical Field
Embodiments of the disclosure relate to the field of information technology, and in particular to a method, medium, apparatus, and computing device for acquiring similar texts.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In some application scenarios, several similar texts need to be acquired based on a standard text. For example, when a user interacts with a customer service system, the specific wording of the questions the user enters is non-standardized. To improve the intelligence of the customer service system, it is therefore often necessary to deploy in it a number of similar questions acquired from standard questions, and to match all similar questions corresponding to the same standard question to the same standard answer.
However, an effective technical solution for acquiring similar texts is currently lacking.
Disclosure of Invention
In this context, embodiments of the present disclosure are intended to provide a method, medium, apparatus, and computing device for obtaining similar text, so as to obtain more effective similar text based on standard text.
In a first aspect of embodiments of the present disclosure, a method for acquiring similar texts is provided, including:
acquiring a standard text;
determining text features of the standard text, including: mapping the standard text into a vector set; performing semantic role labeling on each word in the standard text, and mapping the labeling result into a vector set; and determining the text features of the standard text according to the two vector sets obtained by the mapping;
inputting the text features of the standard text into a similar text generation model, and outputting at least one similar text.
In one embodiment of the present disclosure, the similar text generation model is constructed using a SimBERT algorithm; or the similar text generation model is constructed by a multi-head attention mechanism algorithm; or the similar text generation model is constructed by adopting a recurrent neural network algorithm.
In another embodiment of the present disclosure, mapping the annotation result to a vector set includes:
if the standard text comprises at least two sentences each having an independent semantic structure, mapping, for each such sentence, the part of the labeling result corresponding to that sentence into a vector corresponding to that sentence;
and forming a vector set from the vectors corresponding to the respective sentences, or combining the vectors corresponding to the respective sentences into a single vector.
In yet another embodiment of the present disclosure, mapping the annotation result to a set of vectors includes:
for each sentence with an independent semantic structure included in the standard text, where the sentence comprises N words, mapping the part of the labeling result corresponding to the sentence into an N-dimensional vector; the dimensions of the N-dimensional vector correspond one-to-one to the words in the sentence, and the value of any dimension is determined based on the semantic role of the word corresponding to that dimension.
In another embodiment of the present disclosure, determining the text feature of the standard text according to two vector sets obtained by mapping includes:
and forming a new vector set by the two vector sets obtained by mapping, wherein the new vector set is used as the text characteristic of the standard text.
In yet another embodiment of the present disclosure, the similar text generation model is trained by:
acquiring a training sample set, wherein each training sample comprises a first type of text and a second type of text; the first type of text and the second type of text included in the same training sample have the same content meaning;
for each training sample, determining text features of each text included in the training sample, including: mapping the text into a set of vectors; performing semantic role labeling on each word in the text, and mapping a labeling result into a vector set; determining the text characteristics of the text according to the two vector sets obtained by mapping;
and training a similar text generation model by taking the text features of the first type of text included in the training sample as model input and the text features of the second type of text included in the training sample as model output.
In still another embodiment of the present disclosure, the method further includes:
calling a first translation tool to translate the standard text of the first language version into a text of a second language version;
and calling a second translation tool to translate the text of the second language version back to the text of the first language version as similar text.
In still another embodiment of the present disclosure, the method further includes:
taking the standard text as a search object, and calling a search engine to search;
and taking a plurality of search results which are specified by the search engine and have similar meanings with the search object as similar texts.
In another embodiment of the present disclosure, applied to a customer service system, the method further includes:
displaying a cold start configuration interface;
acquiring a plurality of similar questions corresponding to a standard question input into the configuration interface;
and, during the cold start process, associating at least some of the acquired similar questions with the standard answer corresponding to the standard question.
In still another embodiment of the present disclosure, the method further includes:
displaying the obtained at least one similar question through the configuration interface;
in response to a selection signal for the presented similar text, determining a selected similar question;
the associating of at least some of the acquired similar questions with the standard answer corresponding to the standard question includes:
associating the selected similar questions with the standard answer corresponding to the standard question.
In yet another embodiment of the present disclosure, presenting the obtained at least one similar text includes:
calculating the similarity between each acquired similar question and the standard question;
and displaying the acquired at least one similar text as a list, in descending order of similarity.
In a second aspect of the disclosed embodiments, there is provided an apparatus for acquiring similar text, comprising:
the standard text acquisition module is used for acquiring a standard text;
the text characteristic determining module is used for determining the text characteristic of the standard text and comprises the following steps: mapping the standard text into a vector set; performing semantic role labeling on each word in the standard text, and mapping a labeling result into a vector set; determining the text characteristics of the standard text according to the two vector sets obtained by mapping;
and the similar text acquisition module is used for inputting the text features of the standard text into a similar text generation model and outputting at least one similar text.
In a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the method of acquiring similar text of any of the embodiments of the present disclosure.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising a memory, a processor; the memory is used for storing computer instructions executable on the processor, and the processor is used for realizing the method for acquiring similar texts of any embodiment of the disclosure when executing the computer instructions.
According to the method, medium, apparatus and computing device for acquiring similar texts, the text features of the standard text are determined based not only on the vector set mapped from the standard text, but also on the vector set mapped from the semantic role labeling results of the words in the standard text. The determined text features are input into a similar text generation model to obtain at least one similar text.
Text features determined in this way contain both information about the literal expression of the standard text and information about its core semantic structure. The similar text obtained by inputting these text features into the similar text generation model is therefore not only similar to the standard text in literal expression but also inherits, as far as possible, the core semantic structure of the standard text, making it a more effective similar text.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 illustratively provides an application scenario in which a user interacts with a customer service system;
FIG. 2 schematically provides a method flow for obtaining similar text;
FIG. 3 is an exemplary illustration of a specific classification table of core arguments and additional arguments;
FIG. 4 illustrates a cold start configuration interface;
FIG. 5 is an exemplary illustration of an apparatus for obtaining similar text;
FIG. 6 is a schematic diagram of a computer-readable storage medium provided by the present disclosure;
fig. 7 is a schematic structural diagram of a computing device provided by the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts. Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the disclosure, a method, a medium, a device and a computing device for acquiring similar texts are provided.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
The inventors have found that, for natural language processing tasks that generate similar texts from a standard text, the key is how "similarity" is defined. In some application scenarios, similarity requires that the semantic information conveyed by the standard text and by the similar text can be understood without a large deviation; it is not enough to ensure that the similar text resembles the standard text in literal expression. A similar text that resembles the standard text only in literal expression may still lose the core semantics of the standard text, resulting in a large semantic deviation.
A specific application scenario is, for example, one in which a user interacts with a customer service system. In this scenario, the wording of the questions the user enters is non-standardized; to improve the intelligence of the customer service system, a number of similar questions obtained from standard questions are typically deployed in it, and all similar questions corresponding to the same standard question are matched to the same standard answer.
FIG. 1 illustratively provides an application scenario in which a user interacts with a customer service system. As shown in FIG. 1, the user enters the question to be asked, for example "how to know my points", into the customer service system. From a business point of view, the customer service system is expected to recognize the non-standardized question "how to know my points" as similar to the standard question "query user points", and to feed back to the user the standard answer corresponding to "query user points". This requires that the similar questions deployed in advance for the standard question "query user points" include "how to know my points", i.e. that "how to know my points" can be acquired as a similar question from the standard question "query user points".
If the similar question obtained from the standard question "query user points" were "query user bill", or the similar question obtained from the standard question "how to obtain points" were "know my points", the generated similar questions would resemble the standard questions only in literal expression while losing their core semantics. This easily causes the customer service system to match a non-standardized question entered by the user to a standard question with a large semantic deviation, and then to feed back the standard answer corresponding to that mismatched standard question, giving the user the bad experience of receiving irrelevant answers. Alternatively, the customer service system may simply fail to understand the user's non-standardized question, giving the user the bad experience of getting no answer at all.
Therefore, how to enable the acquired similar texts to retain the core semantics of the standard texts is crucial.
Therefore, in the technical solution provided by the disclosure, it is considered that the standard text has a certain core semantic structure; if the similar text inherits this core semantic structure as far as possible, there will be no excessive semantic deviation between the similar text and the standard text. In technical terms, semantic role labeling can be used to obtain a representation of the core semantic structure of the standard text, and this representation can then be consulted when acquiring the similar text.
In one or more embodiments provided by the present disclosure, the text features of the standard text are determined based not only on the vector set mapped from the standard text, but also on the vector set mapped from the semantic role labeling results of the words in the standard text. The determined text features are input into a similar text generation model to obtain at least one similar text.
Text features determined in this way contain both information about the literal expression of the standard text and information about its core semantic structure. The similar text obtained by inputting these text features into the similar text generation model is therefore not only similar to the standard text in literal expression but also inherits its core semantic structure as far as possible. Such a similar text does not deviate greatly from the standard text in core semantics and is more effective.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Fig. 2 exemplarily provides a method flow for obtaining similar texts, which includes the following steps:
s200: and acquiring a standard text.
S202: text features of the standard text are determined.
S204: inputting the text features of the standard text into a similar text generation model, and outputting at least one similar text.
In one or more embodiments of the present disclosure, the similar text is generated by means of the similar text generation model. In these embodiments, the text features of the standard text are generally required as the model input, and the model output is the similar text.
The step of determining text features of the standard text may comprise: mapping the standard text into a vector set; performing semantic role labeling on each word in the standard text, and mapping a labeling result into a vector set; and determining the text characteristics of the standard text according to the two vector sets obtained by mapping.
Wherein the set of vectors comprises at least one vector. In some embodiments, if the vector set includes a plurality of vectors, each vector in the vector set may be regarded as a row (or a column) to obtain a matrix.
The vector set mapped by the standard text contains information of the literal expression aspect of the standard text. The vector set mapped by the semantic role labeling result of each word in the standard text contains the information in the core semantic structure of the standard text.
The effect of mapping the standard text into a vector set is to obtain a mathematical representation of the literal expression of the standard text. The literal expression of the standard text can be vectorized using a common text mapping algorithm.
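As an illustration of this step, the following is a minimal, hypothetical sketch rather than the patent's actual algorithm: each token of a pre-segmented standard text is mapped to a one-hot vector over a small vocabulary, and the resulting list of vectors is the "vector set". A production system would more likely use a pretrained embedding such as word2vec or BERT; only the data shape is illustrated here.

```python
# Hypothetical sketch: map a text into a "vector set" by giving each
# distinct token a stable integer id and emitting a one-hot vector per
# token. Real systems would use learned embeddings instead of one-hot
# vectors; this only illustrates the shape of the mapping output.

def build_vocab(tokens):
    """Assign a stable integer id to each distinct token, in first-seen order."""
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

def text_to_vector_set(tokens, vocab):
    """Map each token to a one-hot vector; the result is the 'vector set'."""
    size = len(vocab)
    vectors = []
    for tok in tokens:
        v = [0] * size
        v[vocab[tok]] = 1
        vectors.append(v)
    return vectors

tokens = ["query", "user", "points"]   # a pre-segmented standard text (hypothetical)
vocab = build_vocab(tokens)
vector_set = text_to_vector_set(tokens, vocab)
```

Each vector in the set can later be treated as one row of a matrix, matching the description of text features below.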
Semantic role labeling assigns a semantic role to each word in a sentence. The object of semantic role labeling is a single sentence, which has an independent semantic structure; since the standard text may include one or more sentences, performing semantic role labeling on the standard text actually means performing it on each sentence of the standard text separately.
It should be noted that before semantic role labeling is performed on a sentence, the sentence is usually segmented into words. For example, open-source word segmentation tools such as HanLP or the Stanford Word Segmenter can be used.
For example, the standard text "ask for the Stardew Valley mobile Chinese version download address, do not want Baidu Cloud" includes two sentences; segmenting both sentences yields:
"ask | Stardew | Valley | mobile | Chinese | version | download | address | , | not | want | Baidu Cloud".
Semantic role labeling technologies generally support labeled semantic roles including predicates, core arguments, and additional arguments. Wherein, the predicate is generally a verb or an adjective; core arguments are words directly related to predicates, usually acting as subjects or objects in a sentence; the additional argument is other words in the statement except for the predicate and the core argument.
When semantic role labeling is performed, the label corresponding to the predicate may be set to PRED, the label corresponding to a core argument to ARG, and the label corresponding to an additional argument to ARGM.
In addition, core arguments and additional arguments may also be divided at a finer granularity. For example, there may be multiple categories of core arguments, multiple categories of additional arguments.
FIG. 3 illustratively provides a detailed classification table of core arguments and additional arguments. The label corresponding to a core argument is set to ARG-N, where N is a number from 0 to 5 and different numbers distinguish different types of core arguments (see FIG. 3 for details). The label corresponding to an additional argument is set to ARGM-XXX, where XXX is a three-letter identifier and different identifiers distinguish different types of additional arguments (see FIG. 3 for details).
In some embodiments, the semantic role labeling can be performed on the sentences by refining to the level of the specific core argument kind and the additional argument kind shown in fig. 3.
In other embodiments, semantic role labeling may be performed only at the level of whether a word belongs to a core argument or an additional argument, rather than refined to the specific core argument and additional argument types shown in FIG. 3.
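The two labeling granularities just described can be sketched as follows. This is an illustrative reconstruction: the tag names follow the document's examples (PRED, ARG-N, ARGM-XXX), and the "OTHER" category for punctuation is an assumption, not part of the patent.

```python
# Sketch: collapse fine-grained semantic role tags such as ARG0..ARG5 and
# ARGM-ADV to the coarse categories ARG / ARGM / PRED described above.
# "OTHER" (e.g. for punctuation) is a hypothetical catch-all.

def coarse_tag(fine_tag):
    """Collapse a fine-grained semantic role tag to ARG, ARGM, or PRED."""
    if fine_tag == "PRED":
        return "PRED"
    if fine_tag.startswith("ARGM"):   # must be checked before the ARG prefix
        return "ARGM"
    if fine_tag.startswith("ARG"):
        return "ARG"
    return "OTHER"
```

Note that the ARGM prefix must be tested before the ARG prefix, since every ARGM tag also starts with "ARG".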
Continuing the example, semantic role labeling is performed on the standard text "ask | Stardew | Valley | mobile | Chinese | version | download | address | , | not | want | Baidu Cloud". According to the word segmentation result there are 12 segmentation positions in total (the comma also counts as a position), and the labeling result may include:
labeling result 1: (ask, PRED, 0, 1);
labeling result 2: (Stardew Valley mobile Chinese version download address, ARG1, 1, 8);
labeling result 3: (not, ARGM-ADV, 9, 10);
labeling result 4: (want, PRED, 10, 11);
labeling result 5: (Baidu Cloud, ARG1, 11, 12).
(ask, PRED, 0, 1) indicates that after the 0th position, the 1st position is "ask" and its label is "PRED" (i.e., it is the predicate).
(Stardew Valley mobile Chinese version download address, ARG1, 1, 8) indicates that after the 1st position, the labels of the 2nd to 8th positions are all "ARG1" (i.e., core argument serving as the subject).
(not, ARGM-ADV, 9, 10) indicates that after the 9th position (the comma), the 10th position is "not" and its label is "ARGM-ADV" (i.e., additional argument serving as an adverbial).
(want, PRED, 10, 11) indicates that after the 10th position, the 11th position is "want" and its label is "PRED".
(Baidu Cloud, ARG1, 11, 12) indicates that after the 11th position, the 12th position is "Baidu Cloud" and its label is "ARG1".
The labeling result 1 and the labeling result 2 belong to the labeling result corresponding to the first sentence in the standard text, and the labeling results 3 to 5 belong to the labeling results corresponding to the second sentence in the standard text.
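The grouping of labeling results into per-sentence parts can be sketched as follows. This is a hedged reconstruction: the results are represented as hypothetical (span_text, tag, start, end) tuples with approximate English translations of the tokens, and the sentence boundary is taken to be position 9 (the comma), following the running example.

```python
# Sketch: represent the labeling results above as (span_text, tag, start,
# end) tuples and split them into per-sentence groups at a boundary
# position. Positions follow the word-segmentation indexing of the example.

results = [
    ("ask", "PRED", 0, 1),
    ("Stardew Valley mobile Chinese version download address", "ARG1", 1, 8),
    ("not", "ARGM-ADV", 9, 10),
    ("want", "PRED", 10, 11),
    ("Baidu Cloud", "ARG1", 11, 12),
]

def split_by_sentence(results, boundary):
    """Group labeling results into those ending at/before and those starting
    at/after the sentence boundary position."""
    first = [r for r in results if r[3] <= boundary]
    second = [r for r in results if r[2] >= boundary]
    return first, second

sent1, sent2 = split_by_sentence(results, 9)
# sent1 holds labeling results 1-2; sent2 holds labeling results 3-5
```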
In the step of mapping the labeling result into a vector set, various methods can be adopted as long as the labeling result can be converted into a mathematical representation in the form of a vector set.
In some embodiments, if the standard text includes at least two sentences having independent semantic structures, the part of the annotation result corresponding to the sentence can be mapped to the vector corresponding to the sentence for each sentence having an independent semantic structure. Then, vectors corresponding to each sentence with an independent semantic structure may be combined into a vector set, or vectors corresponding to each sentence with an independent semantic structure may be combined into one vector.
In some embodiments, for each sentence with an independent semantic structure included in the standard text, where the sentence comprises N words, the part of the labeling result corresponding to the sentence is mapped into an N-dimensional vector; the dimensions of the N-dimensional vector correspond one-to-one to the words in the sentence, and the value of any dimension is determined based on the semantic role of the word corresponding to that dimension.
For example, continuing the example above, each single word (or single punctuation mark) in the standard text is treated as one vector dimension: if the label corresponding to a dimension is ARG, it is mapped to 1; if ARGM, to 2; if PRED, to 3; and if the dimension corresponds to a punctuation mark (such as the comma), to 0. The standard text "ask for the Stardew Valley mobile Chinese version download address, do not want Baidu Cloud" can thus be mapped into the following vector:
(3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,2,3,1,1,1)。
This vector includes 21 dimensions, corresponding one-to-one to the 21 words (or punctuation marks) in the standard text. This single vector may serve as the vector set of the standard text.
As another example, the two sentences in the standard text may each be mapped with the same mapping rule as in the previous example, yielding two 21-dimensional vectors that form the vector set corresponding to the standard text:
(3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0) and (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 1, 1, 1).
Since the first sentence includes 15 words, the 16th to 21st dimensions of the vector corresponding to the first sentence are all 0. Since the second sentence includes 5 words, the 1st to 16th dimensions of the vector corresponding to the second sentence are all 0.
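The mapping rule used in the two examples above can be sketched as follows. This is an illustrative reconstruction: the per-position tag list is a hypothetical expansion of the running example's labeling results (15 positions in sentence one, the comma, then 5 positions in sentence two), not the output of an actual labeling tool.

```python
# Sketch of the mapping rule above: ARG -> 1, ARGM -> 2, PRED -> 3,
# punctuation -> 0, applied per word/punctuation position. The tag list
# below is a hypothetical per-position expansion of the example.

TAG_VALUE = {"ARG": 1, "ARGM": 2, "PRED": 3, "PUNCT": 0}

def tags_to_vector(position_tags):
    """Map a per-position semantic role tag sequence to the numeric vector."""
    return [TAG_VALUE[t] for t in position_tags]

position_tags = (["PRED"] + ["ARG"] * 14            # sentence one: 15 positions
                 + ["PUNCT"]                         # the comma
                 + ["ARGM", "PRED"] + ["ARG"] * 3)   # sentence two: 5 positions

vector = tags_to_vector(position_tags)
# vector has 21 dimensions: 3, fourteen 1s, 0, then 2, 3, 1, 1, 1
```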
After the standard text is mapped into a vector set and a labeling result obtained by performing semantic role labeling on the standard text is mapped into the vector set, text features to be input into a similar text generation model can be determined according to the two vector sets. There are various ways of determining the text features according to the two vector sets, as long as the mathematical representations corresponding to the two vector sets can be combined to obtain a new mathematical representation.
In some embodiments, the two vector sets obtained by mapping may be combined into a new vector set as a text feature of the standard text. For example, each vector in the two vector sets can be directly used as a row (or a column) to form a matrix as a text feature.
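The combination step above can be sketched as follows, under the assumption that every vector in both sets shares one dimensionality. This is a plain list-of-lists illustration; a real implementation would more likely stack the rows into a tensor for the model.

```python
# Sketch: form the text feature by treating every vector from both vector
# sets as one row of a matrix. A production system would likely use a
# numeric tensor; plain lists are used here to keep the sketch minimal.

def combine_vector_sets(set_a, set_b):
    """Concatenate two vector sets row-wise into one feature matrix."""
    assert all(len(v) == len(set_a[0]) for v in set_a + set_b), \
        "all vectors must have the same dimensionality"
    return set_a + set_b   # each vector becomes one matrix row

text_vectors = [[0.1, 0.2, 0.3]]   # hypothetical: mapped from the text itself
role_vectors = [[3.0, 1.0, 1.0]]   # hypothetical: mapped from the labeling result
feature_matrix = combine_vector_sets(text_vectors, role_vectors)
```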
And inputting the obtained text features into a similar text generation model to generate at least one similar text.
It should be noted that the similar text generation model applied in the method flow shown in FIG. 2 may be constructed and trained in advance; that is, before the model is used to implement the method flow shown in FIG. 2, a model construction phase and a model training phase may be carried out first.
Model construction and model training are described herein.
In the model building stage, the similar text generation model can be built by adopting various text generation algorithms. In some embodiments, a SimBERT algorithm may be employed to construct a similar text generation model.
Alternatively, a multi-head attention mechanism algorithm may be used to construct the similar text generation model, or a recurrent neural network algorithm (e.g., an LSTM or GRU algorithm) may be used; further options are not enumerated here.
In the model training stage, a training sample set can be obtained, wherein each training sample comprises a first type of text and a second type of text; the first type of text and the second type of text included in the same training sample have the same content meaning.
Next, for each training sample, text features of each text included in the training sample may be determined, including: mapping the text into a set of vectors; performing semantic role labeling on each word in the text, and mapping a labeling result into a vector set; and determining the text characteristics of the text according to the two vector sets obtained by mapping.
Then, the text features of the first type of text included in the training sample can be used as model input, and the text features of the second type of text included in the training sample can be used as model output, so that the similar text generation model can be trained.
It should be noted that, in training the similar text generation model, the text features of the first type of text in a training sample are used as the model input and the text features of the second type of text in the same training sample as the model output, so that the model learns the similarity rules, at the level of the core semantic structure, between first-type and second-type texts that share the same content meaning. The trained similar text generation model can therefore generate, from an input standard text, similar texts that conform to the learned similarity rules.
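A minimal sketch of assembling such input/output training pairs from a sample set; the dict keys and the `featurize` callable are assumptions for illustration, standing in for the feature-determination steps described above:

```python
def build_training_pairs(samples, featurize):
    """For each training sample, the first-type text's features become the
    model input and the second-type text's features the expected output."""
    return [
        (featurize(sample["first_text"]), featurize(sample["second_text"]))
        for sample in samples
    ]

# Hypothetical usage with a trivial featurizer (real features would be the
# vector sets described above):
pairs = build_training_pairs(
    [{"first_text": "how do I pay", "second_text": "what are the payment steps"}],
    featurize=lambda text: text.split(),
)
```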
In addition, in some embodiments provided by the present disclosure, in addition to generating several similar texts using a similar text generation model, more similar texts may be obtained using other approaches.
For example, a first translation tool may be invoked to translate the standard text in a first language version to text in a second language version. Then, a second translation tool is invoked to translate the text in the second language version back to the text in the first language version as similar text.
The first translation tool and the second translation tool may be the same translation tool or different translation tools. The first language version is a different language version than the second language version.
Of course, more translation tools (e.g., a third translation tool) may be used, or the standard text in the first language version may be translated into more other language versions (e.g., a third language version) and translated back to obtain more similar text.
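The back-translation route above can be sketched as follows; the `translate` callable stands in for whichever translation tools are invoked, and its signature is an assumption for illustration:

```python
def back_translate(standard_text, translate, source_lang, pivot_lang):
    """Translate the standard text into a pivot language and back again;
    the round-tripped text is returned as a candidate similar text."""
    pivot_text = translate(standard_text, src=source_lang, dst=pivot_lang)
    return translate(pivot_text, src=pivot_lang, dst=source_lang)

# A fake lookup table standing in for real first/second translation tools:
TABLE = {
    ("zh", "en", "如何付款"): "how to pay",
    ("en", "zh", "how to pay"): "怎样进行付款",
}

def fake_translate(text, src, dst):
    return TABLE[(src, dst, text)]

similar = back_translate("如何付款", fake_translate, "zh", "en")
```

The same helper can be chained through additional pivot languages to obtain more candidates.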
In another example, a search engine may be invoked to perform a search with the standard text as the search object, and several search results that the search engine designates as similar in meaning to the search object are taken as similar texts. If the technical solution of the present disclosure is to be applied in a scenario where a user interacts with a customer service system, the search engine may be that of a question-and-answer community website.
After a plurality of similar texts are obtained through one or more of these routes, the similarity between the standard text and each similar text can be calculated, and the similar texts sorted in descending order of similarity. In some implementations, similar texts whose similarity falls below a specified threshold may be discarded.
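The filtering and ranking step can be sketched as follows; the 0.5 threshold and the `similarity_fn` callable are illustration assumptions:

```python
def rank_similar_texts(similar_texts, similarity_fn, threshold=0.5):
    """Score each candidate against the standard text, drop candidates
    below the threshold, and return the rest in descending order of
    similarity."""
    scored = [(text, similarity_fn(text)) for text in similar_texts]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy similarity function for illustration only:
ranked = rank_similar_texts(["a", "bb", "cccc"], similarity_fn=lambda t: len(t) / 4)
```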
The present disclosure provides a technical solution for calculating similarity between a standard text and a similar text, as follows:
The standard text and each similar text may be vectorized separately. Taking the standard text as an example, the standard text is first segmented into words; then, on one hand, the tf-idf value (term frequency-inverse document frequency) of each word in the standard text is calculated, and on the other hand, the word vector of each word is looked up in a word vector dictionary. It should be noted that if a word is not in the word vector dictionary, its word vector may be set to an all-zero vector. Taking the tf-idf value of each word in the standard text as the weight of that word's vector, the weighted sum of the word vectors yields the vector corresponding to the standard text. Vectors corresponding to the similar texts can be obtained in the same way.
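A sketch of this vectorization, using a simplified tf-idf formulation; the exact tf-idf variant, its smoothing, and the word-vector dictionary are assumptions for illustration:

```python
import math
from collections import Counter

def tfidf_values(words, doc_freq, n_docs):
    """Simplified tf-idf per distinct word: term frequency within the
    text times a smoothed inverse document frequency."""
    counts = Counter(words)
    return {
        w: (counts[w] / len(words)) * math.log(n_docs / (1 + doc_freq.get(w, 0)))
        for w in counts
    }

def text_to_vector(words, word_vectors, tfidf, dims):
    """Weighted sum of word vectors, with tf-idf values as the weights.
    A word missing from the dictionary contributes an all-zero vector."""
    result = [0.0] * dims
    for w in words:
        vector = word_vectors.get(w, [0.0] * dims)
        for i, component in enumerate(vector):
            result[i] += tfidf.get(w, 0.0) * component
    return result

# Toy corpus statistics and a one-word dictionary for illustration:
words = ["pay", "order"]
weights = tfidf_values(words, doc_freq={"pay": 1}, n_docs=10)
vec = text_to_vector(words, {"pay": [1.0, 0.0]}, weights, dims=2)
```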
The distance (e.g., the cosine distance) between the standard text's vector and a similar text's vector is then calculated; this cosine distance can serve as a characterization of the similarity between the standard text and the similar text.
In addition, the edit distance between the standard text and the similar text can be obtained, and the cosine distance and the edit distance combined to determine the similarity characterization between the standard text and the similar text. For example, a weighted sum of the cosine distance and the edit distance may be calculated as the similarity characterization.
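A sketch of combining the two signals; the weights and the normalization of the edit distance into a [0, 1] similarity are assumptions, since the disclosure only states that a weighted sum may be calculated:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def edit_distance(s, t):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (cs != ct))) # substitution
        prev = curr
    return prev[-1]

def combined_similarity(vec_a, vec_b, text_a, text_b, w_cos=0.7, w_edit=0.3):
    """Weighted sum of cosine similarity and a normalized edit similarity;
    the 0.7/0.3 weights are arbitrary illustration values."""
    edit_sim = 1 - edit_distance(text_a, text_b) / max(len(text_a), len(text_b), 1)
    return w_cos * cosine_similarity(vec_a, vec_b) + w_edit * edit_sim
```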
In addition, in a scenario where a user interacts with the customer service system, the customer service system can display a cold start configuration interface. The customer service system can then retrieve a number of similar questions corresponding to a standard question entered into the configuration interface and, during the cold start process, associate at least some of the retrieved similar questions with the standard answer corresponding to that standard question.
The cold start of the customer service system refers to a process of storing a plurality of similar questions for each standard question before the customer service system provides service for a user.
In some embodiments, the customer service system may present the retrieved at least one similar question via the cold start configuration interface. The customer service system may determine the selected similar questions in response to a selection signal for the presented similar questions, and associate the selected similar questions with the standard answer corresponding to the standard question.
In some embodiments, the customer service system may calculate the similarity of each retrieved similar question to the standard question, and then display the retrieved similar questions as a list in descending order of similarity.
FIG. 4 illustratively provides a cold start configuration interface. When the customer service system needs to perform a cold start, the cold start configuration interface shown in fig. 4 can be displayed to an administrator. The interface provides a function of searching for several similar questions based on a standard question; the search routes may include a model generation route and a search engine route, and may also include a back-translation route across language versions. The administrator may select the appropriate similar questions and click the corresponding control to associate the selected similar questions with the standard question.
Fig. 5 exemplarily provides an apparatus for acquiring similar texts, including:
a standard text acquisition module 501 for acquiring a standard text;
the text feature determining module 502 determines text features of the standard text, including: mapping the standard text into a vector set; performing semantic role labeling on each word in the standard text, and mapping a labeling result into a vector set; determining the text characteristics of the standard text according to the two vector sets obtained by mapping;
the similar text obtaining module 503 inputs the text features of the standard text into a similar text generation model, and outputs at least one similar text.
In some embodiments, the similar text generation model is constructed using a SimBERT algorithm;
or the similar text generation model is constructed by a multi-head attention mechanism algorithm;
or the similar text generation model is constructed by adopting a recurrent neural network algorithm.
In some embodiments, the text feature determining module 502, if the standard text includes at least two sentences having independent semantic structures, for each sentence having an independent semantic structure, maps a portion of the annotation result corresponding to the sentence into a vector corresponding to the sentence; and forming vector sets by vectors respectively corresponding to the statements with the independent semantic structures, or combining the vectors respectively corresponding to the statements with the independent semantic structures into one vector.
In some embodiments, the text feature determining module 502 maps, for each sentence with an independent semantic structure included in the standard text, the sentence including N words, a part of the labeling result corresponding to the sentence into an N-dimensional vector; and the dimensions in the N-dimensional vector correspond to the words in the sentence one by one, and the value of any dimension is determined based on the semantic role of the word corresponding to the dimension.
In some embodiments, the text feature determining module 502 combines the two vector sets obtained by mapping into a new vector set, which is used as the text feature of the standard text.
The similar text generation model is trained by the following method:
acquiring a training sample set, wherein each training sample comprises a first type of text and a second type of text; the first type of text and the second type of text included in the same training sample have the same content meaning;
for each training sample, determining text features of each text included in the training sample, including: mapping the text into a set of vectors; performing semantic role labeling on each word in the text, and mapping a labeling result into a vector set; determining the text characteristics of the text according to the two vector sets obtained by mapping;
and training a similar text generation model by taking the text features of the first type of text included in the training sample as model input and the text features of the second type of text included in the training sample as model output.
In some embodiments, the similar text acquiring module 503 invokes a first translation tool to translate the standard text of the first language version into the text of the second language version;
and calling a second translation tool to translate the text of the second language version back to the text of the first language version as similar text.
In some embodiments, the similar text obtaining module 503 takes the standard text as a search object, and invokes a search engine to perform a search; and taking a plurality of search results which are specified by the search engine and have similar meanings with the search object as similar texts.
In some embodiments, the apparatus is applied to a customer service system, and the apparatus further comprises:
a cold start module 504 that presents a cold start configuration interface; acquiring a plurality of similar problems corresponding to the standard problems input into the configuration interface; and during the cold starting process, associating at least part of the acquired similar questions with the standard answers corresponding to the standard questions.
In some embodiments, the cold start module 504, via the configuration interface, presents the retrieved at least one similar question; in response to a selection signal for the presented similar text, determining a selected similar question; associating at least part of the obtained similar questions with the standard answers corresponding to the standard questions, wherein the method comprises the following steps:
and associating the selected similar questions with the standard answers corresponding to the standard question texts.
In some embodiments, the cold start module 504 calculates the similarity between each retrieved similar question and the standard question, and displays the retrieved similar questions as a list in descending order of similarity.
It should be noted that although in the above detailed description several units/modules or sub-units/sub-modules of the apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into and embodied by a plurality of units/modules.
Fig. 6 is a schematic diagram of a computer-readable storage medium 140 provided by the present disclosure. A computer program is stored on the medium 140, and when the computer program is executed by a processor, the method for acquiring similar texts of any embodiment of the present disclosure is implemented.
The present disclosure also provides a computing device comprising a memory, a processor; the memory is used for storing computer instructions executable on the processor, and the processor is used for realizing the method for acquiring similar texts of any embodiment of the disclosure when executing the computer instructions.
Fig. 7 is a schematic structural diagram of a computing device provided by the present disclosure, and as shown in fig. 7, the computing device 15 may include, but is not limited to: a processor 151, a memory 152, and a bus 153 that connects the various system components, including the memory 152 and the processor 151.
Wherein the memory 152 stores computer instructions executable by the processor 151 to enable the processor 151 to perform a method of acquiring similar texts according to any of the embodiments of the present disclosure. The memory 152 may include a random access memory unit RAM 1521, a cache memory unit 1522, and/or a read only memory unit ROM 1523. The memory 152 may further include: a program tool 1525 having a set of program modules 1524, the program modules 1524 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, one or more combinations of which may comprise an implementation of a network environment.
The bus 153 may include, for example, a data bus, an address bus, a control bus, and the like. The computing device 15 may also communicate with an external device 155 through the I/O interface 154; the external device 155 may be, for example, a keyboard, a Bluetooth device, etc. The computing device 15 may also communicate with one or more networks, which may be, for example, local area networks, wide area networks, public networks, etc., through the network adapter 156. The network adapter 156 may also communicate with other modules of the computing device 15 via the bus 153, as shown in FIG. 7.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed; the division into aspects is for convenience of presentation only and does not mean that features in these aspects cannot be combined to benefit. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of obtaining similar text, comprising:
acquiring a standard text;
determining text features of the standard text, including: mapping the standard text into a vector set; performing semantic role labeling on each word in the standard text, and mapping a labeling result into a vector set; determining the text characteristics of the standard text according to the two vector sets obtained by mapping;
inputting the text features of the standard text into a similar text generation model, and outputting at least one similar text.
2. The method of claim 1, wherein the similar text generation model is constructed using a SimBERT algorithm;
or the similar text generation model is constructed by a multi-head attention mechanism algorithm;
or the similar text generation model is constructed by adopting a recurrent neural network algorithm.
3. The method of claim 1, mapping the annotation result to a set of vectors, comprising:
if the standard text comprises at least two sentences with independent semantic structures, mapping the part corresponding to the sentence in the labeling result into a vector corresponding to the sentence aiming at each sentence with an independent semantic structure;
and forming vector sets by vectors respectively corresponding to the statements with the independent semantic structures, or combining the vectors respectively corresponding to the statements with the independent semantic structures into one vector.
4. The method of any of claims 1-3, mapping the annotated results to a set of vectors, comprising:
for each statement with an independent semantic structure included in the standard text, wherein the statement comprises N words, and mapping a part corresponding to the statement in the labeling result into an N-dimensional vector; and the dimensions in the N-dimensional vector correspond to the words in the sentence one by one, and the value of any dimension is determined based on the semantic role of the word corresponding to the dimension.
5. The method of claim 1, determining the text feature of the standard text according to the two vector sets obtained by mapping, comprising:
and forming a new vector set by the two vector sets obtained by mapping, wherein the new vector set is used as the text characteristic of the standard text.
6. The method of claim 1, wherein the similar text generation model is trained by:
acquiring a training sample set, wherein each training sample comprises a first type of text and a second type of text; the first type of text and the second type of text included in the same training sample have the same content meaning;
for each training sample, determining text features of each text included in the training sample, including: mapping the text into a set of vectors; performing semantic role labeling on each word in the text, and mapping a labeling result into a vector set; determining the text characteristics of the text according to the two vector sets obtained by mapping;
and training a similar text generation model by taking the text features of the first type of text included in the training sample as model input and the text features of the second type of text included in the training sample as model output.
7. The method of claim 1, further comprising:
calling a first translation tool to translate the standard text of the first language version into a text of a second language version;
and calling a second translation tool to translate the text of the second language version back to the text of the first language version as similar text.
8. An apparatus for obtaining similar text, comprising:
the standard text acquisition module is used for acquiring a standard text;
the text characteristic determining module is used for determining the text characteristic of the standard text and comprises the following steps: mapping the standard text into a vector set; performing semantic role labeling on each word in the standard text, and mapping a labeling result into a vector set; determining the text characteristics of the standard text according to the two vector sets obtained by mapping;
and the similar text acquisition module inputs the text characteristics of the standard text into a similar text generation model and outputs at least one similar text.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
10. A computing device comprising a memory, a processor; the memory is for storing computer instructions executable on the processor for implementing the method of any one of claims 1 to 7 when the computer instructions are executed.
CN202110871649.0A 2021-07-30 2021-07-30 Method, medium, device and computing equipment for acquiring similar texts Pending CN113535927A (en)

Publications (1)

Publication Number Publication Date
CN113535927A true CN113535927A (en) 2021-10-22



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination