CN113139040A - Method, system, electronic device and storage medium for generating similarity problem based on text similarity algorithm - Google Patents

Publication number
CN113139040A
Authority
CN
China
Prior art keywords
question
text
similarity
answer
answer pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110367263.6A
Other languages
Chinese (zh)
Other versions
CN113139040B (en
Inventor
嵇望
王伟凯
钱艳
朱鹏飞
安毫亿
梁青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yuanchuan New Technology Co ltd
Original Assignee
Hangzhou Yuanchuan New Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yuanchuan New Technology Co ltd filed Critical Hangzhou Yuanchuan New Technology Co ltd
Priority to CN202110367263.6A priority Critical patent/CN113139040B/en
Publication of CN113139040A publication Critical patent/CN113139040A/en
Application granted granted Critical
Publication of CN113139040B publication Critical patent/CN113139040B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

The application relates to a method, a system, an electronic device and a storage medium for generating similar questions based on a text similarity algorithm. The method comprises: acquiring interactive scene text data and generating question-answer pair text; calculating the text similarity between the answer text in the question-answer pair text and the answer texts in the industry question-answer pairs, and determining the industry question-answer pair with the maximum answer similarity; calculating the text similarity between the question text in the question-answer pair text and the question text in that industry question-answer pair, and determining the maximum question text similarity; and comparing the maximum question text similarity with a preset threshold and, if the threshold is met, supplementing the corresponding question text from the question-answer pair text into the corresponding industry question-answer pair corpus as a similar question of that industry question text. The method and device address the problem that similar question sentences generated in the related art are disfluent and contain redundant content.

Description

Method, system, electronic device and storage medium for generating similarity problem based on text similarity algorithm
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, a system, an electronic device, and a storage medium for generating similar questions based on a text similarity algorithm.
Background
In the intelligent interaction process, the text intention of the user needs to be recognized, and then a corresponding process is triggered.
Current intelligent customer service robots mainly use machine learning algorithms to identify intentions. These algorithms must be trained on a large number of similar corpora, so a large amount of labeled data has to be prepared at the initial construction stage of an intelligent customer service robot. At present, training data is generally produced through manual labeling by service personnel, which takes a long time and is costly, so automatically acquiring relevant similar corpus data at the initial stage of robot construction is particularly critical.
To solve the above problems, in the prior art, Chinese patent application CN201810749005.2 discloses a method and an apparatus for automatically generating FAQ similar question sentences, the method comprising: generating a text based on a selected FAQ; judging whether the generated text is similar to the selected FAQ; and if so, taking the text as a similar question sentence of the selected FAQ. Although this method can generate similar question sentences automatically, it generates them from sentence generation rules, which are inconvenient to maintain; similar question sentences output according to rules may also contain grammatical errors and cannot be used directly as training data.
Another Chinese patent application, CN201811029233.9, discloses a question-answer pair construction method, device and computer-readable storage medium, the method comprising: acquiring a conversation record between a human customer service agent and a user, and processing the conversation record based on a preset rule to obtain a target conversation record; determining standard-form question-answer pairs based on the target conversation record, and filtering them to obtain target question-answer pairs; and merging the target question-answer pairs and outputting the merged pairs for an administrator to check. When constructing question-answer pairs, that patent judges whether a sentence contains question words, but the queries raised by users in real scenarios do not necessarily contain question words, so the output corpus depends on how well the question words are maintained. Target question-answer pairs are merged only when their answers are identical, so the large number of pairs with merely similar answers still requires manual review, which is inefficient. In addition, when constructing question-answer pairs, the continuous interactive texts of the user role and of the customer service role are each merged, so the similar questions finally generated are long and may contain much text irrelevant to the question, which degrades the final model training effect.
At present, no effective solution has been proposed for the problem that similar question sentences generated in the related art are disfluent and contain redundant content.
Disclosure of Invention
The embodiments of the application provide a method, a system, an electronic device and a storage medium for generating similar questions based on a text similarity algorithm, so as to at least solve the problem that similar question sentences generated in the related art are disfluent and contain redundant content.
In a first aspect, an embodiment of the present application provides a method for generating similar questions based on a text similarity algorithm, where the method includes:
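acquiring interactive scene text data and generating question-answer pair text data;
calculating the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and determining the industry question-answer pair corpus corresponding to the maximum answer text similarity;
calculating the text similarity of the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, and determining the question-answer pair text data corresponding to the maximum question text similarity;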
and comparing the maximum similarity of the question texts with a preset threshold, and if the maximum similarity of the question texts meets the preset threshold, supplementing the question texts in the question-answer pair text data corresponding to the maximum similarity of the question texts into the corresponding industry question-answer pair corpus to serve as the similar questions of the question texts in the corresponding industry question-answer pair corpus.
In some embodiments, the obtaining interactive scene text data and generating question-answer pair text data includes:
acquiring interactive scene text data, and splicing continuous interactive texts in the interactive scene text data according to a time sequence;
and grouping the spliced interactive scene text data according to interactive turns, and combining the user text and the customer service text according to the sequence under each interactive turn to obtain question-answer pair text data.
In some embodiments, before the temporally splicing the consecutive interactive texts in the interactive scene text data, the method includes:
and performing data cleaning on the interactive scene data, and removing interactive scene text data with the number of interactive rounds being larger than a second preset threshold value.
In some embodiments, the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and the text similarity between the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, are each calculated through a text similarity algorithm, wherein the text similarity algorithm is any one of cosine similarity, Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance.
In some embodiments, the calculating the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus and determining the industry question-answer pair corpus corresponding to the maximum answer text similarity includes:
and calculating the text similarity of the answer text of each interactive turn in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and determining the industry question-answer pair corpus corresponding to the maximum value of the answer text similarity of each interactive turn.
In some embodiments, the calculating the text similarity of the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity and determining the question-answer pair text data corresponding to the maximum question text similarity includes:
and respectively calculating the text similarity of the question text of each interactive turn in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum value of the answer text similarity of each interactive turn, and determining the question-answer pair text data corresponding to the maximum value of the question text similarity of each interactive turn.
In some embodiments, the comparing the maximum similarity of the question texts with a preset threshold includes:
and comparing the maximum value of the similarity of the question texts of each interaction turn with a preset threshold value respectively.
In a second aspect, an embodiment of the present application provides a system for generating a similarity problem based on a text similarity algorithm, where the system includes: the system comprises a data acquisition module, an answer text similarity calculation module, a question text similarity calculation module and a similar question extraction module:
the data acquisition module is used for acquiring interactive scene text data and generating question-answer pair text data;
the answer text similarity calculation module is used for calculating the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus and determining the industry question-answer pair corpus corresponding to the maximum answer text similarity;
the question text similarity calculation module is used for calculating the text similarity of the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, and determining the question-answer pair text data corresponding to the maximum question text similarity;
and the similar question extraction module is used for comparing the maximum similarity of the question texts with a preset threshold, and if the preset threshold is met, supplementing the question texts in the question-answer pair text data corresponding to the maximum similarity of the question texts into the corresponding industry question-answer pair corpus to serve as the similar questions of the question texts in the corresponding industry question-answer pair corpus.
In a third aspect, an embodiment of the present application provides an electronic apparatus, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the method for generating a similarity problem based on a text similarity algorithm according to the first aspect.
In a fourth aspect, embodiments of the present application provide a storage medium, on which a computer program is stored, which when executed by a processor, implements the method for generating a similarity problem based on a text similarity algorithm as described in the first aspect above.
Compared with the related art, the method for generating similar questions based on the text similarity algorithm extracts similar questions from the interactive scene corpus in combination with the industry question-answer corpus, which avoids the disfluent sentences that arise when similar questions are generated. Because the similar questions are extracted from a real interactive scene corpus, they are more comprehensive and better match real interactive scenes. In addition, similar questions are extracted and integrated from question texts with similar answers by combining text similarity with a threshold, so that the similar questions finally obtained do not contain redundant content.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart of a method for generating similar questions based on a text similarity algorithm according to an embodiment of the present application;
FIG. 2 is a block diagram of a system for generating similar questions based on a text similarity algorithm according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Reference herein to "a plurality" means greater than or equal to two. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The present embodiment provides a method for generating similar questions based on a text similarity algorithm. FIG. 1 is a flowchart of the method according to an embodiment of the present application. As shown in FIG. 1, the flow includes the following steps:
Step S101, acquiring interactive scene text data and generating question-answer pair text data. In this embodiment, interactive scene text data accumulated in real human customer service interactions is obtained; the data is labeled in advance as user text or customer service text and is sorted in time order. Question-answer pairs are extracted from the interactive scene text data to form the question-answer pair text data.
Step S102, calculating the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and determining the industry question-answer pair corpus corresponding to the maximum value of the answer text similarity. In this embodiment, the text similarity between the answer text in the question-answer pair text data and each answer text in the industry question-answer pair corpus is calculated in turn, the answer text with the maximum similarity is determined in the industry question-answer pair corpus, and the question text corresponding to that answer text is then determined from the industry question-answer pair corpus, thereby determining the industry question-answer pair corpus corresponding to the maximum answer text similarity. The industry question-answer data can be obtained from an industry customer service technical database, which is a pre-constructed database storing customer service technical data, including questions and the answers corresponding to them. If the answer text in the question-answer pair text data already exists in the industry question-answer pair corpus, step S103 is executed directly, without calculating which answer text in the industry question-answer pair corpus is most similar to it.
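Step S102, including the exact-match shortcut, might be sketched as follows. This is only an illustration: the function names are hypothetical, and `difflib.SequenceMatcher` is used purely as a stand-in similarity measure, not as the algorithm of the embodiment.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in similarity in [0, 1]; an assumption, not the embodiment's algorithm.
    return SequenceMatcher(None, a, b).ratio()


def best_industry_pair(answer, industry_pairs):
    """Return (index, similarity) of the industry pair whose answer best matches `answer`.

    `industry_pairs` is a non-empty list of (question, answer) tuples.
    """
    # Shortcut: an identical answer already in the corpus needs no similarity search.
    for idx, (_, industry_answer) in enumerate(industry_pairs):
        if industry_answer == answer:
            return idx, 1.0
    # Otherwise score every industry answer and keep the maximum.
    scores = [similarity(answer, a) for _, a in industry_pairs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The returned index identifies the industry question-answer pair whose question text is then used in step S103.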
Step S103, calculating the text similarity of the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum value of the similarity of the answer text, and determining the question-answer pair text data corresponding to the maximum value of the similarity of the question text. In this embodiment, after the industry question-answer corpus most similar to the answer text in the question-answer pair text data is determined, the similarity between the question text in the industry question-answer corpus most similar to the answer text and the question text in the question-answer pair text data is calculated, so that the question text most similar to the question text in the industry question-answer pair corpus is determined from the question-answer pair text data.
Step S104, comparing the maximum similarity of the question texts with a preset threshold, and if the maximum similarity of the question texts meets the preset threshold, supplementing the question text in the question-answer pair text data corresponding to the maximum similarity into the corresponding industry question-answer pair corpus to serve as a similar question of the question text in that corpus. In this embodiment, the maximum similarity value calculated in step S103 is compared with the preset threshold, and if it is greater than or equal to the preset threshold, the question text in the question-answer pair text data corresponding to the maximum similarity value is supplemented into the corresponding industry question-answer pair corpus as a similar question of the question text in that corpus.
The value of the preset threshold may be determined according to actual conditions, and is not limited herein. Illustratively, the value of the preset threshold may be determined according to a similarity evaluation index, where the similarity evaluation index includes, but is not limited to, the silhouette coefficient, the Rand index, the adjusted Rand index, and the like.
In other embodiments, after step S104, the supplemented similar questions may be reviewed.
Through steps S101 to S104, this embodiment extracts similar questions from the interactive scene corpus produced in real interactions by calculating and comparing the text similarity of the answer texts and of the question texts, which avoids disfluent sentences in the generated similar question text. Because similar questions are extracted and integrated from question texts with similar answers on the basis of text similarity and a threshold judgment, the similar questions finally obtained do not contain redundant content. The whole process requires neither text generation rules nor a machine learning model for generating similar questions, which reduces the difficulty of generating them.
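Steps S102 to S104 can be written compactly as follows. The symbols are notational assumptions introduced here for illustration: within one interaction turn, $u_i$ are the candidate question texts and $k$ is the answer text; $(Q_j, A_j)$ are the industry question-answer pairs; $\operatorname{sim}$ is the chosen text similarity; and $\tau$ is the preset threshold.

```latex
\begin{aligned}
j^{*} &= \arg\max_{j}\ \operatorname{sim}(k, A_j) && \text{(S102)}\\
i^{*} &= \arg\max_{i}\ \operatorname{sim}(u_i, Q_{j^{*}}), \qquad
s^{*} = \operatorname{sim}(u_{i^{*}}, Q_{j^{*}}) && \text{(S103)}\\
s^{*} \ge \tau \ &\Longrightarrow\ u_{i^{*}} \text{ is added as a similar question of } Q_{j^{*}} && \text{(S104)}
\end{aligned}
```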
In some embodiments, obtaining interactive scene text data and generating question-answer pair text data includes:
acquiring interactive scene text data, performing data cleaning on the interactive scene data, and removing the interactive scene text data with the number of interactive rounds larger than a second preset threshold value to obtain processed interactive scene text data;
splicing continuous interactive texts in the processed interactive scene text data according to a time sequence;
and grouping the spliced interactive scene text data according to interactive turns, and combining the user text and the customer service text according to the sequence under each interactive turn to obtain question-answer pair text data.
The continuous interactive text refers to text content produced by the current role without interruption by another role. The value of the second preset threshold can be determined according to the actual situation, so that the number of interaction turns allowed in the interactive scene text data can be set as required. It should be noted that the time-ordered splicing of continuous interactive texts is mainly performed for the customer service texts; the user texts may or may not be spliced. The number of interaction turns is the total number of dialogue rounds, and an interaction turn is the ordinal position of a given round within the multi-round dialogue.
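A minimal sketch of the cleaning, splicing and grouping described above is given below. The role labels, the function names and the use of a space as the splice separator are assumptions made for illustration; the source does not specify them.

```python
from itertools import groupby

USER, SERVICE = "user", "service"  # hypothetical role labels


def group_turns(messages):
    """Split a time-ordered list of (role, text) into interaction turns.

    Each turn is (user_texts, service_text): consecutive user messages are kept
    as a list, and the consecutive customer-service messages that follow them
    are spliced into one string.
    """
    # Collapse the dialogue into runs of consecutive messages by the same role.
    runs = [(role, [text for _, text in grp])
            for role, grp in groupby(messages, key=lambda m: m[0])]
    turns = []
    for (role_a, texts_a), (role_b, texts_b) in zip(runs, runs[1:]):
        if role_a == USER and role_b == SERVICE:
            turns.append((texts_a, " ".join(texts_b)))
    return turns


def clean(dialogues, max_turns):
    """Drop dialogues whose number of interaction turns exceeds max_turns."""
    return [d for d in dialogues if len(group_turns(d)) <= max_turns]
```

Here `max_turns` plays the role of the second preset threshold.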
In some embodiments, after the question-answer pair text data is generated, for each interaction turn the text similarity between the answer text in the question-answer pair text data and each answer text in the industry question-answer pair corpus is calculated, and the industry question-answer pair with the maximum answer text similarity for that turn is determined. The text similarity between the question text of that industry question-answer pair and each question text of the question-answer pair text data in the same interaction turn is then calculated, giving the maximum question text similarity for each interaction turn. This maximum is compared with the preset threshold, and if it is greater than or equal to the preset threshold, the question text in the question-answer pair text data corresponding to the maximum is supplemented into the corresponding industry question-answer pair corpus as a similar question of the corresponding question text.
In some embodiments, the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and the text similarity between the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, are calculated through a text similarity algorithm. The text similarity algorithm includes, but is not limited to, cosine similarity, Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance; any algorithm for measuring the distance between two vectors can be used in this embodiment to calculate the text similarity.
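The measures named above could be sketched as below. The bag-of-words vectorization on whitespace tokens and the conversion from a distance to a similarity score are assumptions for illustration only; the source does not say how texts are turned into vectors, and Chinese text would need a word segmenter rather than `str.split`.

```python
import math
from collections import Counter


def vectorize(a, b):
    """Bag-of-words count vectors over the shared vocabulary (an assumed representation)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]


def cosine_similarity(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def minkowski_distance(x, y, p=2.0):
    """p=1 is Manhattan, p=2 is Euclidean, p=math.inf is Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def distance_to_similarity(d):
    # One common convention (an assumption here): distance 0 maps to 1, large distances tend to 0.
    return 1.0 / (1.0 + d)
```

Cosine similarity is already a similarity score, whereas the distance measures grow as texts differ, so some monotone conversion such as `distance_to_similarity` is needed before a "maximum similarity" or a threshold comparison makes sense.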
Specifically, the procedure of the method for generating a similarity problem based on the text similarity algorithm according to the present embodiment is illustrated here:
Suppose that one piece of interactive scene text data is:
User: U1
Customer service: K1
User: U2
User: U3
Customer service: K2
The number of interaction turns of this interactive scene text data is 2, and the generated question-answer pair text data is [U1, K1], [U2, K2], [U3, K2], [U2+U3, K2], where U2+U3 denotes the concatenation of the U2 text and the U3 text. In the [X, Y] notation, X is the question text raised by the user and Y is the answer text given by customer service.
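The pair generation in this example can be reproduced with a short sketch. The function name is hypothetical, and joining with "+" only mirrors the notation used here; a real system would concatenate the actual texts.

```python
def qa_pairs_for_turn(user_texts, service_text):
    """Return the candidate [question, answer] pairs for one interaction turn."""
    pairs = [[u, service_text] for u in user_texts]
    if len(user_texts) > 1:
        # Also offer the spliced user text as a candidate question.
        pairs.append(["+".join(user_texts), service_text])
    return pairs


# Turn 1: user U1, customer service K1. Turn 2: users U2 and U3, customer service K2.
turns = [(["U1"], "K1"), (["U2", "U3"], "K2")]
pairs = [p for user_texts, k in turns for p in qa_pairs_for_turn(user_texts, k)]
print(pairs)
# [['U1', 'K1'], ['U2', 'K2'], ['U3', 'K2'], ['U2+U3', 'K2']]
```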
Suppose there are two question-answer pairs [Q1, A1] and [Q2, A2] in the industry question-answer pair corpus. In the 2nd interaction turn, the calculated text similarities between K2 and A1 and between K2 and A2 are 0.5 and 0.9 respectively, so the industry question-answer pair with the maximum answer text similarity to K2 is [Q2, A2]. The text similarities of U2, U3 and U2+U3 to Q2 are then calculated to be 0.9, 0.5 and 0.7 respectively. The maximum question text similarity in this interaction turn is 0.9; if the threshold is 0.8, U2 is supplemented as a similar question of the question text Q2 of the corresponding industry question-answer pair [Q2, A2].
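The second-turn example can be traced in code. In this sketch the similarity values are simply copied from the example rather than computed, the industry answer texts are written A1 and A2, and the variable names are illustrative only.

```python
# Similarity values copied from the example; a real system would compute them.
ANSWER_SIM = {("K2", "A1"): 0.5, ("K2", "A2"): 0.9}
QUESTION_SIM = {("U2", "Q2"): 0.9, ("U3", "Q2"): 0.5, ("U2+U3", "Q2"): 0.7}

industry = {"Q1": "A1", "Q2": "A2"}   # industry question -> industry answer
candidates = ["U2", "U3", "U2+U3"]    # question texts of interaction turn 2
answer, threshold = "K2", 0.8

# S102: industry pair whose answer is most similar to K2.
best_q = max(industry, key=lambda q: ANSWER_SIM[(answer, industry[q])])
# S103: candidate question most similar to that pair's question.
best_u = max(candidates, key=lambda u: QUESTION_SIM[(u, best_q)])
best_score = QUESTION_SIM[(best_u, best_q)]
# S104: keep the candidate only if it meets the threshold.
similar_questions = {q: [] for q in industry}
if best_score >= threshold:
    similar_questions[best_q].append(best_u)

print(best_q, best_u, best_score, similar_questions)
# Q2 U2 0.9 {'Q1': [], 'Q2': ['U2']}
```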
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The present embodiment further provides a system for generating a similar problem based on a text similarity algorithm, where the system is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted here for brevity. As used hereinafter, the terms "module," "unit," "subunit," and the like may implement a combination of software and/or hardware for a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 2 is a block diagram of a similar problem generating system based on a text similarity algorithm according to an embodiment of the present application, and as shown in fig. 2, the system includes a data obtaining module 21, an answer text similarity calculating module 22, a question text similarity calculating module 23, and a similar problem extracting module 24:
the data acquisition module 21 is configured to acquire interactive scene text data and generate question-answer pair text data; the answer text similarity calculation module 22 is configured to calculate the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and determine the industry question-answer pair corpus corresponding to the maximum answer text similarity; the question text similarity calculation module 23 is configured to calculate the text similarity between the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, and determine the question-answer pair text data corresponding to the maximum question text similarity; the similar question extraction module 24 is configured to compare the maximum question text similarity with a preset threshold, and if the preset threshold is met, supplement the question text in the question-answer pair text data corresponding to the maximum question text similarity into the corresponding industry question-answer pair corpus to serve as a similar question of the question text in that corpus. The system thereby addresses the problem that similar question sentences generated in the prior art are disfluent and contain redundant content, improves how well the similar questions fit real interactive scenes, and reduces the difficulty of generating similar questions.
In some embodiments, the data acquisition module 21 is further configured to: perform data cleaning on the interactive scene text data to remove interactive scene text data whose number of interactive rounds is greater than a second preset threshold; splice continuous interactive texts in the cleaned interactive scene text data in time order; group the spliced interactive scene text data by interactive round; and, within each interactive round, combine the user text and the customer service text in order to obtain the question-answer pair text data.
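By way of illustration only, the cleaning, splicing and grouping described above can be sketched as follows. The `Message` fields, the `"user"`/`"agent"` speaker labels, the space used as a splice separator and the default `max_rounds` value are all hypothetical and are not specified in the present application. For simplicity the sketch counts rounds after splicing, which filters out the same sessions.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Message:
    timestamp: float  # assumed field: time the message was sent
    speaker: str      # assumed labels: "user" or "agent" (customer service)
    text: str


def build_qa_pairs(session, max_rounds=10):
    """Turn one chat session into a list of (question, answer) pairs."""
    # Order by time, then splice consecutive messages from the same speaker.
    ordered = sorted(session, key=lambda m: m.timestamp)
    spliced = [
        (speaker, " ".join(m.text for m in run))
        for speaker, run in groupby(ordered, key=lambda m: m.speaker)
    ]
    # Drop leading agent text so that every round starts with user text.
    while spliced and spliced[0][0] != "user":
        spliced.pop(0)
    # After splicing, speakers alternate: pair each user block with the
    # agent block that follows it (an unanswered trailing block is dropped).
    pairs = [
        (spliced[i][1], spliced[i + 1][1])
        for i in range(0, len(spliced) - 1, 2)
    ]
    # Data cleaning: discard sessions with more rounds than the threshold.
    if len(pairs) > max_rounds:
        return []
    return pairs
```

Each returned pair corresponds to one interactive round, with the user text as the question and the customer service text as the answer.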
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented by hardware, the modules may all be located in the same processor, or may be distributed across different processors in any combination.
The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, both of which are connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1: acquire interactive scene text data and generate question-answer pair text data.
S2: calculate the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and determine the industry question-answer pair corpus corresponding to the maximum answer text similarity.
S3: calculate the text similarity between the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, and determine the question-answer pair text data corresponding to the maximum question text similarity.
S4: compare the maximum question text similarity with a preset threshold, and if the preset threshold is met, supplement the question text in the question-answer pair text data corresponding to the maximum question text similarity into the corresponding industry question-answer pair corpus to serve as a similar question of the question text in that corpus.
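As a non-limiting sketch, steps S2 to S4 can be read as a two-stage match: answers first, then questions. The code below is one possible interpretation and not the implementation of the present application. The corpus layout (a list of dictionaries with `"questions"` and `"answer"` keys), the whitespace tokenization, the `>=` comparison and the default `threshold` are all assumptions; Chinese text would need a proper word segmenter.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two texts over whitespace-token counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def add_similar_questions(qa_pairs, corpus, threshold=0.5):
    """Extend corpus entries in place with similar questions from chat data.

    corpus: list of {"questions": [standard, ...], "answer": str} (assumed).
    """
    if not corpus:
        return corpus
    best = {}  # corpus index -> (score, question) of the best candidate
    for question, answer in qa_pairs:
        # S2: find the corpus entry whose answer is closest to this answer.
        idx = max(
            range(len(corpus)),
            key=lambda i: cosine(answer, corpus[i]["answer"]),
        )
        # S3: score the chat question against that entry's standard question
        # and remember the highest-scoring chat question per entry.
        score = cosine(question, corpus[idx]["questions"][0])
        if idx not in best or score > best[idx][0]:
            best[idx] = (score, question)
    # S4: keep the best candidate only if it meets the threshold.
    for idx, (score, question) in best.items():
        if score >= threshold and question not in corpus[idx]["questions"]:
            corpus[idx]["questions"].append(question)
    return corpus
```

Matching on answers first narrows each chat question to a single corpus entry, so the question comparison only has to decide whether the wording is close enough to be kept as a similar question.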
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the method for generating a similarity problem based on the text similarity algorithm in the foregoing embodiments, the embodiments of the present application may provide a storage medium for implementing the method. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements the method for generating a similarity problem based on a text similarity algorithm of any of the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of generating a similarity problem based on a text similarity algorithm. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen; a key, a trackball, or a touch pad arranged on the housing of the computer device; or an external keyboard, touch pad, or mouse.
In one embodiment, an electronic device is provided, which may be a server. Fig. 3 is a schematic diagram of the internal structure of the electronic device according to an embodiment of the present application. As shown in Fig. 3, the electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor provides computing and control capabilities; the network interface is used for communicating with an external terminal through a network connection; the internal memory provides an environment for running the operating system and the computer program; the computer program, when executed by the processor, implements a similar problem generation method based on a text similarity algorithm; and the database is used for storing data.
Those skilled in the art will appreciate that the structure shown in Fig. 3 is a block diagram of only the part of the structure relevant to the present application and does not limit the electronic devices to which the present application may be applied; a particular electronic device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for generating a similarity problem based on a text similarity algorithm is characterized by comprising the following steps:
acquiring interactive scene text data and generating question-answer pair text data;
calculating the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and determining the industry question-answer pair corpus corresponding to the maximum value of the answer text similarity;
calculating the text similarity of the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, and determining the question-answer pair text data corresponding to the maximum question text similarity;
and comparing the maximum similarity of the question texts with a preset threshold, and if the maximum similarity of the question texts meets the preset threshold, supplementing the question texts in the question-answer pair text data corresponding to the maximum similarity of the question texts into the corresponding industry question-answer pair corpus to serve as the similar questions of the question texts in the corresponding industry question-answer pair corpus.
2. The method of claim 1, wherein the obtaining interactive scene text data and generating question-answer pair text data comprises:
acquiring interactive scene text data, and splicing continuous interactive texts in the interactive scene text data according to a time sequence;
and grouping the spliced interactive scene text data according to interactive turns, and combining the user text and the customer service text according to the sequence under each interactive turn to obtain question-answer pair text data.
3. The method of claim 2, wherein before temporally splicing the successive interactive texts in the interactive scene text data, the method comprises:
and performing data cleaning on the interactive scene data, and removing interactive scene text data with the number of interactive rounds being larger than a second preset threshold value.
4. The method according to claim 1, wherein the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus and the text similarity between the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum value of the answer text similarity are calculated by a text similarity algorithm; wherein the text similarity algorithm is any one of cosine similarity, Euclidean distance, Manhattan distance, Minkowski distance, and Chebyshev distance.
5. The method according to claim 2, wherein the calculating the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus and determining the industry question-answer pair corpus corresponding to the maximum answer text similarity comprises:
and calculating the text similarity of the answer text of each interactive turn in the question-answer pair text data and the answer text in the industry question-answer pair corpus, and determining the industry question-answer pair corpus corresponding to the maximum value of the answer text similarity of each interactive turn.
6. The method according to claim 5, wherein the calculating the text similarity of the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity and determining the question-answer pair text data corresponding to the maximum question text similarity comprises:
and respectively calculating the text similarity of the question text of each interactive turn in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum value of the answer text similarity of each interactive turn, and determining the question-answer pair text data corresponding to the maximum value of the question text similarity of each interactive turn.
7. The method of claim 6, wherein comparing the maximum similarity of the question text with a preset threshold comprises:
and comparing the maximum value of the similarity of the question texts under each interaction turn with a preset threshold value respectively.
8. A system for generating a similarity problem based on a text similarity algorithm, the system comprising: the system comprises a data acquisition module, an answer text similarity calculation module, a question text similarity calculation module and a similar question extraction module:
the data acquisition module is used for acquiring interactive scene text data and generating question-answer pair text data;
the answer text similarity calculation module is used for calculating the text similarity between the answer text in the question-answer pair text data and the answer text in the industry question-answer pair corpus and determining the industry question-answer pair corpus corresponding to the maximum answer text similarity;
the question text similarity calculation module is used for calculating the text similarity of the question text in the question-answer pair text data and the question text in the industry question-answer pair corpus corresponding to the maximum answer text similarity, and determining the question-answer pair text data corresponding to the maximum question text similarity;
and the similar question extraction module is used for comparing the maximum similarity of the question texts with a preset threshold, and if the preset threshold is met, supplementing the question texts in the question-answer pair text data corresponding to the maximum similarity of the question texts into the corresponding industry question-answer pair corpus to serve as the similar questions of the question texts in the corresponding industry question-answer pair corpus.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the method for generating a similarity problem based on a text similarity algorithm according to any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program, wherein the computer program is arranged to execute the method for generating a similarity problem based on a text similarity algorithm according to any one of claims 1 to 7 when running.
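For reference, the five measures named in claim 4 can be sketched over numeric text vectors as shown below. How texts are turned into vectors is not specified here and is left to the caller. Four of the measures are distances rather than similarities, so `distance_to_similarity` is a hypothetical conversion added only so that a larger value means more alike; it is not taken from the present application.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def minkowski_distance(u, v, p):
    """Minkowski distance of order p (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def manhattan_distance(u, v):
    """Minkowski distance with p = 1."""
    return minkowski_distance(u, v, 1)


def euclidean_distance(u, v):
    """Minkowski distance with p = 2."""
    return minkowski_distance(u, v, 2)


def chebyshev_distance(u, v):
    """Largest coordinate difference (the limit of Minkowski as p grows)."""
    return max(abs(a - b) for a, b in zip(u, v))


def distance_to_similarity(d):
    """Assumed mapping from a distance in [0, inf) to a score in (0, 1]."""
    return 1.0 / (1.0 + d)
```

With a distance-based measure, the "maximum similarity" in the method corresponds to the minimum distance, which the conversion above preserves.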
CN202110367263.6A 2021-04-06 2021-04-06 Method, system, electronic device and storage medium for generating similarity problem based on text similarity algorithm Active CN113139040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110367263.6A CN113139040B (en) 2021-04-06 2021-04-06 Method, system, electronic device and storage medium for generating similarity problem based on text similarity algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110367263.6A CN113139040B (en) 2021-04-06 2021-04-06 Method, system, electronic device and storage medium for generating similarity problem based on text similarity algorithm

Publications (2)

Publication Number Publication Date
CN113139040A (en) 2021-07-20
CN113139040B (en) 2022-08-09

Family

ID=76811689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110367263.6A Active CN113139040B (en) 2021-04-06 2021-04-06 Method, system, electronic device and storage medium for generating similarity problem based on text similarity algorithm

Country Status (1)

Country Link
CN (1) CN113139040B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115225328A (en) * 2022-06-21 2022-10-21 杭州安恒信息技术股份有限公司 Page access data processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170351677A1 (en) * 2016-06-03 2017-12-07 International Business Machines Corporation Generating Answer Variants Based on Tables of a Corpus
CN107562789A (en) * 2017-07-28 2018-01-09 深圳前海微众银行股份有限公司 Knowledge base problem update method, customer service robot and readable storage medium storing program for executing
CN110909165A (en) * 2019-11-25 2020-03-24 杭州网易再顾科技有限公司 Data processing method, device, medium and electronic equipment
CN111177307A (en) * 2019-11-22 2020-05-19 深圳壹账通智能科技有限公司 Test scheme and system based on semantic understanding similarity threshold configuration
CN112527972A (en) * 2020-12-25 2021-03-19 东云睿连(武汉)计算技术有限公司 Intelligent customer service chat robot implementation method and system based on deep learning
CN112527985A (en) * 2020-12-04 2021-03-19 杭州远传新业科技有限公司 Unknown problem processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN113139040B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN107992543B (en) Question-answer interaction method and device, computer equipment and computer readable storage medium
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
CN108427707B (en) Man-machine question and answer method, device, computer equipment and storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN110472002B (en) Text similarity obtaining method and device
KR20200064198A (en) Stock prediction method and apparatus by ananyzing news article by artificial neural network model
CN112084789A (en) Text processing method, device, equipment and storage medium
CN112883193A (en) Training method, device and equipment of text classification model and readable medium
CN110019749B (en) Method, apparatus, device and computer readable medium for generating VQA training data
CN113704428A (en) Intelligent inquiry method, device, electronic equipment and storage medium
WO2021114836A1 (en) Text coherence determining method, apparatus, and device, and medium
CN113139040B (en) Method, system, electronic device and storage medium for generating similarity problem based on text similarity algorithm
CN111524043A (en) Method and device for automatically generating litigation risk assessment questionnaire
CN113705207A (en) Grammar error recognition method and device
CN112307754A (en) Statement acquisition method and device
CN111400340A (en) Natural language processing method and device, computer equipment and storage medium
CN113342932B (en) Target word vector determining method and device, storage medium and electronic device
US20220277145A1 (en) Domain Context Ellipsis Recovery for Chatbot
CN114398903A (en) Intention recognition method and device, electronic equipment and storage medium
CN114492450A (en) Text matching method and device
CN113392220A (en) Knowledge graph generation method and device, computer equipment and storage medium
CN113590786A (en) Data prediction method, device, equipment and storage medium
CN111967253A (en) Entity disambiguation method and device, computer equipment and storage medium
CN113434631A (en) Emotion analysis method and device based on event, computer equipment and storage medium
Doval et al. Shallow Recurrent Neural Network for Personality Recognition in Source Code.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 23011, Yuejiang commercial center, No. 857, Xincheng Road, Puyan street, Binjiang District, Hangzhou, Zhejiang 311611
Applicant after: Hangzhou Yuanchuan Xinye Technology Co.,Ltd.
Address before: 23 / F, World Trade Center, 857 Xincheng Road, Binjiang District, Hangzhou City, Zhejiang Province, 310051
Applicant before: Hangzhou Yuanchuan New Technology Co.,Ltd.
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: Method, system, electronic device, and storage medium for generating similarity problems based on text similarity algorithm
Effective date of registration: 20230509
Granted publication date: 20220809
Pledgee: China Everbright Bank Limited by Share Ltd. Hangzhou branch
Pledgor: Hangzhou Yuanchuan Xinye Technology Co.,Ltd.
Registration number: Y2023980040155