CN113806505A - Element comparison method and device, electronic equipment and storage medium


Info

Publication number
CN113806505A
Authority
CN
China
Prior art keywords: text, interactive, transcription, sliding window, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111055523.2A
Other languages
Chinese (zh)
Other versions
CN113806505B (en)
Inventor
田鹏
何春江
庄纪军
胡加学
赵乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111055523.2A
Publication of CN113806505A
Application granted
Publication of CN113806505B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

The invention provides an element comparison method and device, an electronic device, and a storage medium. The method comprises: determining the audio of each party generated by an audio interaction; performing voice transcription on the audio of each party to obtain an interactive transcription text; performing element extraction on the interactive transcription text based on the semantics of the interactive transcription text, to obtain interactive elements of the audio interaction; and performing element comparison based on the interactive elements. Because the interactive transcription text is obtained by transcribing each party's audio and element extraction is driven by the semantics of that text, the method, device, electronic device, and storage medium generalize well, meet element extraction requirements in a variety of scenarios, and make full use of the context of the audio interaction, so the reliability and accuracy of element extraction can be ensured even in complex scenarios. Element comparison performed on this basis can promptly surface errors in the audio interaction and trigger rapid blocking reminders, thereby improving the quality of the audio interaction.

Description

Element comparison method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of natural language understanding, and in particular to an element comparison method and device, an electronic device, and a storage medium.
Background
In a customer service marketing scenario, an agent typically communicates with a customer by telephone. During the call, the agent needs to confirm transaction information related to the product purchase with the customer and enter that information into a system. If the transaction information entered by the agent is inconsistent with what the customer finally confirms, the transaction information is erroneous and adverse consequences follow. The agent therefore needs to be reminded promptly whenever an error occurs during this process.
At present, element comparison is one way to find agent errors: keywords are matched in the respective voice transcription texts of the agent and the customer according to preset keyword matching rules and treated as elements, and the elements obtained by rule matching are compared against the elements in the transaction information entered by the agent to perform a consistency check.
However, because language can be expressed in diverse ways, keywords cannot be exhaustively enumerated; keyword matching rules therefore generalize poorly and in practice often lead to missed and false detections. Moreover, keyword matching rules cannot model the conversational context, so for the complex multi-turn interactions common in practice it is difficult to obtain satisfactory element extraction results.
Disclosure of Invention
The invention provides an element comparison method and device, an electronic device, and a storage medium, to solve the prior-art problems that element extraction by keyword matching rules generalizes poorly and cannot cope with the complexity of multi-turn interaction.
The invention provides an element comparison method, comprising:
determining the audio of each party generated by an audio interaction;
performing voice transcription on the audio of each party to obtain an interactive transcription text;
performing element extraction on the interactive transcription text based on the semantics of the interactive transcription text, to obtain interactive elements of the audio interaction;
and performing element comparison based on the interactive elements.
According to an element comparison method provided by the invention, performing element extraction on the interactive transcription text based on the semantics of the interactive transcription text to obtain the interactive elements of the audio interaction comprises:
performing sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text;
performing element extraction on each sliding window text based on its semantics, to obtain the text elements of each sliding window text;
and integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction.
According to the element comparison method provided by the invention, integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction comprises:
updating the previous interactive elements based on the text elements of the current sliding window text in the text sequence to obtain the current interactive elements, and taking the sliding window text that follows the current one in the text sequence as the new current sliding window text, until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interactive elements as the interactive elements of the audio interaction.
According to the element comparison method provided by the invention, updating the previous interactive elements based on the text elements of the current sliding window text in the text sequence to obtain the current interactive elements comprises:
determining a first element value and/or a second element value among the text elements of the current sliding window text, wherein the previous interactive elements already contain an element value under the element name corresponding to the first element value, and lack an element value under the element name corresponding to the second element value;
and replacing, based on the first element value, the element value in the previous interactive elements whose element name matches that of the first element value, and/or supplementing the second element value into the previous interactive elements, to obtain the current interactive elements.
According to the element comparison method provided by the invention, performing element extraction on each sliding window text based on its semantics to obtain the text elements of each sliding window text comprises:
determining the text sequence increment of the current time period based on the text sequence of the previous time period and the text sequence of the current time period;
and performing element extraction on each sliding window text in the text sequence increment based on its semantics, to obtain the text elements of each sliding window text in the text sequence increment.
According to an element comparison method provided by the invention, performing voice transcription on the audio of each party to obtain an interactive transcription text comprises:
performing voice transcription on the real-time audio of each party in the current time period to obtain the transcription text of the current time period;
and splicing the transcription text of the current time period after the interactive transcription text of the previous time period to obtain the interactive transcription text of the current time period.
According to the element comparison method provided by the invention, performing voice transcription on the real-time audio of each party in the current time period to obtain the transcription text of the current time period comprises:
performing voice transcription on the real-time audio of each party separately to obtain the role transcription text of each party;
and splicing the role transcription texts of the parties in chronological order, based on the time intervals that the role transcription texts occupy in the corresponding real-time audio, to obtain the transcription text of the current time period.
According to the element comparison method provided by the invention, after performing element comparison based on the interactive elements, the method further comprises:
sending an abnormal result produced by the element comparison to at least one of the parties, to prompt the at least one party to confirm the elements concerned.
The invention also provides an element comparison device, comprising:
an audio determining unit, configured to determine the audio of each party generated by an audio interaction;
a voice transcription unit, configured to perform voice transcription on the audio of each party to obtain an interactive transcription text;
an element extraction unit, configured to perform element extraction on the interactive transcription text based on the semantics of the interactive transcription text, to obtain the interactive elements of the audio interaction;
and an element comparison unit, configured to perform element comparison based on the interactive elements.
The invention also provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any of the element comparison methods described above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the element comparison methods described above.
With the element comparison method and device, the electronic device, and the storage medium described above, the interactive transcription text is obtained by transcribing each party's audio, and element extraction is performed based on the semantics of that text. The approach therefore generalizes well, meets element extraction requirements in a variety of scenarios, and makes full use of the context of the audio interaction, so the reliability and accuracy of element extraction can be ensured even in complex scenarios. Element comparison performed on this basis can promptly surface errors in the audio interaction and trigger rapid blocking reminders, thereby improving the quality of the audio interaction.
Drawings
To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below obviously show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a prior art element comparison method;
FIG. 2 is a schematic flow chart of a method for comparing elements according to the present invention;
FIG. 3 is a schematic flow chart of step 230 in the element comparison method provided by the present invention;
FIG. 4 is a schematic flow chart of step 232 of the element comparison method provided in the present invention;
FIG. 5 is a schematic flow chart of step 220 in the element comparison method provided by the present invention;
FIG. 6 is a schematic flow chart of step 221 in the element comparison method provided in the present invention;
FIG. 7 is a second schematic flow chart of the element comparison method provided by the present invention;
FIG. 8 is a schematic structural diagram of an element extraction model provided by the present invention;
FIG. 9 is a schematic flow chart of a method for extracting elements provided by the present invention;
FIG. 10 is a schematic structural diagram of an element comparison apparatus provided in the present invention;
fig. 11 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
In a customer service marketing scenario, an agent typically communicates with a customer by telephone. During the call, the agent needs to confirm transaction information related to the product purchase, such as the purchase amount, term, and interest rate, and enter that information into a system. If the transaction information entered by the agent is inconsistent with what the customer finally confirms, the transaction information is erroneous and adverse consequences follow. When the agent makes a mistake, finding the problem in time and quickly blocking and reminding is therefore particularly important for improving the agent's marketing quality and reducing customer complaints.
At present, element comparison is one way to find agent errors: keywords are matched in the respective voice transcription texts of the agent and the customer according to preset keyword matching rules and treated as elements, and the elements obtained by rule matching are compared against the elements in the transaction information entered by the agent to perform a consistency check.
For example, fig. 1 is a schematic flow chart of an element comparison method in the prior art. As shown in fig. 1, during the audio interaction between an agent and a customer, online voice transcription can be performed separately on the agent's speech and the customer's speech to obtain their respective voice transcription texts, namely the agent text and the customer text.
Keyword matching is then performed on the agent text and the customer text according to the preset keyword matching rules, and the matched keywords are taken as the elements extracted from the text.
After the elements are extracted from the text, they can be compared against the elements entered by the agent into an element database or a third-party system. Once the comparison is complete, the result can be fed back to the agent in real time; in particular, elements that failed the comparison can be returned to the agent for confirmation.
However, because language can be expressed in diverse ways, keywords cannot be exhaustively enumerated; keyword matching rules therefore generalize poorly and in practice often lead to missed and false detections. Moreover, consider the complex multi-turn interactions of practical applications: if the agent asks "May I confirm that you want to purchase the 50,000-yuan product?", then when the customer answers "yes", "50,000 yuan" is a "purchase amount" element that should be extracted, whereas when the customer answers "no", it is not.
In view of the above problems, an embodiment of the present invention provides an element comparison method, applicable to scenarios such as audio customer service and video customer service, in which the parties interact through audio or video and the elements involved in the interaction are recorded. Fig. 2 is a schematic flow chart of the element comparison method provided by the present invention. As shown in fig. 2, the method includes:
step 210, determining the audio of each party generated by the audio interaction.
Specifically, the audio interaction may involve two parties, or three or more parties. For example, in an audio or video customer service scenario the two interacting parties may be the agent and the customer, while an audio or video conference may have many participants; the embodiment of the present invention does not specifically limit this.
During the audio interaction, every participating party produces audio. For example, in an audio customer service scenario, the pickup device on the agent side collects the agent's speech in real time, and the pickup device on the customer side collects the customer's speech in real time.
Step 220, performing voice transcription on the audio of each party to obtain an interactive transcription text.
Specifically, after the audio generated by each party of the audio interaction is obtained, real-time voice transcription needs to be performed on it, so as to obtain an interactive transcription text containing the full information of the audio interaction. The interactive transcription text here includes the transcription text of every party in the audio interaction; for example, in an audio customer service scenario it contains the transcription of everything the agent and the customer said during the interaction.
When performing voice transcription, the audio of each party may be transcribed separately to obtain each party's transcription text, and the transcription texts may then be spliced according to their corresponding time axes to obtain an interactive transcription text that reflects the information of all parties in the audio interaction. Furthermore, when transcribing each party's audio separately, a voice transcription model matched to that party's role characteristics can be applied, yielding transcription texts that fit each party's manner of expression; splicing on this basis helps ensure the reliability and accuracy of the interactive transcription text.
Alternatively, voice endpoint detection may first be performed on each party's audio; the detected active speech is merged along its time axis, and voice transcription is performed on the merged active speech to obtain the interactive transcription text. When transcribing the merged active speech, the voice transcription model can take in the whole of the audio interaction and the context of each utterance, which improves the transcription and hence the reliability and accuracy of the interactive transcription text.
Step 230, performing element extraction on the interactive transcription text based on the semantics of the interactive transcription text, to obtain the interactive elements of the audio interaction.
Specifically, considering that the conventional element extraction approach, i.e., extraction based on keyword matching rules, is prone to missed and false detections and cannot model the conversational context, the embodiment of the present invention performs element extraction on the interactive transcription text based on its semantics, thereby obtaining more reliable and accurate interactive elements.
Unlike prior-art element extraction based on keyword matching rules, which attends only to a single clause or even a single word, the interactive transcription text reflects the global information of the ongoing audio interaction; this avoids the loss of context information that degrades the accuracy and reliability of element extraction in complex scenarios.
Also unlike the prior art, element extraction based on the semantics of the interactive transcription text requires no keyword matching rules built from pre-enumerated keywords, so it does not suffer from poor generalization and can be applied to a wider variety of scenarios. And since the interactive transcription text reflects the complete information of the audio interaction, the semantics extracted from it also cover the context of the interaction, so more accurate and reliable interactive elements can be extracted even in complex scenarios.
Element extraction based on the semantics of the interactive transcription text can be realized through a pre-trained element extraction model. The element extraction model may itself have semantic extraction capability, first extracting the semantics of the input interactive transcription text and then extracting elements from those semantics; or the model may lack semantic extraction capability, in which case a language model that has semantic extraction capability first extracts the semantics of the interactive transcription text, and the extracted semantics are then input into the element extraction model for element extraction. The embodiment of the present invention does not specifically limit this. For example, semantics-based element extraction may be implemented with an entity recognition model: the preset kinds of interactive elements are treated as entities, the interactive transcription text is treated as the text requiring entity recognition, and the interactive-element entities contained in the interactive transcription text are recognized by the entity recognition model, thereby obtaining the interactive elements.
Step 240, performing element comparison based on the interactive elements.
Specifically, once the interactive elements of the audio interaction are obtained, they may be compared with the standard elements in a preset element database, or with the elements that one or more parties of the audio interaction have entered into a system. For example, when the interactive elements are compared with the standard elements in a preset element database and an inconsistency is found, one or more parties may have misspoken an element, and the comparison result can be returned to them for confirmation and correction. For another example, when the interactive elements are compared with the elements entered into the system by one or more parties and an inconsistency is found, an entry error may have occurred, and the comparison result can be returned to the entering party for confirmation and correction.
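For illustration only, the following Python sketch shows one way such a comparison could be implemented; the dictionary representation and the element names are assumptions, not part of the claimed method.

```python
# Minimal sketch of element comparison (illustrative assumption: elements
# are dicts mapping element names to element values).
def compare_elements(interactive: dict, entered: dict) -> dict:
    """Return the mismatched part, i.e. the abnormal result.

    A name missing on one side, or carrying different values on the two
    sides, counts as a mismatch.
    """
    abnormal = {}
    for name in set(interactive) | set(entered):
        extracted, recorded = interactive.get(name), entered.get(name)
        if extracted != recorded:
            abnormal[name] = {"extracted": extracted, "entered": recorded}
    return abnormal

# Example: the agent entered 30,000 but the dialogue settled on 10,000.
print(compare_elements({"purchase amount": "10,000"},
                       {"purchase amount": "30,000"}))
# {'purchase amount': {'extracted': '10,000', 'entered': '30,000'}}
```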
The method provided by the embodiment of the invention obtains the interactive transcription text by transcribing each party's audio and performs element extraction based on the semantics of that text. It therefore generalizes well, meets element extraction requirements in a variety of scenarios, and makes full use of the context of the audio interaction, so the reliability and accuracy of element extraction can be ensured in complex scenarios. Element comparison on this basis can promptly surface errors in the audio interaction and trigger rapid blocking reminders, thereby improving the quality of the audio interaction.
Based on the above embodiment, the audio of each party determined in step 210 may be real-time audio, i.e., audio captured by live recording. The real-time audio covers everything from the start of the audio interaction up to the current moment and is continuously updated as time passes.
Correspondingly, when the audio in step 220 is real-time audio, the interactive transcription text obtained by transcribing the real-time audio of each party contains all the transcription text produced so far in the audio interaction, and likewise keeps being updated and growing over time.
In steps 230 and 240, element extraction can then be performed on the continuously updated interactive transcription text, enabling real-time element comparison: potential problems in the audio interaction can be found in time and blocked with quick reminders, improving the quality of the audio interaction.
Since element extraction based on the semantics of the interactive transcription text is usually limited by the text length allowed in a single extraction, in order to ensure the extraction effect, based on the above embodiment, fig. 3 is a schematic flow chart of step 230 in the element comparison method provided by the present invention. As shown in fig. 3, step 230 includes:
Step 231, performing sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text.
The interactive transcription text here can be understood as the interactive transcription text determined at the current moment, containing all speech from the start of the audio interaction up to now. As time passes and the current moment moves forward, the interactive transcription text keeps being updated and growing longer, so once the audio interaction has lasted for a while, its length is quite likely to exceed the text length limit of a single element extraction.
Specifically, because of this length limit, sliding window processing needs to be applied to the interactive transcription text before element extraction. The sliding window processing divides the interactive transcription text into several sliding window texts whose lengths are within the limit. Each sliding window text has a position in the text sequence, namely the order in which the window passed over it during the sliding window processing.
To ensure that each sliding window text satisfies the text length limit of a single element extraction, the window length must be less than or equal to that limit; for example, if a single extraction can process at most 512 characters, the window length may be set to 500, or to 512, 450, and so on. In addition, to prevent the sliding window texts from becoming completely isolated from one another, which would hurt the extraction effect, the window step should be smaller than the window length, so that adjacent sliding window texts overlap; when elements are extracted from the sliding window texts in batch, context information can still be referenced through these overlaps, improving the reliability and accuracy of element extraction. For example, with a window length of 500 the step may be set to 50 or 100. Suppose the interactive transcription text is 600 characters long, the window length is 500, and the step is 50: sliding window processing then yields a text sequence of three sliding window texts, where the first covers characters 0-499 of the interactive transcription text, the second characters 50-549, and the third characters 100-599.
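For illustration only, a minimal Python sketch of this sliding window processing follows, using the window length 500 and step 50 from the example above; the function name and signature are assumptions.

```python
# Sliding window processing: window length <= the single-extraction limit,
# step < window length so that adjacent windows overlap.
def sliding_windows(text: str, window: int = 500, step: int = 50) -> list[str]:
    if len(text) <= window:
        return [text]
    windows, start = [], 0
    while True:
        windows.append(text[start:start + window])
        if start + window >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows

# A 600-character text yields three windows: characters 0-499, 50-549, 100-599.
ws = sliding_windows("x" * 600)
assert len(ws) == 3 and all(len(w) == 500 for w in ws)
```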
Step 232, performing element extraction on each sliding window text based on its semantics, to obtain the text elements of each sliding window text.
Specifically, after the sliding window processing is complete, element extraction can be performed on the sliding window texts in batch. Batch element extraction here means that element extraction is performed separately for each single sliding window text; the extractions of the individual sliding window texts may run synchronously, or sequentially in batches, which the embodiment of the present invention does not limit.
Element extraction on a single sliding window text can be realized through a pre-trained element extraction model. The model may itself have semantic extraction capability, first extracting the semantics of the input sliding window text and then extracting elements from them; or the model may lack that capability, in which case a language model with semantic extraction capability first extracts the semantics of the sliding window text, and the extracted semantics are then input into the element extraction model for element extraction. The embodiment of the present invention does not specifically limit this. The models used to extract elements from the individual sliding window texts may be the same element extraction model.
Step 233, integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction.
Specifically, after the text elements of each sliding window text are obtained, they can be integrated. One option is to put the text elements of all sliding window texts into a single set and then remove the duplicates. Another option is to integrate the text elements window by window, from front to back in the order of the sliding window texts in the text sequence: if a text element of the current sliding window text did not appear among the text elements of the preceding sliding window texts, it is taken as an interactive element; if it appeared and is consistent with the earlier one, it can simply be ignored; and if it appeared but differs from the earlier one, the element of the same kind among the existing interactive elements can be replaced with the new text element. The embodiment of the present invention does not specifically limit this.
In the method provided by the embodiment of the invention, sliding window processing is applied to the interactive transcription text and element extraction is performed on each sliding window text obtained, which prevents over-long text from degrading the extraction, while the overlaps between sliding window texts ensure that context can still be referenced during extraction, improving the reliability and accuracy of element extraction.
Based on any of the above embodiments, step 233 includes:
updating the previous interactive elements based on the text elements of the current sliding window text in the text sequence to obtain the current interactive elements, and taking the sliding window text that follows the current one in the text sequence as the new current sliding window text, until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interactive elements as the interactive elements of the audio interaction.
Specifically, adjacent sliding window texts in the text sequence obtained by sliding window processing overlap, so the same span of text may undergo element extraction in several windows, and those overlapping spans may yield different text elements in different sliding window texts.
Considering that the text elements extracted from the same span in different sliding window texts may differ, the embodiment of the present invention integrates the text elements according to the order of the sliding window texts in the text sequence; specifically, the interactive elements determined from the text elements of earlier sliding window texts are updated with the text elements of later ones. Text elements from later sliding window texts are trusted more by default because, as the audio interaction proceeds, each party has more complete information, expresses itself more clearly, and has a clearer intention, so the semantics reflected in later sliding window texts are more accurate. For example, suppose the interactive transcription text is: "Agent: We have a 10,000-yuan product; would that meet your needs? Customer: Are there any others? Agent: There is also a 50,000-yuan product; what do you think? Customer: Then I'll take the 10,000-yuan one." Purchase-amount values of 10,000, 50,000, and again 10,000 appear over the course of the dialogue, and the text element "10,000" obtained from the last sliding window text is the purchase amount the customer finally confirmed. For text elements of the same kind, therefore, those extracted from later sliding window texts are more reliable.
Based on this, in the embodiment of the present invention, when the text sequence contains multiple sliding window texts, the existing interactive elements are updated one window at a time, from front to back in the order of the text sequence, using the text elements of each sliding window text, so as to obtain new interactive elements. When the update based on the text elements of the last sliding window text in the text sequence is complete, the finally updated interactive elements are taken as the interactive elements of the audio interaction.
Based on any of the above embodiments, in step 233, updating the previous interactive elements based on the text elements of the current sliding window text in the text sequence to obtain the current interactive elements includes:
determining a first element value and/or a second element value among the text elements of the current sliding window text, wherein the previous interactive elements already contain an element value under the element name corresponding to the first element value, and lack an element value under the element name corresponding to the second element value;
and replacing, based on the first element value, the element value in the previous interactive elements whose element name matches that of the first element value, and/or supplementing the second element value into the previous interactive elements, to obtain the current interactive elements.
Specifically, a text element or an interactive element can be represented in the form "element name - element value", where the element value is the concrete value extracted from the text, e.g. "10,000" or "50,000", and the element name is the name of the element that the value actually represents; for the element values "10,000" and "50,000", for example, the element name is "purchase amount".
When updating the previous interactive elements with the text elements of the current sliding window text, it must first be determined which of those text elements carry element names for which the previous interactive elements already contain a value, and which carry element names for which they do not. An element value in the text elements of the current sliding window text whose element name already has a value in the previous interactive elements may be designated the first element value, and one whose element name has no value there yet may be designated the second element value. In practice, the text elements of the current sliding window text may contain only first element values, only second element values, or both.
When the text elements of the current sliding window text contain a first element value, the element value under the corresponding element name in the previous interactive elements can be replaced with it directly. For example, if the value of the "purchase amount" element in the previous interactive elements is "30,000" and the first element value is "50,000", then "30,000" is replaced with "50,000", and in the current interactive elements obtained after the replacement the value of the "purchase amount" element is "50,000".
When the text elements of the current sliding window text contain a second element value, it can be added to the previous interactive elements directly. For example, if no "product term" element is present in the previous interactive elements, a second element value of "2 years" under "product term" can be added to them directly.
When the text elements of the current sliding window text contain both a first element value and a second element value, the current interactive elements are obtained by replacing the matching element value in the previous interactive elements with the first element value and adding the second element value to the previous interactive elements.
It is then checked whether the current sliding window text is followed by another sliding window text in the text sequence. If so, that next sliding window text becomes the current sliding window text, the current interactive elements become the previous interactive elements, and the above steps are repeated until the current sliding window text is the last one in the text sequence.
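For illustration only, a minimal Python sketch of this update rule, together with the window-by-window integration of step 233, follows; representing elements as a dict mapping element names to element values is an assumption.

```python
# Minimal sketch of the update rule: first element values replace existing
# values, second element values are supplemented. With a dict keyed by
# element name, both cases reduce to a single assignment, so later windows
# naturally prevail.
def update_interactive_elements(previous: dict, window_elements: dict) -> dict:
    current = dict(previous)
    for name, value in window_elements.items():
        current[name] = value  # replace (first value) or supplement (second)
    return current

def integrate(per_window_elements: list[dict]) -> dict:
    """Fold the text elements of each window front to back (step 233)."""
    interactive: dict = {}
    for window_elements in per_window_elements:
        interactive = update_interactive_elements(interactive, window_elements)
    return interactive

# Mirrors the examples above: "purchase amount" 30,000 is replaced by 50,000,
# and "product term" 2 years is supplemented.
print(integrate([{"purchase amount": "30,000"},
                 {"purchase amount": "50,000", "product term": "2 years"}]))
# {'purchase amount': '50,000', 'product term': '2 years'}
```

Collapsing replacement and supplementation into one assignment also matches the element sorting rule used later: under the same element name, the value extracted last prevails.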
Based on any of the above embodiments, fig. 4 is a schematic flow chart of step 232 in the element comparison method provided by the present invention. As shown in fig. 4, step 232 includes:
Step 2321, determining the text sequence increment of the current time period based on the text sequence of the previous time period and the text sequence of the current time period;
Step 2322, performing element extraction on each sliding window text in the text sequence increment based on its semantics, to obtain the text elements of each sliding window text in the text sequence increment.
Specifically, during the audio interaction the audio of each party is constantly updated, and the interactive transcription text keeps growing as the interaction proceeds. For example, in an audio customer service scenario the interactive transcription text contains the transcription of everything the agent and the customer say during the interaction: if the agent speaks in the first time period, the interactive transcription text of the first period is A1; the customer speaks in the second period, so the interactive transcription text of the second period is A1, B1; the agent speaks in the third period, giving A1, B1, A2; and the customer speaks in the fourth period, giving A1, B1, A2, B2.
During the interaction, element extraction needs to be performed on the interactive transcription text of each time period. The length of a time period may be a preset fixed length, or the speech length produced by voice endpoint detection on the audio; the embodiment of the present invention does not specifically limit this.
Considering that the interactive transcription text of the current time period is in fact the interactive transcription text of the previous time period with the latest transcription of each party appended, the two overlap substantially. Element extraction could simply be rerun on the whole interactive transcription text of the current period, but that would repeat a great deal of work.
Based on this, in the embodiment of the present invention, after the sliding window processing of the current period's interactive transcription text yields the current period's text sequence, that text sequence can be compared with the previous period's text sequence to determine the sliding window texts newly added in the current period, i.e., the text sequence increment of the current time period.
When extracting elements for the current period, element extraction may then be performed only on the sliding window texts contained in the text sequence increment, yielding the text elements of each sliding window text in the increment. On this basis, the text elements of the current period's text sequence are the text elements of the previous period's text sequence plus those of the current increment; step 233 is executed on them to integrate the text elements, yielding the interactive elements of the audio interaction for the current period.
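For illustration only, a minimal Python sketch of this incremental scheme follows; the list and dict representations, and the extract callback standing in for the element extraction model, are assumptions.

```python
# Minimal sketch of incremental extraction per time period (steps 2321-2322).
def sequence_increment(prev_seq: list[str], curr_seq: list[str]) -> list[str]:
    """Sliding window texts of the current period not extracted last period."""
    n = 0
    # Windows keep their start offsets as the text grows, so the shared prefix
    # of identical windows has already been extracted; a last window that was
    # still partially filled changes content and falls into the increment.
    while n < len(prev_seq) and prev_seq[n] == curr_seq[n]:
        n += 1
    return curr_seq[n:]

def extract_for_period(prev_seq: list[str], curr_seq: list[str],
                       prev_elements: list[dict], extract) -> list[dict]:
    increment = sequence_increment(prev_seq, curr_seq)
    kept = len(curr_seq) - len(increment)
    # Reuse the earlier windows' text elements; extract only the increment.
    return prev_elements[:kept] + [extract(w) for w in increment]
```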
In the method provided by the embodiment of the invention, each period's element extraction only processes the text sequence increment, which preserves the quality of element extraction while reducing the computation required for real-time extraction and avoiding repeated work.
Based on any of the above embodiments, fig. 5 is a schematic flow chart of step 220 in the element comparison method provided by the present invention. As shown in fig. 5, step 220 includes:
Step 221, performing voice transcription on the real-time audio of each party in the current time period to obtain the transcription text of the current time period.
Step 222, splicing the transcription text of the current time period after the interactive transcription text of the previous time period to obtain the interactive transcription text of the current time period.
Specifically, during the audio interaction the real-time audio of each party is constantly updated; reflected in the real-time voice transcription process, the interactive transcription text likewise keeps growing as the interaction proceeds.
To guarantee the completeness and comprehensiveness of the interactive elements obtained by subsequent element extraction, the interactive transcription text obtained in each time period must contain all interaction information from the start of the audio interaction to the end of the current period. Therefore, during real-time transcription, the transcription text obtained by transcribing the current period's real-time audio can be spliced onto the interactive transcription text of the previous period, yielding the interactive transcription text of the current period.
For example, if the agent speaks in the first time period, the transcription text is A1 and the interactive transcription text is A1; if the customer speaks in the second period, the transcription text is B1, and splicing B1 onto the first period's interactive transcription text A1 gives the second period's interactive transcription text, A1, B1.
Based on any of the above embodiments, step 221 includes:
performing voice transcription on the real-time audio of each party in the current time period separately, to obtain the role transcription text of each party;
and splicing the role transcription texts of the parties in chronological order, based on the time intervals that the role transcription texts occupy in the corresponding real-time audio, to obtain the transcription text of the current time period.
Specifically, when transcribing the real-time audio, the real-time audio of each party in the current time period can be transcribed separately, yielding each party's role transcription text, which reflects the transcription of the corresponding speaker. The role transcription texts of the parties are then spliced in the chronological order of the time intervals they occupy on the time axis of the corresponding real-time audio, yielding a transcription text that reflects the information of all parties in the current period of the audio interaction.
For example, fig. 6 is a schematic flow chart of step 221 in the element comparison method provided by the present invention. As shown in fig. 6, after voice endpoint detection is performed on the agent audio of the current period, voice transcription yields n segments of role transcription text, a1 to an; after voice endpoint detection on the customer audio, voice transcription yields m segments of role transcription text, b1 to bm. On this basis, combining the time intervals that a1 to an and b1 to bm occupy on the time axis, the segments can be spliced to obtain the transcription text of the current period, e.g. a1, b1, a2, b2, and so on.
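For illustration only, a minimal Python sketch of this time-ordered splicing follows; the segment structure is an assumption.

```python
# Minimal sketch of splicing per-role transcription segments by the time
# intervals they occupy in the audio (step 221).
from dataclasses import dataclass

@dataclass
class Segment:
    role: str     # e.g. "agent" or "customer"
    start: float  # start of the segment's time interval in the audio, seconds
    text: str

def splice_by_time(agent_segs: list[Segment],
                   customer_segs: list[Segment]) -> str:
    ordered = sorted(agent_segs + customer_segs, key=lambda s: s.start)
    return " ".join(f"{s.role}: {s.text}" for s in ordered)

# Interleaves as a1, b1, a2, b2, matching the example of fig. 6.
print(splice_by_time(
    [Segment("agent", 0.0, "a1"), Segment("agent", 7.2, "a2")],
    [Segment("customer", 3.5, "b1"), Segment("customer", 10.1, "b2")]))
```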
Based on any of the above embodiments, after step 240, the method further includes:
and sending an abnormal result produced by the element comparison to at least one of the parties, to prompt the at least one party to confirm the elements concerned.
Here, after the element comparison of step 240, the result of the comparison is obtained. For each element the result is either a match or a mismatch: a match means the element is consistent on both sides of the comparison, while a mismatch means the element exists on only one side or carries inconsistent values. A mismatch can be regarded as an abnormality, and the mismatched portion forms the abnormal result of the element comparison.
After the abnormal result is obtained, it can be sent to at least one of the parties participating in the audio interaction. For example, in an audio customer service scenario it can be sent to the agent, or to the agent and the customer simultaneously so that both can confirm it, allowing the problem to be blocked quickly.
Afterwards, the interactive elements can be updated according to the confirmation result returned by the at least one party, thereby correcting the problem.
It should be noted that sending the abnormal result may be triggered in real time, e.g. every time an abnormal result is detected; or the trigger may be determined by the business process and fired at specific links, e.g. all abnormal results are sent together after the interaction ends, or the comparison of interactive elements is triggered when the agent clicks the purchase button while helping the user purchase a product.
Based on any of the above embodiments, fig. 7 is a second schematic flow chart of the element comparison method provided by the present invention. As shown in fig. 7, in an audio customer service scenario, the element comparison method may include the following steps:
in the audio interaction process of the seat and the client, online voice transcription can be respectively carried out on the seat voice and the client voice to obtain respective voice transcription texts of the seat and the client, namely the seat text and the client text. Because the voice transcription is executed on line in real time, after the transcription of a section of text is finished, a downstream task is called immediately to extract elements.
After the transfer of a section of text is finished, the newly transferred text is spliced with the previously transferred text, so that an interactive transfer text capable of reflecting all information from the beginning of audio interaction is obtained. Here, as the audio interaction advances, the interactive transcription text will grow longer and longer until the audio interaction terminates, and the interactive transcription text will not grow.
Element extraction can then be performed on the updated interactive transcription text. Specifically, element extraction can be realized through an element extraction model, which may adopt a BERT (Bidirectional Encoder Representations from Transformers) plus CRF (Conditional Random Fields) architecture; preferably, the BERT may use a 6-layer structure. The element extraction model may be as shown in fig. 8. Take the text "Agent: May I confirm that you want to purchase the 50,000-yuan financial product? Customer: Yes." as an example. Each character can be represented as a corresponding word vector w, forming the word vector sequence w_1, w_2, ..., w_t of the interactive transcription text, where w_t is the word vector of the t-th character and t is the length of the interactive transcription text. The word vector sequence of the interactive transcription text is input into BERT to obtain the semantic vectors output by BERT, namely the sequence of hidden vectors h_1, h_2, ..., h_t containing one hidden vector per character, where h_t is the hidden vector of the t-th character of the interactive transcription text. The semantic vectors are then input into the CRF to obtain the element extraction result output by the CRF, namely whether each character of the interactive transcription text belongs to an element and, if so, the type of the element and its position; this can be realized with entity labeling schemes such as BIO or BIOES. In fig. 8, B marks the beginning (Begin) of the "purchase amount" element, E marks its end (End), and O (Outside) marks characters that belong to no element.
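For illustration only, a minimal sketch of such a BERT + CRF tagger follows, assuming PyTorch, the Hugging Face transformers library, and the pytorch-crf package; the patent does not prescribe these libraries, and the model name and tag layout are assumptions.

```python
# Minimal BERT + CRF element extraction sketch (fig. 8), under the stated
# library assumptions. Tags follow a BIO-style scheme over the element types.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class ElementExtractor(nn.Module):
    def __init__(self, num_tags: int, bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # w_1..w_t -> h_1..h_t
        self.emission = nn.Linear(self.bert.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)        # tag transition model

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(h)          # per-character tag scores
        mask = attention_mask.bool()
        if tags is not None:                  # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```

Under BIO, num_tags would be 2 x (number of element types) + 1, and the decoded tag paths are read off into "element name - element value" pairs from the labeled character spans.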
In this process, the element extraction model has a limit on the length of its input text; for example, the text for a single extraction cannot exceed 512 characters. When the length of the interactive transcription text exceeds this single-extraction limit, a sliding window operation is needed to ensure that the text input to the element extraction model satisfies the length limit. Fig. 9 is a schematic flow chart of the element extraction method provided by the present invention. As shown in fig. 9, before the interactive transcription text is input into the element extraction model, it is first determined whether its length is greater than 512. If so, the interactive transcription text is subjected to sliding window processing, and each text obtained by sliding the window is input into the element extraction model separately for element extraction; otherwise, the interactive transcription text can be input into the element extraction model directly.
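A minimal sketch of this sliding window step, under stated assumptions, might look as follows. The 512-character limit comes from the text above; the overlap value is an assumed choice, used here so that an element spanning a window boundary still falls entirely inside some window.

```python
# Minimal sliding-window split (window size from the 512 limit above;
# the overlap value is an assumed, illustrative choice).
def sliding_windows(text, max_len=512, overlap=64):
    if len(text) <= max_len:
        return [text]                 # short enough: extract directly
    step = max_len - overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break                     # last window reaches the end of the text
    return windows

texts = sliding_windows("x" * 1200)
print([len(t) for t in texts])        # [512, 512, 304]
```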
After element extraction is completed, consider that the interactive transcription texts of successive time periods continuously overlap, so the text elements obtained by extracting from the interactive transcription text of each time period may contain several different element values under the same element name. For this situation, element sorting is required: when several different element values exist under the same element name, the element value at the latest extraction position, that is, the one obtained in the most recent extraction, prevails. For example, suppose the element names extracted from A1 are X1 and X2 with corresponding values X1-1 and X2-1; the names extracted from A1 and B1 are X1, X2 and X3 with values X1-2, X2-2 and X3-1; and the names extracted from A1, B1 and A2 are X3 and X4 with values X3-2 and X4-1. The finally extracted element names are then X1, X2, X3 and X4 with values X1-2, X2-2, X3-2 and X4-1, where X3 and X4 take their values from the third extraction and X1 and X2 take theirs from the second.
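The element sorting rule, under which the value from the latest extraction prevails, can be sketched in a few lines. Representing each extraction result as a dict from element name to element value is an assumption for illustration.

```python
# Element sorting sketch: merge per-period extraction results so that,
# for a repeated element name, the value from the latest extraction wins.
# (Representing each result as a dict is an assumed simplification.)
def merge_elements(extractions):
    merged = {}
    for result in extractions:        # results are ordered oldest -> newest
        merged.update(result)         # later values overwrite earlier ones
    return merged

runs = [
    {"X1": "X1-1", "X2": "X2-1"},
    {"X1": "X1-2", "X2": "X2-2", "X3": "X3-1"},
    {"X3": "X3-2", "X4": "X4-1"},
]
print(merge_elements(runs))
# {'X1': 'X1-2', 'X2': 'X2-2', 'X3': 'X3-2', 'X4': 'X4-1'}
```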
The text elements after element sorting are taken as the interactive elements, and element comparison is performed. The triggering of element comparison here may be determined by the business process. For example, for real-time quality inspection, a comparison is performed each time the agent finishes speaking; element sorting is then completed each time as well, so each time the interactive elements are updated, the updated interactive elements are compared. Alternatively, the comparison may be triggered at a specific process link, for example, the agent clicking the purchase button while helping the user purchase a product triggers the interactive element comparison.
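A simple form of the comparison step checks the sorted interactive elements against the expected values recorded on the business side, for example the order form the agent filled in. The expected-value source and the anomaly record format below are illustrative assumptions.

```python
# Element comparison sketch: flag mismatches between extracted interactive
# elements and expected business values (the `expected` source is assumed).
def compare_elements(interactive, expected):
    anomalies = []
    for name, want in expected.items():
        got = interactive.get(name)
        if got != want:
            anomalies.append({"element": name, "expected": want, "extracted": got})
    return anomalies

expected = {"purchase_amount": "50000", "product": "financial product A"}
interactive = {"purchase_amount": "5000", "product": "financial product A"}
print(compare_elements(interactive, expected))
# [{'element': 'purchase_amount', 'expected': '50000', 'extracted': '5000'}]
```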
After the element comparison is completed, the comparison result can be fed back to the agent in real time; in particular, the elements that failed the comparison can be fed back to the agent for confirmation.
Based on any of the above embodiments, fig. 10 is a schematic structural diagram of an element comparison apparatus provided by the present invention, as shown in fig. 10, the element comparison apparatus includes:
an audio determining unit 1010, configured to determine audio of each party generated by audio interaction;
a voice transcription unit 1020, configured to perform voice transcription on the audio of each party to obtain an interactive transcription text;
an element extraction unit 1030, configured to perform element extraction on the interactive transcription text based on the semantic meaning of the interactive transcription text, so as to obtain an interactive element of the audio interaction;
an element comparison unit 1040, configured to perform element comparison based on the interactive elements.
The device provided by the embodiment of the invention performs voice transcription on the audio of each party to obtain the interactive transcription text, and performs element extraction based on the semantics of the interactive transcription text. It therefore has good generalization capability, can meet the element extraction requirements of various scenes, fully applies the context of the audio interaction, and can ensure the reliability and accuracy of element extraction in complex scenes. Element comparison performed on this basis can find errors in the audio interaction in time and give a rapid blocking reminder, thereby improving the audio interaction quality.
According to any of the above embodiments, the element extracting unit 1030 includes:
the sliding window subunit is used for performing sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text;
the extraction subunit is used for respectively extracting elements of each sliding window text based on the semantics of each sliding window text to obtain the text elements of each sliding window text;
and the integration subunit is used for integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction.
In any of the above embodiments, the integration subunit is configured to:
updating a previous interactive element based on a text element of a current sliding window text in the text sequence to obtain a current interactive element, and taking a next sliding window text of the current sliding window text in the text sequence as the current sliding window text until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interactive element as the interactive element of the audio interaction.
In any of the above embodiments, the integration subunit is configured to:
determining a first element value and/or a second element value in a text element of the current sliding window text, wherein the last interactive element comprises an element value of an element name corresponding to the first element value, and the last interactive element lacks an element value of an element name corresponding to the second element value;
and replacing the element value consistent with the element name of the first element value in the previous interactive element based on the first element value, and/or supplementing the second element value into the previous interactive element to obtain the current interactive element.
Based on any of the above embodiments, the extracting subunit is configured to:
determining a text sequence increment of a current time period based on a text sequence of a previous time period and a text sequence of the current time period;
and respectively extracting elements of each sliding window text in the text sequence increment based on the semanteme of each sliding window text in the text sequence increment to obtain the text elements of each sliding window text in the text sequence increment.
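A sketch of the text sequence increment computation is given below. Representing each period's text sequence as a list of sliding window strings, and treating any window not seen in the previous period as new, are illustrative assumptions.

```python
# Text-sequence increment sketch: only windows that did not already appear
# in the previous period's sequence are extracted again (list-of-strings
# representation is an assumed simplification).
def sequence_increment(prev_windows, curr_windows):
    prev_set = set(prev_windows)      # windows already extracted last period
    return [w for w in curr_windows if w not in prev_set]

prev = ["win-1", "win-2"]
curr = ["win-1", "win-2", "win-3", "win-4"]
print(sequence_increment(prev, curr))   # ['win-3', 'win-4']
```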
Based on any of the above embodiments, the voice transcription unit 1020 includes:
the transcription subunit is used for carrying out voice transcription on the real-time audio of each party in the current time interval to obtain a transcription text of the current time interval;
and the splicing subunit is used for splicing the transcription text of the current time period after the interactive transcription text of the previous time period to obtain the interactive transcription text of the current time period.
Based on any of the above embodiments, the transcription subunit is configured to:
respectively carrying out voice transcription on the real-time audio of each party in the current time period to obtain role transcription texts of each party;
and splicing the role transcription texts of all the parties according to time sequence based on the time intervals of the role transcription texts of all the parties in the corresponding real-time audio to obtain the transcription texts in the current time period.
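The time-ordered splicing described here can be sketched as follows; representing each role transcription segment as a (start time, role, text) tuple is an assumed simplification.

```python
# Time-ordered splicing of role transcripts; the (start_time, role, text)
# tuple representation is an assumption for illustration.
def splice_by_time(agent_segments, client_segments):
    segments = sorted(agent_segments + client_segments)  # sort by start time
    return " ".join(f"{role}: {text}" for _, role, text in segments)

agent = [(0.0, "agent", "May I ask how much you would like to invest?")]
client = [(3.2, "client", "Fifty thousand.")]
print(splice_by_time(agent, client))
# agent: May I ask how much you would like to invest? client: Fifty thousand.
```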
Based on any of the above embodiments, the apparatus further comprises a confirmation unit configured to:
and sending an abnormal result generated by the element comparison to at least one of the parties, so as to prompt the at least one party to confirm the elements.
Fig. 11 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 11, the electronic device may include: a processor (processor)1110, a communication Interface (Communications Interface)1120, a memory (memory)1130, and a communication bus 1140, wherein the processor 1110, the communication Interface 1120, and the memory 1130 communicate with each other via the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform the element comparison method, the method comprising: determining the audio of each party generated by audio interaction; performing voice transcription on the audio of each party to obtain an interactive transcription text; based on the semantics of the interactive transcription text, performing element extraction on the interactive transcription text to obtain the interactive elements of the audio interaction; and performing element comparison based on the interactive elements.
In addition, the logic instructions in the memory 1130 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to execute the element comparison method provided above, the method comprising: determining the audio of each party generated by audio interaction; performing voice transcription on the audio of each party to obtain an interactive transcription text; based on the semantics of the interactive transcription text, performing element extraction on the interactive transcription text to obtain the interactive elements of the audio interaction; and performing element comparison based on the interactive elements.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the element comparison method provided above, the method comprising: determining the audio of each party generated by audio interaction; performing voice transcription on the audio of each party to obtain an interactive transcription text; based on the semantics of the interactive transcription text, performing element extraction on the interactive transcription text to obtain the interactive elements of the audio interaction; and performing element comparison based on the interactive elements.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. An element comparison method, comprising:
determining the audio of each party generated by audio interaction;
performing voice transcription on the audio of each party to obtain an interactive transcription text;
based on the semantic meaning of the interactive transcription text, performing element extraction on the interactive transcription text to obtain an interactive element of the audio interaction;
and performing element comparison based on the interactive elements.
2. The element comparison method according to claim 1, wherein the extracting elements from the interactive transcription text based on the semantic meaning of the interactive transcription text to obtain the interactive elements of the audio interaction comprises:
performing sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text;
based on the semantics of each sliding window text, respectively extracting elements of each sliding window text to obtain text elements of each sliding window text;
and integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction.
3. The element comparison method according to claim 2, wherein the integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction comprises:
updating a previous interactive element based on a text element of a current sliding window text in the text sequence to obtain a current interactive element, and taking a next sliding window text of the current sliding window text in the text sequence as the current sliding window text until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interactive element as the interactive element of the audio interaction.
4. The method for element comparison according to claim 3, wherein the updating the previous interactive element based on the text element of the current sliding window text in the text sequence to obtain the current interactive element comprises:
determining a first element value and/or a second element value in a text element of the current sliding window text, wherein the last interactive element comprises an element value of an element name corresponding to the first element value, and the last interactive element lacks an element value of an element name corresponding to the second element value;
and replacing the element value consistent with the element name of the first element value in the previous interactive element based on the first element value, and/or supplementing the second element value into the previous interactive element to obtain the current interactive element.
5. The element comparison method according to claim 2, wherein the element extraction is performed on each sliding window text based on the semantic meaning of each sliding window text to obtain the text element of each sliding window text, and the method comprises:
determining a text sequence increment of a current time period based on a text sequence of a previous time period and a text sequence of the current time period;
and respectively extracting elements of each sliding window text in the text sequence increment based on the semanteme of each sliding window text in the text sequence increment to obtain the text elements of each sliding window text in the text sequence increment.
6. The element comparison method according to any one of claims 1 to 5, wherein the performing voice transcription on the audio of each party to obtain an interactive transcription text comprises:
performing voice transcription on the real-time audio of each party in the current time period to obtain a transcribed text of the current time period;
and splicing the transcription texts in the current time period after the interactive transcription texts in the previous time period to obtain the interactive transcription texts in the current time period.
7. The element comparison method according to claim 6, wherein the performing voice transcription on the real-time audio of each party in the current time period to obtain the transcribed text of the current time period comprises:
respectively carrying out voice transcription on the real-time audio of each party in the current time period to obtain role transcription texts of each party;
and splicing the role transcription texts of all the parties according to time sequence based on the time intervals of the role transcription texts of all the parties in the corresponding real-time audio to obtain the transcription texts in the current time period.
8. The element comparison method according to any one of claims 1 to 5, wherein the element comparison based on the interactive elements further comprises:
and sending an abnormal result generated by the element comparison to at least one of the parties, so as to prompt the at least one party to confirm the elements.
9. An element comparison apparatus, comprising:
the audio determining unit is used for determining the audio of each party generated by audio interaction;
the voice transcription unit is used for carrying out voice transcription on the audio of each party to obtain an interactive transcription text;
the element extraction unit is used for extracting elements from the interactive transcription text based on the semantic meaning of the interactive transcription text to obtain the interactive elements of the audio interaction;
and the element comparison unit is used for comparing elements based on the interactive elements.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the element comparison method according to any one of claims 1 to 8.
11. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the element comparison method according to any one of claims 1 to 8.
CN202111055523.2A 2021-09-09 2021-09-09 Element comparison method, device, electronic apparatus, and storage medium Active CN113806505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111055523.2A CN113806505B (en) 2021-09-09 2021-09-09 Element comparison method, device, electronic apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111055523.2A CN113806505B (en) 2021-09-09 2021-09-09 Element comparison method, device, electronic apparatus, and storage medium

Publications (2)

Publication Number Publication Date
CN113806505A true CN113806505A (en) 2021-12-17
CN113806505B CN113806505B (en) 2024-04-16

Family

ID=78894978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111055523.2A Active CN113806505B (en) 2021-09-09 2021-09-09 Element comparison method, device, electronic apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN113806505B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105744291A (en) * 2014-12-09 2016-07-06 北京奇虎科技有限公司 Video data processing method and system, video play equipment and cloud server
WO2017140233A1 (en) * 2016-02-18 2017-08-24 腾讯科技(深圳)有限公司 Text detection method and system, device and storage medium
WO2020077895A1 (en) * 2018-10-16 2020-04-23 深圳壹账通智能科技有限公司 Signing intention determining method and apparatus, computer device, and storage medium
CN112037819A (en) * 2020-09-03 2020-12-04 阳光保险集团股份有限公司 Voice quality inspection method and device based on semantics
US20200394509A1 (en) * 2019-06-14 2020-12-17 International Business Machines Corporation Classification Of Sparsely Labeled Text Documents While Preserving Semantics
CN112700769A (en) * 2020-12-26 2021-04-23 科大讯飞股份有限公司 Semantic understanding method, device, equipment and computer readable storage medium
CN112699231A (en) * 2020-12-25 2021-04-23 科讯嘉联信息技术有限公司 Work order abstract summarizing method based on sliding window correlation calculation and Copy mechanism
CN112735418A (en) * 2021-01-19 2021-04-30 腾讯科技(深圳)有限公司 Voice interaction processing method and device, terminal and storage medium
CN112804400A (en) * 2020-12-31 2021-05-14 中国工商银行股份有限公司 Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN113270114A (en) * 2021-07-19 2021-08-17 北京明略软件系统有限公司 Voice quality inspection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIE YU et al.: "Building search context with sliding window for information seeking", 2011 3rd International Conference on Computer Research and Development, pages 1-4 *
LI Fangfang; GE Bin; MAO Xingliang; TANG Daquan: "Research on keyword extraction for Chinese web pages based on semantic association", Application Research of Computers, no. 01, pages 105-107 *

Also Published As

Publication number Publication date
CN113806505B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
US11615422B2 (en) Automatically suggesting completions of text
CN111046152B (en) Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium
JP2019159309A (en) Method and apparatus for determining speech interaction satisfaction
WO2020086234A1 (en) Machine learning tool for navigating a dialogue flow
KR20160017035A (en) Systems and methods for multi-user multi-lingual communications
CN110493019B (en) Automatic generation method, device, equipment and storage medium of conference summary
US20220138770A1 (en) Method and apparatus for analyzing sales conversation based on voice recognition
US11880666B2 (en) Generating conversation descriptions using neural networks
US20190147029A1 (en) Method and system for generating conversational user interface
CN114547475B (en) Resource recommendation method, device and system
CN111221949A (en) Intelligent return visit method, device and equipment based on reinforcement learning and storage medium
CN110399473B (en) Method and device for determining answers to user questions
CA3064116A1 (en) Systems and methods for performing automated interactive conversation with a user
CN111966805B (en) Method, device, medium and electronic equipment for assisting in realizing session
CN113806505B (en) Element comparison method, device, electronic apparatus, and storage medium
CN111651554A (en) Insurance question-answer method and device based on natural language understanding and processing
CN115564529A (en) Voice navigation control method and device, computer terminal and storage medium
CN109242403B (en) Demand management method and computer equipment
US20230063713A1 (en) Sentence level dialogue summaries using unsupervised machine learning for keyword selection and scoring
US11880665B2 (en) Systems and methods for inserting dialogue into a query response
CN114722171A (en) Multi-turn conversation processing method and device, electronic equipment and storage medium
CN113015002A (en) Processing method and device for anchor video data
CN112367494A (en) AI-based online conference communication method and device and computer equipment
US20220319496A1 (en) Systems and methods for training natural language processing models in a contact center
CN117763097A (en) Question answering method, question answering device, electronic device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant