CN113806505B - Element comparison method, device, electronic apparatus, and storage medium - Google Patents

Publication number
CN113806505B
Authority
CN
China
Prior art keywords
text, transcription, interaction, sliding window, interactive
Legal status
Active
Application number
CN202111055523.2A
Other languages
Chinese (zh)
Other versions
CN113806505A (en)
Inventor
田鹏
何春江
庄纪军
胡加学
赵乾
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111055523.2A
Publication of CN113806505A
Application granted
Publication of CN113806505B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06F 40/35: Discourse or dialogue representation
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an element comparison method, an element comparison device, an electronic apparatus, and a storage medium. The element comparison method comprises the following steps: determining the audio of each party generated by an audio interaction; performing voice transcription on the audio of each party to obtain an interactive transcription text; extracting elements from the interactive transcription text based on its semantics to obtain the interactive elements of the audio interaction; and performing element comparison based on the interactive elements. Because the interactive transcription text is obtained by voice transcription of each party's audio and element extraction is driven by the semantics of that text, the method, device, electronic apparatus, and storage medium generalize well, can satisfy element extraction requirements in a variety of scenarios, and make full use of the context of the audio interaction, ensuring reliable and accurate element extraction in complex scenarios. Element comparison on this basis allows errors in the audio interaction to be discovered in time and prompt blocking reminders to be issued, thereby improving the quality of the audio interaction.

Description

Element comparison method, device, electronic apparatus, and storage medium
Technical Field
The present invention relates to the field of natural language understanding technologies, and in particular, to an element comparison method, an element comparison device, an electronic device, and a storage medium.
Background
In a customer service marketing scenario, the agent typically communicates with the customer by telephone. During the call, the agent needs to confirm transaction information related to the purchased product with the customer and enter that information into the system. If the transaction information entered by the agent is inconsistent with what the customer finally confirmed, the record is wrong and can have adverse consequences. The agent therefore needs to be reminded in time when such errors occur.
At present, element comparison is used to discover agent errors: keywords are matched from the voice transcription texts of the agent and the customer using preset keyword matching rules, and the elements obtained by rule matching are compared with the elements in the transaction information entered by the agent, thereby realizing a consistency check.
However, because language expression is diverse, keywords cannot be exhaustively enumerated; the keyword matching rules therefore generalize poorly, and missed detections and false detections occur frequently in practical applications. Moreover, keyword matching rules cannot model the context of the conversation, so it is difficult to obtain ideal element extraction results in the complex multi-round interactions encountered in practice.
Disclosure of Invention
The invention provides an element comparison method, an element comparison device, an electronic apparatus, and a storage medium, which are used to solve the problems in the prior art that element extraction for element comparison relies on keyword matching rules, generalizes poorly, and cannot adapt to the complex situation of multi-round interaction.
The invention provides an element comparison method, which comprises the following steps:
determining the audio of each party generated by audio interaction;
performing voice transcription on the audio of each party to obtain an interactive transcription text;
based on the semantics of the interactive transcription text, extracting elements from the interactive transcription text to obtain interactive elements of the audio interaction;
and performing element comparison based on the interaction elements.
According to the element comparison method provided by the invention, extracting elements from the interactive transcription text based on the semantics of the interactive transcription text to obtain the interactive elements of the audio interaction comprises the following steps:
carrying out sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text;
based on the semantics of each sliding window text, extracting elements of each sliding window text to obtain text elements of each sliding window text;
and integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction.
According to the element comparison method provided by the invention, integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction comprises the following steps:
updating a previous interaction element based on a text element of a current sliding window text in the text sequence to obtain a current interaction element, and taking a next sliding window text of the current sliding window text in the text sequence as the current sliding window text until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interaction element as the interaction element of the audio interaction.
According to the element comparison method provided by the invention, updating the previous interaction element based on the text element of the current sliding window text in the text sequence to obtain the current interaction element comprises the following steps:
determining a first element value and/or a second element value in the text element of the current sliding window text, wherein the previous interaction element already contains an element value corresponding to the element name of the first element value, and the previous interaction element lacks an element value corresponding to the element name of the second element value;
and replacing, based on the first element value, the element value in the previous interaction element whose element name is consistent with that of the first element value, and/or supplementing the second element value into the previous interaction element, to obtain the current interaction element.
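For illustration, the replace-and-supplement update described above can be sketched as follows, assuming interaction elements are represented as a mapping from element name to element value (the representation and the function name are assumptions of this sketch, not part of the invention):

```python
def update_elements(previous, window_elements):
    """Merge one window's text elements into the running interaction elements.

    A "first element value" (an element name already present in the previous
    interaction element) replaces the old value; a "second element value"
    (an element name not yet present) is supplemented. Returns a new dict;
    `previous` is left unmodified.
    """
    current = dict(previous)
    for name, value in window_elements.items():
        # replace if the name is already present, supplement if it is absent
        current[name] = value
    return current
```

In this representation the two update cases collapse into a single dictionary assignment, which is why a name-to-value mapping is a natural choice for the interaction element set.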
According to the element comparison method provided by the invention, extracting elements from each sliding window text based on the semantics of each sliding window text to obtain the text element of each sliding window text comprises the following steps:
determining a text sequence increment of a current time period based on a text sequence of a previous time period and a text sequence of the current time period;
and extracting elements of each sliding window text in the text sequence increment based on the semantics of each sliding window text in the text sequence increment, so as to obtain text elements of each sliding window text in the text sequence increment.
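A minimal sketch of computing the text sequence increment, assuming the window sequences of the two periods are given as lists of sliding window texts (the function name is illustrative):

```python
def sequence_increment(prev_windows, curr_windows):
    """Return only the sliding window texts that are new in the current period.

    Because the interactive transcription text only grows, the current
    period's window sequence extends the previous one; windows already
    processed in the previous period are skipped so they are not extracted
    again.
    """
    seen = set(prev_windows)
    return [w for w in curr_windows if w not in seen]
```

Element extraction then runs only over the returned increment, which keeps the per-period extraction cost bounded even as the interaction grows long.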
According to the element comparison method provided by the invention, performing voice transcription on the audio of each party to obtain the interactive transcription text comprises the following steps:
performing voice transcription on the real-time audio of each party in the current period to obtain the transcribed text of the current period;
and splicing the transcribed text of the current period after the interactive transcription text of the previous period to obtain the interactive transcription text of the current period.
According to the element comparison method provided by the invention, performing voice transcription on the real-time audio of each party in the current period to obtain the transcribed text of the current period comprises the following steps:
performing voice transcription separately on the real-time audio of each party in the current period to obtain the character transcription text of each party;
and splicing the character transcription texts of the parties in chronological order, based on the time interval of each character transcription text in the corresponding real-time audio, to obtain the transcribed text of the current period.
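The chronological splicing of the parties' character transcription texts can be sketched as follows, assuming each transcribed segment carries its start and end time in the corresponding real-time audio (the segment representation and the role labels are assumptions of this sketch):

```python
def merge_transcripts(agent_segments, customer_segments):
    """Interleave two parties' transcript segments on the shared time axis.

    Each segment is a (start_time_s, end_time_s, text) tuple taken from that
    party's real-time audio; sorting on start time restores the order in
    which the utterances actually occurred in the interaction.
    """
    merged = sorted(
        [("agent", seg) for seg in agent_segments]
        + [("customer", seg) for seg in customer_segments],
        key=lambda item: item[1][0],  # sort on segment start time
    )
    return [f"{role}: {seg[2]}" for role, seg in merged]
```

The same idea extends to more than two parties by tagging each party's segment list with its own role label before sorting.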
According to the element comparison method provided by the invention, after the element comparison is performed based on the interaction elements, the method further comprises the following steps:
and sending an abnormal result generated by element comparison to at least one of the parties to prompt the at least one party to confirm the elements.
The invention also provides an element comparison device, which comprises:
an audio determining unit for determining the audio of each party generated by the audio interaction;
the voice transcription unit is used for performing voice transcription on the audio of each party to obtain the interactive transcription text;
the element extraction unit is used for extracting elements from the interactive transcription text based on the semantics of the interactive transcription text to obtain interactive elements of the audio interaction;
and the element comparison unit is used for comparing the elements based on the interaction elements.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of any one of the element comparison methods described above are realized when the processor executes the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the element comparison methods described above.
The element comparison method, device, electronic apparatus, and storage medium provided by the invention obtain the interactive transcription text by voice transcription of each party's audio and perform element extraction based on the semantics of the interactive transcription text. They therefore generalize well, can satisfy element extraction requirements in a variety of scenarios, and make full use of the context of the audio interaction, ensuring reliable and accurate element extraction in complex scenarios. Element comparison on this basis allows errors in the audio interaction to be discovered in time and prompt blocking reminders to be issued, thereby improving the quality of the audio interaction.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings described below show some embodiments of the invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow diagram of a prior-art element comparison method;
FIG. 2 is a schematic flow chart of an element comparison method provided by the invention;
FIG. 3 is a flow chart of step 230 in the element comparison method provided by the present invention;
FIG. 4 is a flow chart of step 232 in the element comparison method provided by the present invention;
FIG. 5 is a flow chart of step 220 in the element comparison method provided by the present invention;
FIG. 6 is a schematic flow chart of step 221 in the element comparison method provided by the present invention;
FIG. 7 is a second flow chart of the element comparison method provided by the invention;
FIG. 8 is a schematic diagram of the structure of the element extraction model provided by the present invention;
FIG. 9 is a flow chart of the element extraction method provided by the invention;
FIG. 10 is a schematic structural diagram of an element comparison device provided by the present invention;
fig. 11 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In customer service marketing scenarios, the agent typically communicates with the customer by telephone. During the call, the agent needs to confirm transaction information associated with the purchased product, such as purchase amount, age, and interest rate, with the customer, and enter the transaction information into the system. If the transaction information entered by the agent is inconsistent with what the customer finally confirmed, the record is wrong and can have adverse consequences. Therefore, finding problems in time and issuing a prompt blocking reminder when the agent makes a mistake is particularly important for improving the agent's marketing quality and reducing customer complaints.
At present, element comparison is used to discover agent errors: keywords are matched from the voice transcription texts of the agent and the customer using preset keyword matching rules, and the elements obtained by rule matching are compared with the elements in the transaction information entered by the agent, thereby realizing a consistency check.
For example, fig. 1 is a schematic flow chart of a prior-art element comparison method. As shown in fig. 1, during the audio interaction between the agent and the customer, online voice transcription can be performed on the agent's speech and the customer's speech respectively, so as to obtain the voice transcription texts of the two parties, namely the agent text and the customer text.
Thereafter, keyword matching may be performed on the agent text and the customer text respectively, based on preset keyword matching rules, and the matched keywords are taken as the elements extracted from the text.
After the elements are extracted from the text, they can be compared with the elements entered by the agent in an element database or a third-party system. After the element comparison is completed, the results are fed back to the agent in real time; in particular, elements that fail the comparison are fed back for confirmation.
However, because language expression is diverse, keywords cannot be exhaustively enumerated; the keyword matching rules therefore generalize poorly, and missed and false detections occur frequently in practice. In addition, consider the complex situation of multi-round interaction in practical applications. For example, the agent asks "May I confirm that the product you are purchasing is 50,000?" If the customer answers "yes", then "50,000" is the "purchase amount" element to be extracted; if the customer answers "no", then "50,000" is not the "purchase amount" element. Because keyword matching rules cannot model context, it is difficult to obtain ideal element extraction results.
In view of the above problems, an embodiment of the invention provides an element comparison method that can be applied to audio customer service, video customer service, and other scenarios in which multiple parties interact through audio or video and the elements involved in the interaction need to be recorded. FIG. 2 is a schematic flow chart of the element comparison method provided by the invention. As shown in FIG. 2, the method comprises:
at step 210, the audio of each party generated by the audio interaction is determined.
Specifically, the audio interaction may involve two, three, or even more parties. For example, it may be an audio or video customer service scenario with two parties, where the two parties are the agent and the customer, or an audio or video conference scenario with multiple participants; the embodiment of the invention is not limited in this respect.
During the audio interaction, every participating party generates audio. For example, in an audio customer service scenario, the pickup device at the agent end collects the agent's speech in real time, and the pickup device at the customer end collects the customer's speech in real time.
Step 220, performing voice transcription on the audio of each party to obtain the interactive transcription text.
Specifically, after the audio generated by each party of the audio interaction is obtained, real-time voice transcription needs to be performed on it to obtain an interactive transcription text containing the full information of the audio interaction. The interactive transcription text includes the transcribed text of every utterance in the audio interaction; for example, in an audio customer service scenario, it includes the transcription of all speech exchanged between the agent and the customer during the interaction.
When performing voice transcription, the audio of each party may be transcribed separately to obtain each party's transcribed text, and the transcribed texts may then be spliced along their corresponding time axes to obtain an interactive transcription text that reflects the information of all parties in the audio interaction. Further, when the parties' audio is transcribed separately, a voice transcription model matched to each party's role characteristics can be applied, so that transcribed text conforming to each party's manner of expression is obtained; splicing on this basis ensures the reliability and accuracy of the interactive transcription text.
Alternatively, voice activity detection may be applied to each party's audio, the detected active speech segments may be integrated along the time axis, and voice transcription may then be performed on the integrated active speech to obtain the interactive transcription text. Further, when the integrated active speech is transcribed as a whole, the voice transcription model can grasp the overall situation of the audio interaction and the context of each utterance, so transcription is performed better and the reliability and accuracy of the interactive transcription text are improved.
Step 230, extracting elements from the interactive transcription text based on its semantics to obtain the interactive elements of the audio interaction.
Specifically, the conventional element extraction approach, i.e., extraction based on keyword matching rules, is prone to missed and false detections and cannot model the context of the conversation.
Here, element extraction is applied to the interactive transcription text. Compared with prior-art element extraction based on keyword matching rules, which targets only single sentences or even single clauses, the interactive transcription text reflects the global information of the ongoing audio interaction, avoiding the problem that missing context information in complex scenarios affects the accuracy and reliability of element extraction.
Moreover, compared with prior-art extraction based on keyword matching rules, element extraction based on the semantics of the interactive transcription text does not require keyword matching rules built from pre-enumerated keywords, so the problem of poor generalization is avoided and the method applies well to a variety of scenarios. Because the interactive transcription text reflects the complete information of the audio interaction, the semantics extracted from it also cover the context of the interaction, so more accurate and reliable interactive elements can be extracted in complex scenarios.
Further, element extraction based on the semantics of the interactive transcription text can be realized by a pre-trained element extraction model. The element extraction model may itself have semantic extraction capability, first extracting the semantics of the input interactive transcription text and then extracting elements; alternatively, the element extraction model may lack semantic extraction capability, in which case a language model with semantic extraction capability first extracts the semantics of the interactive transcription text, and the extracted semantics are input into the element extraction model so that it can extract elements based on them. For example, semantics-based element extraction can be realized by an entity recognition model: the preset types of interactive elements are treated as entities, the interactive transcription text is treated as the text requiring entity recognition, and the entity recognition model identifies the interactive element entities contained in the interactive transcription text, thereby obtaining the interactive elements.
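Purely as an illustration of the entity-recognition framing, the sketch below mocks the element extraction model with trivial surface patterns; a real system would use a trained model that classifies spans based on semantics, and the element names and patterns here are invented for the example:

```python
import re

# Stand-in for a trained entity recognition model: each preset interactive
# element type maps to a surface pattern. (Illustrative only; the invention's
# approach is semantic, not pattern-based.)
ELEMENT_PATTERNS = {
    "purchase amount": re.compile(r"(\d[\d,]*)\s*yuan"),
    "term": re.compile(r"(\d+)\s*year"),
}

def extract_elements(window_text):
    """Return the interactive element entities recognized in one text."""
    elements = {}
    for name, pattern in ELEMENT_PATTERNS.items():
        match = pattern.search(window_text)
        if match:
            elements[name] = match.group(1)
    return elements
```

The interface matters more than the internals here: whatever model is used, it consumes a span of transcription text and emits named element values that the downstream integration and comparison steps can consume.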
Step 240, element comparison is performed based on the interactive elements.
Specifically, once the interactive elements of the audio interaction are obtained, they can be compared with standard elements in a preset element database, or with the elements recorded in the system of one or more parties to the audio interaction. For example, the interactive elements can be compared with the standard elements in a preset element database; if they are inconsistent, one or more parties may have spoken an incorrect element, and the comparison result can be returned to them for confirmation and correction. For another example, the interactive elements can be compared with the elements recorded in one or more parties' systems; if the comparison finds an inconsistency with the recorded elements, a recording error may have occurred, and the comparison result can be returned to the recording party so that it can confirm and correct the record.
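A minimal sketch of this comparison step, assuming both the extracted interactive elements and the recorded elements are mappings from element name to element value (the function name and representation are illustrative):

```python
def compare_elements(interaction_elements, recorded_elements):
    """Compare extracted interactive elements against recorded elements.

    Returns a list of (element name, extracted value, recorded value)
    mismatches; an empty list means the two records are consistent. Element
    names absent from the recorded side are not flagged, since nothing was
    entered for them yet.
    """
    mismatches = []
    for name, extracted in interaction_elements.items():
        recorded = recorded_elements.get(name)
        if recorded is not None and recorded != extracted:
            mismatches.append((name, extracted, recorded))
    return mismatches
```

Each mismatch in the returned list is exactly the kind of abnormal result that can be sent back to a party for confirmation and correction.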
According to the method provided by the embodiment of the invention, the interactive transcription text is obtained by voice transcription of each party's audio, and element extraction is performed based on the semantics of the interactive transcription text. The method therefore generalizes well, can satisfy element extraction requirements in a variety of scenarios, and makes full use of the context of the audio interaction, ensuring reliable and accurate element extraction in complex scenarios. Element comparison on this basis allows errors in the audio interaction to be discovered in time and prompt blocking reminders to be issued, thereby improving the quality of the audio interaction.
Based on the above embodiment, the audio of each party determined in step 210 may be real-time audio, i.e., audio recorded in real time. The real-time audio covers the period from the beginning of the audio interaction to the current moment and is updated continuously as time passes.
Accordingly, in step 220, the interactive transcription text obtained by voice transcription of each party's real-time audio includes all the transcribed text produced in the audio interaction so far, and it is continuously updated and grows longer over time.
In steps 230 and 240, element extraction can then be performed on the interactive transcription text as it is updated in real time, so that real-time element comparison is realized, possible problems in the audio interaction are discovered in time, and prompt blocking reminders are issued, thereby improving the quality of the audio interaction.
Considering that the text length handled in a single element-extraction pass is generally limited when element extraction is based on the semantics of the interactive transcription text, and in order to ensure the extraction effect, based on the above embodiment, FIG. 3 is a schematic flow diagram of step 230 in the element comparison method provided by the invention. As shown in FIG. 3, step 230 comprises:
and 231, performing sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text.
The interactive transcription text here is understood as the interactive transcription text determined at the current moment, which includes the transcribed text of all speech from the beginning of the audio interaction to the current moment. As time passes, the current moment moves forward and the interactive transcription text is continuously updated and grows longer, so after the audio interaction has lasted for some time, the length of the interactive transcription text is very likely to exceed the text length limit of a single element-extraction pass.
Specifically, considering the text length limit of a single element-extraction pass, the interactive transcription text needs to be processed with a sliding window before element extraction. The sliding window processing divides the interactive transcription text into several sliding window texts whose lengths are within the text length limit, and the sliding window texts are ordered in a text sequence according to the order in which they were produced by the sliding window processing.
In the embodiment of the invention, to ensure that each sliding window text satisfies the text length limit of a single element-extraction pass, the sliding window length must be less than or equal to that limit; for example, if a single pass can handle at most 512 characters, the sliding window length can be set to 500, 512, or 450. In addition, to avoid complete isolation between adjacent sliding window texts, which would harm the extraction effect, the sliding window step should be smaller than the window length, so that adjacent sliding window texts overlap; when elements are extracted from the sliding window texts in batches, context information can still be referenced through the overlapping parts, improving the reliability and accuracy of element extraction. For example, with a window length of 500, the step can be set to 50 or 100. Assuming the total length of the interactive transcription text is 600, the window length is 500, and the step is 50, sliding window processing yields a text sequence of 3 sliding window texts: the first consists of characters 0 to 499 of the interactive transcription text, the second of characters 50 to 549, and the third of characters 100 to 599.
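The sliding window processing in this example can be sketched as follows (the function name is illustrative; the parameter defaults mirror the window length of 500 and step of 50 used above):

```python
def sliding_windows(text, window_len=500, step=50):
    """Split text into overlapping sliding-window chunks.

    Consecutive windows overlap by (window_len - step) characters so that
    context shared across window boundaries is preserved.
    """
    if len(text) <= window_len:
        return [text]
    windows = []
    start = 0
    while start + window_len <= len(text):
        windows.append(text[start:start + window_len])
        start += step
    # cover a trailing remainder shorter than one full step, if any
    if start < len(text) and windows[-1] != text[-window_len:]:
        windows.append(text[-window_len:])
    return windows
```

On a 600-character text with these defaults this yields exactly the 3 windows of the worked example above (characters 0-499, 50-549, and 100-599).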
Step 232, extracting elements from each sliding window text based on the semantics of each sliding window text, so as to obtain text elements of each sliding window text.
Specifically, after the sliding window processing is completed, element extraction may be performed on each sliding window text in batches. Batch element extraction here means that element extraction is performed independently for each individual sliding window text; the extraction for the individual sliding window texts may be performed in parallel or sequentially, and embodiments of the present invention are not limited in this regard.
Element extraction for a single sliding window text may be realized through a pre-trained element extraction model. The element extraction model may itself have the capability of extracting the semantics of the input sliding window text and performing element extraction based on the extracted semantics. Alternatively, the element extraction model may lack this capability, in which case the sliding window text first undergoes semantic extraction through a language model capable of extracting semantics, and the extracted semantics are then input into the element extraction model so that it can perform element extraction based on them. The same element extraction model may be used for every sliding window text.
Step 233, integrating the text elements of each sliding window text to obtain interaction elements of the audio interaction.
Specifically, after the text elements of each sliding window text are obtained, they may be integrated. For example, the text elements of all sliding window texts may be placed in a set and repeated text elements in the set de-duplicated. Alternatively, the text elements may be integrated sequentially, from front to back, in the order of the sliding window texts in the text sequence: if a text element of the current sliding window text does not appear among the text elements of the previous sliding window texts, it is taken as a new interaction element; if it appears among the text elements of the previous sliding window texts and is consistent with them, it may be kept unchanged; and if it appears among the text elements of the previous sliding window texts but differs from them, the similar element among the previous interaction elements may be replaced with the new text element. Embodiments of the invention are not particularly limited in this respect.
According to the method provided by the embodiment of the invention, by performing sliding window processing on the interactive transcription text and performing element extraction on each sliding window text, the degradation caused by directly truncating an overlong text during element extraction is avoided, and the overlapping portions between sliding window texts ensure that contextual information can be referenced during element extraction, thereby improving the reliability and accuracy of element extraction.
Based on any of the above embodiments, step 233 includes:
updating a previous interaction element based on a text element of a current sliding window text in the text sequence to obtain the current interaction element, and taking a next sliding window text of the current sliding window text in the text sequence as the current sliding window text until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interaction element as an interaction element of audio interaction.
Specifically, considering that adjacent sliding window texts in the text sequence obtained by the sliding window processing overlap, the text elements corresponding to different sliding window texts may contain elements extracted repeatedly from the same piece of text, and the elements extracted from that repeated piece may differ between sliding window texts.
Considering that the text elements extracted for the same portion may differ across sliding window texts, the embodiment of the invention integrates the text elements of the sliding window texts based on their order in the text sequence: specifically, text elements of a later sliding window text are applied to update the interaction elements determined from text elements of earlier sliding window texts. The text elements of later sliding window texts are assumed by default to be more reliable, because as the audio interaction progresses, each party acquires more comprehensive information, their ideas and intentions become clearer, and the semantics reflected in the corresponding sliding window text become more accurate. For example, suppose the interactive transcription text is: "Agent: We have a 10,000-yuan product, would you be interested? Customer: Are there any others? Agent: There is also a 50,000-yuan product, which would you like? Customer: Then I would like the 10,000-yuan one." Among the candidate purchase amount elements "10,000", "50,000", and "10,000", the purchase amount finally decided by the customer is the text element "10,000" obtained from the last sliding window text. It can be seen that, for text elements of the same type, those extracted from later sliding window texts are more reliable.
Based on the above, for the case where the text sequence contains a plurality of sliding window texts, the embodiment of the invention updates the existing interaction elements one by one, from front to back, based on the text elements of each sliding window text in the order of the text sequence, thereby obtaining new interaction elements. After the interaction elements have been updated based on the text elements of the last sliding window text in the text sequence, the finally updated interaction elements can be taken as the interaction elements obtained from the audio interaction.
Based on any one of the above embodiments, in step 233, updating the previous interaction element based on the text element of the current sliding window text in the text sequence to obtain the current interaction element includes:
determining a first element value and/or a second element value in a text element of the current sliding window text, wherein the last interaction element comprises an element value corresponding to an element name of the first element value, and the last interaction element lacks an element value corresponding to an element name of the second element value;
and replacing the element value which is consistent with the element name of the first element value in the previous interaction element based on the first element value, and/or supplementing the second element value into the previous interaction element to obtain the current interaction element.
Specifically, a text element or interaction element may be represented in the form "element name-element value", where the element value is the actual value extracted from the text, for example "10,000" and "50,000" are element values, and the element name is the name of the element the value actually represents, for example the element name corresponding to the values "10,000" and "50,000" is "purchase amount".
When updating the previous interaction element with the text elements of the current sliding window text, it must first be determined which element values in the text elements of the current sliding window text have element names already included in the previous interaction element, and which have element names not included in it. An element value of the current sliding window text whose element name is included in the previous interaction element may be recorded as a first element value, and an element value whose element name is not included in the previous interaction element may be recorded as a second element value. In practice, the text elements of the current sliding window text may contain only first element values, only second element values, or both.
For the case where the text elements of the current sliding window text contain a first element value, the element value in the previous interaction element whose element name matches that of the first element value can be replaced directly with the first element value. For example, if the element value corresponding to the "purchase amount" element in the previous interaction element is 30,000 and the first element value is 50,000, the element value corresponding to "purchase amount" in the previous interaction element can be replaced, so that after replacement the element value corresponding to "purchase amount" in the current interaction element is 50,000.
For the case where the text elements of the current sliding window text contain a second element value, the second element value can be added directly to the previous interaction element. For example, if no "product year" element exists in the previous interaction element, the second element value for "product year" can be added to the previous interaction element directly.
For the case that the text element of the current sliding window text contains the first element value and the second element value at the same time, the first element value can be applied to replace the element value which is consistent with the element name of the first element value in the previous interaction element, and the second element value is added into the previous interaction element, so that the current interaction element is obtained.
And then, judging whether a next sliding window text exists after the current sliding window text in the text sequence, if so, taking the next sliding window text as the current sliding window text, taking the current interaction element as the previous interaction element, and re-executing the steps until the current sliding window text is the last sliding window text.
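The update rule for first and second element values can be sketched as below (a hypothetical illustration assuming elements are held as a name-to-value dict; because replacing values under matching names and supplementing missing names together amount to a dict merge, a single `update` covers both cases):

```python
def update_interaction_elements(previous, window_elements):
    """previous / window_elements: dicts mapping element name -> element value.
    First element values (name already present) replace the old value;
    second element values (name absent) are supplemented."""
    current = dict(previous)          # keep the previous interaction element intact
    current.update(window_elements)   # the later sliding window text is trusted more
    return current

def integrate(text_sequence_elements):
    """Fold per-window text elements, front to back, into interaction elements."""
    interaction = {}
    for window_elements in text_sequence_elements:
        interaction = update_interaction_elements(interaction, window_elements)
    return interaction
```

For example, updating {"purchase amount": "30,000"} with {"purchase amount": "50,000", "product year": "1 year"} replaces the first value and supplements the second.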
Based on any of the above embodiments, fig. 4 is a schematic flow chart of step 232 in the element comparison method provided by the present invention, and as shown in fig. 4, step 232 includes:
step 2321, determining a text sequence increment of the current period based on the text sequence of the previous period and the text sequence of the current period;
step 2322, based on the semantics of each sliding window text in the text sequence increment, extracting elements of each sliding window text in the text sequence increment, so as to obtain text elements of each sliding window text in the text sequence increment.
Specifically, during the audio interaction process, the audio of each party is continuously updated, and the length of the interactive transcription text keeps growing as the audio interaction advances. For example, in an audio customer service scenario, the interactive transcription text comprises the transcription of all speech exchanged between the agent and the customer during the interaction: the agent speaks in the first period, and the interactive transcription text at that moment is A1; the customer speaks in the second period, and the interactive transcription text becomes A1, B1; the agent speaks again in the third period, and the interactive transcription text becomes A1, B1, A2; the customer speaks again in the fourth period, and the interactive transcription text becomes A1, B1, A2, B2.
In the interaction process, elements of the interactive transcription text of each period need to be extracted, and the period length can be a preset fixed length or a voice length obtained by voice endpoint detection of the audio.
Considering that the interactive transcription text of the current period is actually the interactive transcription text of the previous period with the latest speech transcription of each party appended, the interactive transcription texts of the previous and current periods may overlap substantially. Although element extraction could be performed directly on the whole interactive transcription text of the current period, doing so would add a great deal of repeated work.
Based on the above, in the embodiment of the invention, after the sliding window processing of the interactive transcription text of the current period is completed and the text sequence of the current period is obtained, the text sequence of the current period can be compared with that of the previous period, so as to determine the sliding window texts newly added in the text sequence of the current period relative to that of the previous period, namely the text sequence increment of the current period.
When element extraction is performed on the interactive transcription text of the current period, element extraction can be performed only on each sliding window text contained in the text sequence increment of the current period, so that text elements of each sliding window text in the text sequence increment can be obtained. Based on this, the text elements of each sliding window text in the text sequence of the current period can be expressed as the text elements of each sliding window text in the text sequence of the previous period and the text elements of each sliding window text in the text sequence increment of the current period, so that step 233 is executed to integrate the text elements of each sliding window text, and thus the interactive elements of the audio interaction of the current period can be obtained.
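The increment computation can be sketched as follows (a hypothetical illustration: it assumes the sliding window processing reproduces earlier windows unchanged, so the shared prefix of the two text sequences is exactly the already-extracted part; any window whose content changed between periods falls into the increment and is simply re-extracted):

```python
def text_sequence_increment(prev_sequence, curr_sequence):
    """Return the sliding window texts of the current period that were not
    already present, in order, in the previous period's text sequence."""
    common = 0
    while (common < len(prev_sequence) and common < len(curr_sequence)
           and prev_sequence[common] == curr_sequence[common]):
        common += 1
    return curr_sequence[common:]
```

Only the windows returned here need to pass through the element extraction model in the current period; earlier windows reuse their cached text elements.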
According to the method provided by the embodiment of the invention, in the text element extraction of each period, element extraction is only carried out on text sequence increment, so that the quality of element extraction is ensured, the calculation amount required by real-time element extraction is reduced, and the repetitive work is avoided.
Based on any of the above embodiments, fig. 5 is a schematic flow chart of step 220 in the element comparison method provided by the present invention, and as shown in fig. 5, step 220 includes:
step 221, performing voice transcription on the real-time audio of each party in the current period to obtain a transcribed text in the current period.
Step 222, splicing the transcribed text of the current period of time and the interactive transcribed text of the previous period of time to obtain the interactive transcribed text of the current period of time.
Specifically, consider that during the audio interaction process the real-time audio of each party is continuously updated; reflected in the real-time speech transcription process, the length of the interactive transcription text also keeps growing as the audio interaction advances.
To ensure the integrity and comprehensiveness of the interaction elements subsequently obtained by element extraction, the interactive transcription text obtained in each period needs to contain all interaction information from the beginning of the interaction up to the end of the current period. Therefore, during the speech transcription of the real-time audio, the transcription text of the current period, obtained by transcribing the real-time audio of the current period, can be spliced after the interactive transcription text of the previous period, thereby obtaining the interactive transcription text of the current period.
For example, when the agent speaks in the first period, the transcription text at that moment is A1 and the interactive transcription text is A1; when the customer speaks in the second period, the transcription text at that moment is B1, and after B1 is spliced after the interactive transcription text A1 of the first period, the interactive transcription text of the second period is A1, B1.
Based on any of the above embodiments, step 221 includes:
respectively carrying out voice transcription on real-time audio of each party in the current period to obtain character transcription text of each party;
and splicing the character transcription texts of all parties in chronological order based on the time intervals the character transcription texts occupy in the corresponding real-time audio, so as to obtain the transcription text of the current period.
Specifically, when the real-time audio is subjected to voice transcription, the real-time audio of each party in the current period can be subjected to voice transcription respectively, so that character transcription texts of each party are obtained, and the character transcription texts reflect transcription texts of corresponding speaking parties. And then, combining time intervals occupied by the character transcription texts of all the parties on the time axes in the corresponding real-time audio, and splicing the character transcription texts of all the parties according to the time interval sequencing in time sequence, so as to obtain the transcription texts reflecting the information of all the parties in the audio interaction in the current period.
For example, fig. 6 is a schematic flow chart of step 221 in the element comparison method provided by the present invention. As shown in fig. 6, after voice endpoint detection is performed on the agent audio of the current period, n segments of character transcription text a1 to an are obtained through speech transcription; after voice endpoint detection is performed on the customer audio, m segments of character transcription text b1 to bm are obtained through speech transcription. On this basis, a1 to an and b1 to bm can be spliced by combining the specific time intervals of a1 to an and of b1 to bm on the time axis, so as to obtain the transcription text of the current period: a1, b1, a2, b2, ….
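The chronological splicing of both parties' character transcription texts can be sketched as below (a minimal illustration assuming each segment carries the start and end times produced by voice endpoint detection; sorting by start time interleaves the two parties' segments as in the a1, b1, a2, b2 example):

```python
def splice_transcripts(agent_segments, customer_segments):
    """Each segment is (start_time, end_time, text). Merge the character
    transcription texts of both parties in chronological order to form
    the transcription text of the current period."""
    merged = sorted(agent_segments + customer_segments, key=lambda seg: seg[0])
    return "".join(text for _, _, text in merged)
```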
Based on any of the above embodiments, after step 240, further includes:
and sending an abnormal result generated by element comparison to at least one of the parties to prompt the at least one party to confirm the elements.
Here, after the element comparison in step 240, the result of the element comparison can be obtained. The result may be consistent or inconsistent: consistent means the element exists on both sides of the comparison, while inconsistent means the element exists on only one side. A case where the element comparison is inconsistent may be regarded as an abnormality, thereby forming the abnormal result of the element comparison, namely the portion where the elements are inconsistent.
After the abnormal result is obtained, it can be sent to at least one of the parties participating in the audio interaction. For example, in an audio customer service scenario, the abnormal result can be sent to the agent, or sent to both the agent and the customer simultaneously, requesting confirmation from both parties so that the problem is quickly blocked.
After that, the interactive elements may be updated and adjusted according to the confirmation result returned by at least one party participating in the audio interaction, so as to implement correction of the problem.
It should be noted that sending the abnormal result may be triggered in real time, for example, each time detection produces an abnormal result its sending may be triggered; or it may be determined by the business process and triggered at a specific link, for example, all abnormal results are sent together after the interaction ends, or the comparison is triggered when the agent clicks the purchase button while helping the user purchase a product.

Based on any of the above embodiments, fig. 7 is a second flow chart of the element comparison method provided by the present invention. As shown in fig. 7, in an audio customer service scenario, the element comparison method may include the following steps:
During the audio interaction between the agent and the customer, online speech transcription can be performed on the agent's speech and the customer's speech respectively, thereby obtaining the speech transcription text of each, namely the agent text and the customer text. Because the speech transcription is performed online in real time, a downstream task is invoked to perform element extraction each time the transcription of a text segment is completed.
After the transcription of a text segment is finished, the newly transcribed text is spliced with the previously transcribed text, thereby obtaining an interactive transcription text reflecting all information since the beginning of the audio interaction. As the audio interaction advances, the interactive transcription text grows longer and longer; once the audio interaction terminates, it stops growing.
Element extraction can be performed on the updated interactive transcription text. Specifically, element extraction can be realized through an element extraction model, which may adopt a model framework of BERT (Bidirectional Encoder Representation from Transformers) + CRF (Conditional Random Field); preferably, the BERT may adopt a 6-layer structure. A specific element extraction model may be as shown in fig. 8. Each word of the interactive transcription text "Agent: May I confirm that you are buying a financial product with an amount of 50,000 yuan? Customer: Yes" may be represented as a corresponding word vector w, thereby forming the word vector sequence of the interactive transcription text, namely w1, w2, …, wt, where wt is the word vector of the t-th word and t is the length of the interactive transcription text. The word vector sequence is input into the BERT to obtain the semantic vector of the interactive transcription text output by the BERT, which may specifically be the sequence of hidden-layer vectors of each word, namely h1, h2, …, ht, where ht is the hidden-layer vector of the t-th word. The semantic vector of the interactive transcription text is then input into the CRF to obtain the element extraction result output by the CRF. The element extraction result may indicate whether each word in the interactive transcription text belongs to an element and, if so, the specific type of the element and the word's position within it, which can be realized through entity labeling schemes such as BIO or BIOES. In fig. 8, B represents the beginning (Begin) of the "purchase amount" element, E represents the end (End) of the "purchase amount" element, and O represents Outside, i.e., the character does not belong to any element.
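Reading element values off per-character labels like those in fig. 8 can be sketched as follows (a hypothetical BIO/BIOES-style decoder; the tag format "B-name" / "I-name" / "E-name" / "S-name" / "O" is an assumption of this sketch, and real label inventories may differ):

```python
def decode_labels(chars, tags):
    """Convert per-character entity labels into (element name, element value)
    pairs. 'B-x' opens element x, 'I-x'/'E-x' extend it, 'O' closes any
    open element, and 'S-x' (BIOES) marks a single-character element."""
    elements, name, value = [], None, ""
    for ch, tag in zip(chars, tags):
        if tag.startswith(("B-", "S-")):
            if name:
                elements.append((name, value))
            name, value = tag[2:], ch
        elif tag.startswith(("I-", "E-")) and name == tag[2:]:
            value += ch
        else:  # 'O' or an inconsistent tag: close any open element
            if name:
                elements.append((name, value))
            name, value = None, ""
    if name:
        elements.append((name, value))
    return elements
```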
In this process, considering that the element extraction model limits the length of the input text, for example the text length of a single element extraction pass cannot exceed 512, when the text length exceeds this limit a sliding window operation is needed to ensure that the text input into the element extraction model meets the length limit. Fig. 9 is a schematic flow chart of the element extraction method provided by the present invention. As shown in fig. 9, before the interactive transcription text is input into the element extraction model, it must be determined whether its length exceeds 512. If so, sliding window processing is performed on the interactive transcription text, and each resulting sliding window text is input into the element extraction model for element extraction; otherwise, the interactive transcription text can be input into the element extraction model directly.
After element extraction is completed, considering that the interactive transcription texts of successive periods keep accumulating, the text elements extracted from the interactive transcription text of each period may contain several different element values under the same element name. In this case, element ordering is needed: when several different element values exist under the same element name, the element value obtained by the most recent element extraction is retained. For example, suppose extraction from A1 yields the element names X1 and X2 with values X1-1 and X2-1; extraction from A1, B1 yields the element names X1, X2, and X3 with values X1-2, X2-2, and X3-1; and extraction from A1, B1, A2 yields the element names X3 and X4 with values X3-2 and X4-1. The finally extracted element names are then X1, X2, X3, and X4, with values X1-2, X2-2, X3-2, and X4-1, where X3 and X4 take the values of the third extraction and X1 and X2 take the values of the second extraction.
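The element ordering rule, keeping the most recently extracted value under each element name, can be sketched as below (a minimal illustration over the A1 / A1, B1 / A1, B1, A2 example; iterating the extraction rounds oldest-first means each later round overwrites matching names):

```python
def order_elements(extraction_rounds):
    """extraction_rounds: list of {element name: element value} dicts, one
    per element extraction round, oldest first. When a name is extracted in
    several rounds, the value from the most recent round is kept."""
    final = {}
    for round_elements in extraction_rounds:
        final.update(round_elements)
    return final
```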
The text elements after element ordering are taken as the interaction elements, and element comparison is performed. The triggering of element comparison can be determined by the business process. For example, for real-time comparison quality inspection, quality inspection is performed each time the agent finishes speaking, so the updated interaction elements need to be compared every time element ordering completes after an update. Alternatively, comparison can be triggered at specific process links, for example the interaction element comparison is triggered when the agent clicks the purchase button while helping the user purchase a product.
After element comparison is completed, the comparison result can be fed back to the agent in real time; in particular, for inconsistent element comparison results, the agent can be asked to confirm.
Based on any of the above embodiments, fig. 10 is a schematic structural diagram of an element comparison device provided by the present invention, and as shown in fig. 10, the element comparison device includes:
an audio determining unit 1010 for determining audio of each party generated by the audio interaction;
the voice transcription unit 1020 is configured to perform voice transcription on the audio of each party to obtain an interactive transcription text;
element extraction unit 1030, configured to extract elements from the interactive transcribed text based on the semantics of the interactive transcribed text, so as to obtain interactive elements of the audio interaction;
element comparison unit 1040 is configured to perform element comparison based on the interaction element.
According to the device provided by the embodiment of the invention, the interactive transcription text is obtained by transcribing the audio of each party, and element extraction is performed based on the semantics of the interactive transcription text. This gives good generalization capability, meets element extraction requirements in various scenarios, fully applies the context of the audio interaction, and can ensure the reliability and accuracy of element extraction in complex scenarios. Element comparison performed on this basis allows errors in the audio interaction to be found in time and quickly blocked with a reminder, thereby improving the quality of the audio interaction.
Based on any of the above embodiments, the element extraction unit 1030 includes:
the sliding window subunit is used for carrying out sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text;
the extraction subunit is used for extracting elements of each sliding window text based on the semantics of each sliding window text to obtain text elements of each sliding window text;
and the integration subunit is used for integrating the text elements of the sliding window texts to obtain the interactive elements of the audio interaction.
Based on any of the above embodiments, the integration subunit is configured to:
updating a previous interaction element based on a text element of a current sliding window text in the text sequence to obtain a current interaction element, and taking a next sliding window text of the current sliding window text in the text sequence as the current sliding window text until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interaction element as the interaction element of the audio interaction.
Based on any of the above embodiments, the integration subunit is configured to:
determining a first element value and/or a second element value in a text element of the current sliding window text, wherein the last interaction element comprises an element value corresponding to an element name of the first element value, and the last interaction element lacks an element value corresponding to an element name of the second element value;
And replacing an element value which is consistent with the element name of the first element value in the last interaction element based on the first element value, and/or supplementing the second element value into the last interaction element to obtain the current interaction element.
Based on any of the above embodiments, the extracting subunit is configured to:
determining a text sequence increment of a current time period based on a text sequence of a previous time period and a text sequence of the current time period;
and extracting elements of each sliding window text in the text sequence increment based on the semantics of each sliding window text in the text sequence increment, so as to obtain text elements of each sliding window text in the text sequence increment.
Based on any of the above embodiments, the voice transcription unit 1020 includes:
the transcription subunit is used for carrying out voice transcription on the real-time audio of each party in the current period to obtain a transcription text in the current period;
and the splicing subunit is used for splicing the transcription text of the current period after the interactive transcription text of the previous period to obtain the interactive transcription text of the current period.
Based on any of the above embodiments, the transcription subunit is configured to:
respectively carrying out voice transcription on the real-time audio of the current time period of each party to obtain character transcription text of each party;
And splicing the character transcription texts of all the parties according to the time sequence based on the time interval of the character transcription texts of all the parties in the corresponding real-time audio, so as to obtain the transcription text of the current time period.
Based on any of the above embodiments, the apparatus further comprises a confirmation unit configured to:
and sending an abnormal result generated by the element comparison to at least one of the parties, to prompt the at least one party to confirm the elements.
Fig. 11 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 11, the electronic device may include: a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communication interface 1120, and the memory 1130 communicate with each other via the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform an element comparison method comprising: determining the audio of each party generated by an audio interaction; performing voice transcription on the audio of the parties to obtain an interactive transcription text; extracting elements from the interactive transcription text based on the semantics of the interactive transcription text, to obtain interaction elements of the audio interaction; and performing element comparison based on the interaction elements.
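The final step of the method, element comparison, can be illustrated by checking the extracted interaction elements against a reference record. A hedged sketch (the reference record and its field names are invented for illustration; the patent does not specify what the interaction elements are compared against):

```python
def compare_elements(interaction_elements: dict, reference_elements: dict) -> dict:
    """Return the abnormal results of element comparison: every element whose
    value extracted from the audio interaction disagrees with, or is missing
    from, the reference record."""
    abnormal = {}
    for name, expected in reference_elements.items():
        actual = interaction_elements.get(name)
        if actual != expected:
            abnormal[name] = {"expected": expected, "actual": actual}
    return abnormal
```

An empty return value means the interaction elements match the reference record; a non-empty one is the abnormal result that could be sent to a party for confirmation.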
Further, the logic instructions in the memory 1130 may be implemented in the form of software functional units and, when sold or used as a standalone product, stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, the software product including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the element comparison method provided above, the method comprising: determining the audio of each party generated by an audio interaction; performing voice transcription on the audio of the parties to obtain an interactive transcription text; extracting elements from the interactive transcription text based on the semantics of the interactive transcription text, to obtain interaction elements of the audio interaction; and performing element comparison based on the interaction elements.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the element comparison method provided above, the method comprising: determining the audio of each party generated by an audio interaction; performing voice transcription on the audio of the parties to obtain an interactive transcription text; extracting elements from the interactive transcription text based on the semantics of the interactive transcription text, to obtain interaction elements of the audio interaction; and performing element comparison based on the interaction elements.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware alone. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, the software product including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. An element comparison method, comprising:
determining the audio of each party generated by audio interaction;
performing voice transcription on the audio of each party to obtain an interactive transcription text, wherein the interactive transcription text comprises the transcription text exchanged by the parties in the audio interaction and reflects the global information of the ongoing audio interaction;
extracting elements from the interactive transcription text based on the semantics of the interactive transcription text, to obtain interaction elements of the audio interaction;
performing element comparison based on the interaction elements;
wherein the extracting elements from the interactive transcription text based on the semantics of the interactive transcription text to obtain the interaction elements of the audio interaction comprises:
performing sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text;
extracting elements from each sliding window text based on the semantics of that sliding window text, to obtain the text elements of each sliding window text;
and integrating the text elements of the sliding window texts to obtain the interaction elements of the audio interaction;
wherein the integrating the text elements of the sliding window texts to obtain the interaction elements of the audio interaction comprises:
updating a previous interaction element based on the text elements of the current sliding window text in the text sequence to obtain a current interaction element, and taking the next sliding window text after the current sliding window text in the text sequence as the current sliding window text, until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interaction element as the interaction element of the audio interaction;
wherein the updating the previous interaction element based on the text elements of the current sliding window text in the text sequence to obtain the current interaction element comprises:
determining a first element value and/or a second element value among the text elements of the current sliding window text, wherein the previous interaction element contains an element value corresponding to the element name of the first element value, and lacks an element value corresponding to the element name of the second element value;
and replacing, based on the first element value, the element value in the previous interaction element whose element name matches that of the first element value, and/or supplementing the second element value into the previous interaction element, to obtain the current interaction element.
2. The element comparison method according to claim 1, wherein the extracting elements from each sliding window text based on the semantics of that sliding window text to obtain the text elements of each sliding window text comprises:
determining a text sequence increment of the current time period based on the text sequence of the previous time period and the text sequence of the current time period;
and extracting elements from each sliding window text in the text sequence increment based on the semantics of that sliding window text, to obtain the text elements of each sliding window text in the text sequence increment.
3. The element comparison method according to claim 1 or 2, wherein the performing voice transcription on the audio of the parties to obtain the interactive transcription text comprises:
performing voice transcription on the real-time audio of each party in the current time period to obtain the transcription text of the current time period;
and splicing the transcription text of the current time period after the interactive transcription text of the previous time period, to obtain the interactive transcription text of the current time period.
4. The element comparison method according to claim 3, wherein the performing voice transcription on the real-time audio of each party in the current time period to obtain the transcription text of the current time period comprises:
respectively performing voice transcription on the real-time audio of each party in the current time period to obtain the character transcription text of each party;
and splicing the character transcription texts of the parties in chronological order, based on the time intervals of the character transcription texts in the corresponding real-time audio, to obtain the transcription text of the current time period.
5. The element comparison method according to claim 1 or 2, further comprising, after the performing element comparison based on the interaction elements:
sending an abnormal result generated by the element comparison to at least one of the parties, to prompt the at least one party to confirm the elements.
6. An element comparison device, comprising:
an audio determining unit for determining the audio of each party generated by the audio interaction;
the voice transcription unit is used for performing voice transcription on the audio of the parties to obtain an interactive transcription text, wherein the interactive transcription text comprises the transcription text exchanged by the parties in the audio interaction and reflects the global information of the ongoing audio interaction;
the element extraction unit is used for extracting elements from the interactive transcription text based on the semantics of the interactive transcription text, to obtain interaction elements of the audio interaction;
the element comparison unit is used for comparing elements based on the interaction elements;
wherein the extracting elements from the interactive transcription text based on the semantics of the interactive transcription text to obtain the interaction elements of the audio interaction comprises:
performing sliding window processing on the interactive transcription text to obtain a text sequence comprising at least one sliding window text;
extracting elements from each sliding window text based on the semantics of that sliding window text, to obtain the text elements of each sliding window text;
and integrating the text elements of the sliding window texts to obtain the interaction elements of the audio interaction;
wherein the integrating the text elements of the sliding window texts to obtain the interaction elements of the audio interaction comprises:
updating a previous interaction element based on the text elements of the current sliding window text in the text sequence to obtain a current interaction element, and taking the next sliding window text after the current sliding window text in the text sequence as the current sliding window text, until the current sliding window text is the last sliding window text in the text sequence;
and taking the final current interaction element as the interaction element of the audio interaction;
wherein the updating the previous interaction element based on the text elements of the current sliding window text in the text sequence to obtain the current interaction element comprises:
determining a first element value and/or a second element value among the text elements of the current sliding window text, wherein the previous interaction element contains an element value corresponding to the element name of the first element value, and lacks an element value corresponding to the element name of the second element value;
and replacing, based on the first element value, the element value in the previous interaction element whose element name matches that of the first element value, and/or supplementing the second element value into the previous interaction element, to obtain the current interaction element.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the element comparison method of any one of claims 1 to 5 when executing the program.
8. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the element comparison method of any one of claims 1 to 5.
CN202111055523.2A 2021-09-09 2021-09-09 Element comparison method, device, electronic apparatus, and storage medium Active CN113806505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111055523.2A CN113806505B (en) 2021-09-09 2021-09-09 Element comparison method, device, electronic apparatus, and storage medium

Publications (2)

Publication Number Publication Date
CN113806505A CN113806505A (en) 2021-12-17
CN113806505B true CN113806505B (en) 2024-04-16

Family

ID=78894978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111055523.2A Active CN113806505B (en) 2021-09-09 2021-09-09 Element comparison method, device, electronic apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN113806505B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105744291A (en) * 2014-12-09 2016-07-06 北京奇虎科技有限公司 Video data processing method and system, video play equipment and cloud server
WO2017140233A1 (en) * 2016-02-18 2017-08-24 腾讯科技(深圳)有限公司 Text detection method and system, device and storage medium
WO2020077895A1 (en) * 2018-10-16 2020-04-23 深圳壹账通智能科技有限公司 Signing intention determining method and apparatus, computer device, and storage medium
CN112037819A (en) * 2020-09-03 2020-12-04 阳光保险集团股份有限公司 Voice quality inspection method and device based on semantics
CN112699231A (en) * 2020-12-25 2021-04-23 科讯嘉联信息技术有限公司 Work order abstract summarizing method based on sliding window correlation calculation and Copy mechanism
CN112700769A (en) * 2020-12-26 2021-04-23 科大讯飞股份有限公司 Semantic understanding method, device, equipment and computer readable storage medium
CN112735418A (en) * 2021-01-19 2021-04-30 腾讯科技(深圳)有限公司 Voice interaction processing method and device, terminal and storage medium
CN112804400A (en) * 2020-12-31 2021-05-14 中国工商银行股份有限公司 Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN113270114A (en) * 2021-07-19 2021-08-17 北京明略软件系统有限公司 Voice quality inspection method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11455527B2 (en) * 2019-06-14 2022-09-27 International Business Machines Corporation Classification of sparsely labeled text documents while preserving semantics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Building search context with sliding window for information seeking; Jie Yu et al.; 2011 3rd International Conference on Computer Research and Development; pp. 1-4 *
Research on a topic term extraction method for Chinese web pages based on semantic association; Li Fangfang, Ge Bin, Mao Xingliang, Tang Daquan; Application Research of Computers, No. 01; pp. 105-107, 123 *


Similar Documents

Publication Publication Date Title
US11615422B2 (en) Automatically suggesting completions of text
CN111046152B (en) Automatic FAQ question-answer pair construction method and device, computer equipment and storage medium
EP1192789B1 (en) A method of developing an interactive system
CN108021934B (en) Method and device for recognizing multiple elements
JP2019159309A (en) Method and apparatus for determining speech interaction satisfaction
CN110543552A (en) Conversation interaction method and device and electronic equipment
CN110069789A (en) The method, system and computer program product that data processing equipment is implemented
US20220068263A1 (en) Method And Apparatus For Extracting Key Information From Conversational Voice Data
CN111339781A (en) Intention recognition method and device, electronic equipment and storage medium
US20220138770A1 (en) Method and apparatus for analyzing sales conversation based on voice recognition
CN110399473B (en) Method and device for determining answers to user questions
US20200250265A1 (en) Generating conversation descriptions using neural networks
CN111221949A (en) Intelligent return visit method, device and equipment based on reinforcement learning and storage medium
CN111597818A (en) Call quality inspection method, call quality inspection device, computer equipment and computer readable storage medium
CN114547475A (en) Resource recommendation method, device and system
CN113486166B (en) Construction method, device and equipment of intelligent customer service robot and storage medium
CN113806505B (en) Element comparison method, device, electronic apparatus, and storage medium
CN111782792A (en) Method and apparatus for information processing
CN116886823A (en) Seat quality inspection method, device, equipment and medium
CN112367494A (en) AI-based online conference communication method and device and computer equipment
US20220319496A1 (en) Systems and methods for training natural language processing models in a contact center
CN109783627B (en) Automatic response method, device, computer equipment and storage medium
CN116976312A (en) Financial business scene recognition method, device, equipment and readable storage medium
CN115222496A (en) Batch transaction result file generation method and device
CN115310431A (en) Insurance sales oriented text error correction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant