CN109427327B - Audio call evaluation method, evaluation device, and computer storage medium - Google Patents


Info

Publication number
CN109427327B
CN109427327B (granted; application CN201710792860.7A)
Authority
CN
China
Prior art keywords
audio
text
recognition
evaluation
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710792860.7A
Other languages
Chinese (zh)
Other versions
CN109427327A (en)
Inventor
厉正吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN201710792860.7A
Publication of CN109427327A
Application granted
Publication of CN109427327B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/01 — Assessment or evaluation of speech recognition systems
    • G10L15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/26 — Speech-to-text systems
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention discloses an audio call evaluation method, an evaluation device, and a computer storage medium. The audio call evaluation method is applied to an evaluation device and comprises the following steps: obtaining a recognition text of the received audio, wherein the recognition text is obtained by recognizing the received audio after the reference audio is transmitted between the sending and receiving ends; and combining the recognition text with the reference text corresponding to the reference audio to obtain an effect evaluation value of the audio call.

Description

Audio call evaluation method, evaluation device, and computer storage medium
Technical Field
The present invention relates to the field of mobile communications, and in particular, to an audio call evaluation method, an evaluation device, and a computer storage medium.
Background
At present, the call quality of audio calls is evaluated by people. On the one hand, evaluation involving humans consumes a large amount of manpower and is inefficient; on the other hand, different people have different sensitivities to audio calls, so different people give different evaluation values to the same call effect. This introduces manual error, and the evaluation result cannot objectively reflect the real call effect.
Disclosure of Invention
In view of the above, embodiments of the present invention are directed to an audio call evaluation method, an evaluation device, and a computer storage medium, which at least partially solve the above problems.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides an audio call evaluation method, applied to an evaluation device, including:
obtaining a recognition text of the received audio, wherein the recognition text is obtained by recognizing the received audio after the reference audio is transmitted between the sending end and the receiving end;
and combining the recognition text with the reference text corresponding to the reference audio to obtain an effect evaluation value of the audio call.
Based on the above scheme, the obtaining an effect evaluation value of an audio call by combining the recognition text and the reference text corresponding to the reference audio includes:
performing sentence segmentation on the recognition text to obtain recognition sentences;
combining the recognition sentences with the reference sentences in the reference text to obtain sentence scores;
and determining the effect evaluation value according to the sentence scores of the recognition sentences in the recognition text.
Based on the above scheme, the obtaining an effect evaluation value of an audio call by combining the recognition text and the reference text corresponding to the reference audio further includes:
performing word segmentation processing on the recognition sentences to obtain recognition words;
and calculating the sentence score by combining the recognition words, the reference words in the reference text and the corresponding word weights.
Based on the above scheme, the obtaining of the recognition text of the received audio includes:
acquiring a recognition text, containing separation characters, corresponding to a received audio that carries a separation signal;
the performing sentence segmentation on the recognition text to obtain recognition sentences includes:
performing sentence segmentation based on the separation characters to obtain the recognition sentences.
Based on the above scheme, the obtaining of the identification text of the received audio includes:
acquiring recognition texts of received audios corresponding to different versions of reference audio carrying the same voice content with different attributes;
the obtaining an effect evaluation value of the audio call by combining the recognition text and the reference text corresponding to the reference audio comprises:
respectively obtaining the evaluation value of the received audio of each version according to the received audio of each version and the reference text corresponding to the reference audio with different attributes;
the final effect evaluation value is obtained based on evaluation values of received audio of different versions of the same speech content.
Based on the above scheme, the method further comprises:
obtaining an evaluation parameter of the reference audio;
the obtaining an effect evaluation value of the audio call by combining the recognition text and the reference text corresponding to the reference audio comprises:
and obtaining the effect evaluation value by utilizing the evaluation parameter in combination with the identification text and the reference text corresponding to the reference audio.
Based on the above scheme, the method further comprises:
receiving a reference audio sent by a sending end to obtain a received audio;
the obtaining of the identification text of the received audio comprises:
the received audio is identified and the identified text is obtained.
In a second aspect, an embodiment of the present invention provides an evaluation apparatus, including:
the acquiring unit is used for acquiring a recognition text of the received audio, wherein the recognition text is obtained by recognizing the received audio that the receiving end receives, corresponding to the reference audio sent by the sending end;
and the evaluation unit is used for combining the identification text and the reference text corresponding to the reference audio to obtain an effect evaluation value of the audio call.
In a third aspect, an embodiment of the present invention provides an evaluation apparatus, including: a transceiver, a memory, a processor, and a computer program stored on the memory and executed by the processor;
the processor, connected to the transceiver and the memory respectively, is configured to implement the audio call evaluation method provided in any one of claims 1 to 7 by executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where a computer program is stored in the computer storage medium; the computer program, when executed, is capable of implementing the audio call evaluation method provided in any one of claims 1 to 7.
The embodiment of the invention provides an audio call evaluation method, evaluation device, and computer storage medium, in which a recognition text of the received audio, obtained after the reference audio is transmitted between the sending and receiving ends, is acquired; the recognition text is then combined with the reference text used to generate the reference audio to obtain an effect evaluation value of the call effect. Thus, no person needs to participate in evaluating the call, evaluation errors caused by human subjective factors are eliminated, purely automatic evaluation between devices can be realized, and evaluation efficiency is improved.
Drawings
Fig. 1 is a schematic flow chart of a first audio call evaluation method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a second audio call evaluation method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an evaluation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another evaluation apparatus provided in the embodiment of the present invention;
fig. 5 is a schematic structural diagram of an evaluation system according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail with reference to the drawings and the specific embodiments of the specification.
As shown in fig. 1, the present embodiment provides an audio call evaluation method, applied to an evaluation device, including:
step S110: obtaining a recognition text of the received audio, wherein the recognition text is obtained by recognizing the received audio after the reference audio is transmitted between the sending end and the receiving end;
step S120: and combining the identification text and the reference text corresponding to the reference audio to obtain an effect evaluation value of the audio call.
In this embodiment, the evaluation device may be a dedicated evaluation device in the evaluation system, or may be a receiving end device itself having an evaluation function. When the evaluation device is a dedicated evaluation device, the evaluation device is a physical device independent of the transmitting end and the receiving end.
In this embodiment, the evaluation device first acquires a recognition text of the received audio; the received audio is an audio signal, which can be converted into corresponding text information by a speech recognition program or the like, and such text information is referred to as a recognition text in this embodiment. The recognition text comprises the various recognized words and/or characters. The words may be recognized words and/or characters of different languages, such as Chinese words, Chinese characters, English words and characters, dialect words and characters, and the like.
In step S120 of this embodiment, the recognition text and the reference text are combined to evaluate the receiving effect of the received audio formed after the reference audio passes through transmission.
For example, the reference audio and the reference text are in one-to-one correspondence, and a recognition result after the reference audio is recognized by a speech recognition program is the reference text.
In step S120 in this embodiment, the recognition text of the received audio may be matched with the reference text, and an effect evaluation parameter positively correlated with the matching degree may be obtained. For example, the higher the matching degree, the smaller the loss or distortion of the reference audio transmitted from the sending end to the receiving end, the better the call effect, and the better the call effect represented by the corresponding effect evaluation parameter. For example, the number of matching characters between the recognition text and the reference text is counted, and the ratio of the number of matching characters to the total number of characters in the reference text is calculated to obtain the effect evaluation value. Of course, the specific implementation is not limited to any of the above.
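The character-match ratio just described can be sketched as follows. This is a minimal illustration, not the patent's actual scoring code; it counts matching characters via a longest-common-subsequence length so that insertions and deletions in the recognized text are tolerated:

```python
def effect_score(recognized: str, reference: str) -> float:
    """Effect evaluation value in [0, 1]: fraction of reference
    characters preserved in the recognized text (LCS-based)."""
    m, n = len(recognized), len(reference)
    # dp[i][j] = LCS length of recognized[:i] and reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if recognized[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 0.0

# A dropped character lowers the score proportionally.
print(effect_score("hello wrld", "hello world"))  # 10/11 ≈ 0.909
```

A higher score corresponds to less loss or distortion in transmission, matching the positive correlation described above.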
Alternatively, as shown in fig. 2, the step S120 may include:
step S121: performing sentence segmentation on the identification text to obtain an identification sentence;
step S122: combining the recognition sentences with the reference sentences in the reference text to obtain sentence scores;
step S123: and determining the effect evaluation value according to the sentence scores of the recognition sentences in the recognition text.
In this embodiment, the recognition text is divided into sentences, and the sentences are scored one by one, so as to obtain the sentence scores.
Different sentences carry different amounts of information, and their degrees of influence on the call quality differ. For example, the effect of losing or misrecognizing a meaningless interjection on the call is obviously different from that of losing a long sentence carrying a large amount of information.
Therefore, in this implementation, the recognition text, which is one whole paragraph, is segmented into individual recognition sentences, and sentence scores are obtained through sentence-by-sentence matching.
Then, using the correspondence between the recognition text and the reference text, the sentence score of each single recognition sentence is obtained; finally, all sentence scores are combined, for example as the mean or median of the sentence scores, or as the mean computed after discarding the highest and lowest sentence scores, as a parameter characterizing the effect score.
The sentence score mean can be used as an evaluation value of the final effect of the audio call.
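The aggregation options just described (mean, median, or a mean after discarding the extremes) can be sketched as below; the per-sentence scores are assumed to be available already, and the particular values are made up for illustration:

```python
from statistics import mean, median

def aggregate_scores(sentence_scores, method="mean", drop_extremes=False):
    """Combine per-sentence scores into one effect evaluation value."""
    scores = sorted(sentence_scores)
    if drop_extremes and len(scores) > 2:
        scores = scores[1:-1]  # discard the highest and lowest scores
    return mean(scores) if method == "mean" else median(scores)

scores = [0.9, 0.95, 0.4, 0.88]
print(aggregate_scores(scores))                      # plain mean
print(aggregate_scores(scores, drop_extremes=True))  # trimmed mean
```

Trimming the extremes makes the final value less sensitive to a single badly recognized sentence.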
In some embodiments, a sentence score based on exact matching may be obtained from the degree of sentence-by-sentence matching between the reference text and the recognition text. In other embodiments, a sentence score based on fuzzy matching may be obtained by fuzzily matching the reference text and the recognition text. In fuzzy matching, if the recognition text merely lacks a word of the reference text, or differs in a word that does not affect the substantive content, the match may still be considered successful; for example, the reference text contains "Nanjing person" and the corresponding recognition words are a slightly different rendering that still denotes a person from Nanjing. In summary, there are many ways to obtain the sentence score, and it is not limited to any of the above.
The amount of information carried by different words within a sentence also differs, and a sentence may include components such as a subject, a predicate, and an object. In this embodiment, the recognition words of different sentence components correspond to different word weights, because the information amounts carried by words of different sentence components differ, and so does the impact if they are lost or misrecognized. In general, the larger the amount of information carried by a recognition word, the larger the influence its corresponding word weight may have on the effect evaluation value.
For example, the word weight of a subject, predicate, or object is usually greater than that of other sentence components.
In some embodiments, the word weight is a weight assigned automatically by the device through splitting of the sentence components, or a word weight obtained by training a learning model on the degree of change in information amount caused by loss or incorrect replacement. In some embodiments, the word weight may be determined from input information received via a human-computer interaction interface. In short, the word weight may be determined in various ways, not limited to any of the above.
In summary, in the embodiment, not only the influence degrees of different types of recognition sentences on the call effect are distinguished, but also the influence degrees of different words in the recognition sentences on the call result are considered, so that the accurate final call evaluation value can be obtained, and the accurate evaluation effect can be given.
The step S122 may include:
performing word segmentation processing on the recognition sentences to obtain recognition words;
and calculating the sentence score by combining the recognition words, the reference words in the reference text and the corresponding word weights.
For example, the recognition sentence is segmented based on a bag-of-words method, semantic recognition, or another approach, so as to obtain the individual recognition words; then the word evaluation score of each recognition word in a recognition sentence is calculated based on its degree of matching with the reference words in the reference text and its word weight, and the word evaluation scores are combined to obtain the sentence score. For example, the sentence score is obtained as the average of the word evaluation scores of the recognition words in the sentence, or as the sum of the products of each recognition word's matching degree and its word weight.
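A minimal sketch of the weighted sentence score follows. The word lists, the weight table, and its values are hypothetical, and a real implementation would obtain the words from a word segmenter rather than a pre-split list:

```python
def sentence_score(recognized_words, reference_words, word_weights):
    """Weighted fraction of reference words preserved in the
    recognized sentence; unlisted words default to weight 1.0."""
    total = sum(word_weights.get(w, 1.0) for w in reference_words)
    hit = sum(word_weights.get(w, 1.0)
              for w in reference_words if w in recognized_words)
    return hit / total if total else 0.0

ref = ["the", "train", "leaves", "at", "nine"]
rec = ["the", "train", "leaves", "nine"]              # "at" was lost in transit
weights = {"train": 3.0, "leaves": 3.0, "nine": 2.0}  # content words weigh more
print(sentence_score(rec, ref, weights))              # 9/10 = 0.9
```

Losing the low-weight word "at" costs only 0.1, whereas losing "train" would cost 0.3, reflecting the differing information content.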
The word weight may be predetermined. It may be set in consideration of the position of the word in the sentence and/or the sentence component it plays, as well as the error of the speech recognition program itself. For example, the reference audio may be recognized multiple times, the probabilities of the speech recognition program correctly and/or incorrectly recognizing the same word may be counted, and at least one of these probabilities may be used as one of the reference factors for setting the word weight.
In this embodiment, the first speech recognition program used to recognize the received audio to obtain the recognition text and the second speech recognition program used to recognize the reference audio to obtain the reference text are preferably of the same type, the same model, or even the very same program, so as to ensure the consistency of the recognition results. In other embodiments, in order to obtain an accurate effect score, the received audio may be recognized using several different versions and/or types of speech recognition programs to obtain multiple recognition texts; the final score is then obtained from the effect evaluation values corresponding to the respective recognition results, e.g., by computing their mean or determining their median.
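Combining the scores from several recognizers could look like the following sketch; the three scores are made-up placeholders for values each recognition program would produce:

```python
from statistics import mean, median

# Hypothetical effect evaluation values obtained by recognizing the same
# received audio with three different speech recognition programs.
per_recognizer_scores = [0.90, 0.86, 0.93]

print(median(per_recognizer_scores))  # robust to one outlying recognizer
print(mean(per_recognizer_scores))
```

The median discounts a single recognizer that performs unusually well or badly on this audio; the mean uses all three equally.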
In summary, there are many ways to obtain the statement score based on identifying words, and not limited to any of the above.
Optionally, the step S110 may include:
acquiring a recognition text, containing separation characters, corresponding to a received audio that carries a separation signal;
the step S121 may include:
and performing sentence segmentation based on the separating characters to obtain the recognition sentence.
In this embodiment, the reference audio is a reference audio provided with a separation signal, and correspondingly, the received audio is a received audio that also carries the separation signal. The separation character corresponds to the separation signal. The separation signal may be a silence signal in the audio, i.e., the audio segment generated at a speech pause in the reference audio. In still other embodiments, the separation signal may also be another specific audio signal, for example a telephone dial tone or an audio signal of a specific frequency; in any case, it is a specific audio signal distinguishable from normal human speech. In some embodiments, the separation signal may be a Dual-Tone Multi-Frequency (DTMF) encoded signal.
Thus, during speech recognition, when a separation signal is encountered in the received audio (indicating that a sentence has been completed), a specific separation character is generated in the recognition text.
Therefore, during evaluation, segmentation into recognition sentences can be realized simply and conveniently based on the separation characters, instead of segmentation based on analysis of the speech itself, which greatly reduces the computation in the evaluation process, simplifies the whole evaluation flow, and improves evaluation efficiency.
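Splitting on separation characters can be sketched as below; the "#" character standing in for the separator that the recognizer emits is an assumption for illustration, not something the patent specifies:

```python
SEPARATOR = "#"  # hypothetical character emitted for each separation signal

def split_into_sentences(recognition_text: str):
    """Split the recognition text at each separation character,
    discarding empty fragments (e.g. from doubled separators)."""
    return [part.strip()
            for part in recognition_text.split(SEPARATOR)
            if part.strip()]

print(split_into_sentences("first sentence#second sentence##third"))
# ['first sentence', 'second sentence', 'third']
```

A plain string split like this is far cheaper than segmenting by analyzing the audio or the semantics of the text.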
Optionally, the step S110 may include: acquiring identification texts of different versions of reference audio of the same voice content with different attributes;
the step S120 may include: respectively obtaining the evaluation value of the received audio of each version according to the received audio of each version and the reference text corresponding to the reference audio with different attributes; the final effect evaluation value is obtained based on evaluation values of received audio of different versions of the same speech content.
The different attributes may be reference audios of the same voice content uttered by speakers of different genders, of different ages, or with different main frequency bands. A frequency band corresponds to a limited frequency range; if more than a predetermined proportion of the frequency content of a reference audio lies within a band, that band is its main frequency band. The predetermined proportion may be 50%, 60%, 70%, 80%, or the like.
In this embodiment, in order to accurately capture the call effect as perceived by different users, different versions of reference audio with different attributes but the same voice content are transmitted repeatedly, so as to obtain received audios of multiple versions and, further, the recognition texts of these versions; each recognition text is then combined with the reference text to obtain a per-version effect evaluation value, and the final effect evaluation value is obtained by integrating the per-version values. The final effect evaluation value here may be the mean or median of the effect evaluation values of the respective versions.
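The per-version aggregation might look like this sketch; the version labels and scores are illustrative only, not values from the patent:

```python
from statistics import mean, median

# Hypothetical evaluation values for different versions of the same speech
# content (e.g. different speaker gender, age, or main frequency band).
version_scores = {"male": 0.91, "female": 0.88, "child": 0.84}

final_mean = mean(version_scores.values())
final_median = median(version_scores.values())
print(final_mean, final_median)
```

Averaging over versions keeps the final value from overfitting the transmission behavior of one particular voice.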
Optionally, the method further comprises:
obtaining an evaluation parameter of the reference audio;
the step S120 may include:
and obtaining the effect evaluation value by utilizing the evaluation parameter in combination with the identification text and the reference text corresponding to the reference audio.
The evaluation parameters here may include: the above sentence weights, word weights, and other information related to calculating the effect evaluation value. In some embodiments, the evaluation parameters may further include an evaluation function; the evaluation function specifies the operators, order of operations, and the like applied to the parameters when calculating the effect evaluation value, all of which affect the effect evaluation value.
In some embodiments, the evaluation parameter may directly be an evaluation model, which may be a learning-machine model trained with reference audio and reference text, and/or model parameters of a neural network, and/or a big-data-trained model such as a regression forest. The evaluation device may build the evaluation model from the model parameters, and the evaluation model may obtain the effect evaluation value through black-box processing, taking at least the recognition text and the reference text as input. Of course, this is merely an example, and the specific implementation is not limited thereto.
In this embodiment, the evaluation device may itself be the receiving device that receives the reference audio to obtain the received audio. In this case, the method further comprises: receiving the reference audio sent by the sending end; and in step S110, the received audio is recognized using a speech recognition program, and the recognition text is generated by the device itself.
In some embodiments, the method further comprises:
negotiating with the sending end the sending parameters of the reference audio to be sent; wherein the sending parameters may include one or more of: an audio identifier of the reference audio, the sending version, the sending time, and the like; the receiving the reference audio sent by the sending end may include: receiving the reference audio based on the sending parameters to obtain the received audio; and/or, the method further comprises: determining the reference audio and/or the reference text according to the audio identifier.
In a word, if the evaluation device is directly a receiving device for receiving the reference audio, the integration of receiving and evaluating the reference audio is realized, and the method has the characteristics of simple realization, less related devices and the like.
As shown in fig. 3, the present embodiment provides an evaluation apparatus including:
an obtaining unit 110, configured to obtain a recognition text of a received audio, where the recognition text is obtained by recognizing the received audio that the receiving end receives, corresponding to the reference audio sent by the sending end;
the evaluation unit 120 is configured to obtain an effect evaluation value of the audio call by combining the recognition text and the reference text corresponding to the reference audio.
In this embodiment, the obtaining unit 110 may correspond to a communication interface that obtains the recognition result from the receiving end; or the obtaining unit 110 may receive and record the received audio and recognize it using a speech recognition program to obtain the recognition result; or, when the evaluation device is itself the receiving end, a processor in the evaluation device recognizes the received audio by running the speech recognition program and generates the recognition text itself.
The evaluation unit 120 may correspond to a processor; the processor may be a Central Processing Unit (CPU), an Application Processor (AP), a Microcontroller Unit (MCU), a Digital Signal Processor (DSP), a Programmable Logic Controller (PLC), an Application-Specific Integrated Circuit (ASIC), or another processing structure capable of information processing and/or computation.
The evaluation unit 120 combines the recognition text and the reference text corresponding to the reference audio to obtain an effect evaluation reflecting the amount of information the audio loses during transmission.
Optionally, the evaluation unit 120 is configured to perform sentence segmentation on the recognition text to obtain a recognition sentence; combining the recognition sentences with the reference sentences in the reference text to obtain sentence scores; and determining the effect evaluation value according to the sentence scores of the recognition sentences in the recognition text.
In the present embodiment, when performing evaluation, the recognition text is divided into recognition sentences, and the calculation of the effect evaluation value is performed in units of recognition sentences.
Optionally, the evaluation unit 120 is configured to perform word segmentation processing on the recognition sentences to obtain recognition words; and calculating the sentence score by combining the recognition words, the reference words in the reference text and the corresponding word weights.
In this embodiment, to further obtain an accurate effect evaluation value, the recognition sentence is subjected to word segmentation, being further split into individual recognition words; a recognition word may include one character or multiple characters, and if a recognition word includes multiple characters, those characters are continuously distributed in the recognition sentence. The sentence score of each recognition sentence is then calculated in combination with the word weights of the recognition words it contains. In some implementations, only some of the recognition sentences are scored, or only some of them are split into recognition words; for example, sentences consisting only of an interjection need not be split, reducing unnecessary operations. The sentence scores are then obtained, and finally combined to yield the evaluation value of the call effect of the received audio corresponding to the whole transmitted reference audio.
Optionally, the obtaining unit 110 is specifically configured to obtain a recognition text that carries a separation signal and includes separation characters and corresponds to the received audio; the recognition unit is specifically configured to perform sentence segmentation based on the separation character to obtain the recognition sentence.
Optionally, the obtaining unit 110 is specifically configured to acquire recognition texts corresponding to received audio of different versions of reference audio that carry the same voice content with different attributes; the evaluation unit is specifically configured to obtain an evaluation value for the received audio of each version according to the received audio of that version and the reference text corresponding to the reference audio of the different attributes, and to obtain the final effect evaluation value based on the evaluation values of the received audio of the different versions of the same voice content.
Optionally, the evaluation device further comprises:
the evaluation parameter unit is configured to obtain an evaluation parameter of the reference audio. It may correspond to a communication interface that receives the evaluation parameter from another device, or it may generate the evaluation parameter itself by processing the reference audio and the reference text.
The evaluation unit 120 is specifically configured to obtain the effect evaluation value by using the evaluation parameter in combination with the recognition text and the reference text corresponding to the reference audio.
In some embodiments, the evaluation device further comprises:
the receiving unit is used for receiving the reference audio sent by the sending end so as to obtain the received audio; the receiving unit may correspond to a receiving antenna for receiving the reference audio. The obtaining unit 110 is specifically configured to identify the received audio and obtain the identification text.
As shown in fig. 4, the present embodiment further provides an evaluation device, which includes a transceiver 230, a memory 210, a processor 220, and a computer program 240 stored in the memory 210 and executed by the processor 220;
the processor 220 is connected to the memory 210 and the transceiver 230, respectively, and is configured to execute an audio call evaluation method provided by any one or more of the above technical solutions by executing the computer program 240, for example, the audio call effect evaluation method shown in fig. 1 and/or fig. 2.
In this embodiment, the transceiver 230 may correspond to a network interface, which may be a cable interface and may be used for data interaction with other network elements.
The memory 210 may include various types of storage media that can be used for data storage. In this embodiment, the memory 210 includes an at least partially non-volatile storage medium, which can be used to store the computer program 240.
The processor 220 may include a central processing unit, a microprocessor, a digital signal processor, an application-specific integrated circuit, a programmable array, or the like, and may be used to implement the aforementioned audio call evaluation method through execution of the computer program 240.
In this embodiment, the processor 220 may be connected to the transceiver 230 and the memory 210 via an intra-device bus such as an integrated circuit bus.
An embodiment of the present invention further provides a computer storage medium, where a computer program is stored, and after the computer program is executed, the method for evaluating an audio call effect according to one or more technical solutions applied to an evaluation device described above can be implemented, for example, the method for evaluating an audio call effect as shown in fig. 1 and/or fig. 2.
The computer storage medium provided by the embodiment of the invention comprises: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. Alternatively, the computer storage medium may be a non-transitory storage medium. The non-transitory storage medium herein may also be referred to as a non-volatile storage medium.
Several specific examples are provided below in connection with any of the embodiments described above:
example 1:
the audio call evaluation method provided by the present example includes: corpus preparation, test execution, and quality evaluation.
Corpus preparation (including audio and text corresponding to audio) may include the following steps:
Step p1. Select a speech recognition program, i.e., a program that converts speech input into text output, to be used for performing speech recognition.
Step p2. Prepare a piece of audio (referred to as the reference audio) and the corresponding text (referred to as the reference text). The reference audio may be, but is not limited to, a news report, and may include audio of different versions, for example male-voice, female-voice, child-voice, official-language, and dialect versions. At its normal playing speed, the reference audio may last a predetermined duration, for example about 2 minutes.
Step p3. Use the speech recognition program to recognize the reference audio; if the recognition result is inconsistent with the reference text, return to step p2 and reselect the reference audio.
Step p4. Calibrate the reference text, which specifically includes steps p41 to p45; the calibration may be manual calibration or equipment calibration.
Step p41. Perform word segmentation on the reference text, splitting it into a plurality of words;
Step p42. Assign a word weight to each word. The word weight w is a non-negative real number in the interval [0, 1]; the larger the value, the more critical the word's status in the sentence, and the greater the impact on understanding of the whole sentence if an error occurs. Conversely, the smaller the value, the less important the word's position in the sentence, and the less noticeable the impact on understanding of the whole sentence if an error occurs. Assuming that the raw weights of the words in a certain sentence are wc1′, wc2′, …, wcn′, the normalized weights are calculated as follows:

wci = wci′ / (wc1′ + … + wcn′).
In this example, the value of a word weight may also be correlated with the error probability arising during generation of the reference audio and the reference text. Since the speech recognition program itself also makes errors, the word weight in this example is set by considering not only the position and role of the word in the sentence but also the errors of the speech recognition program itself. Specifically, the reference audio may be recognized multiple times, the probability that the speech recognition program recognizes the same word correctly and/or incorrectly is counted, and at least one of the correct probability and the error probability is used as one of the reference factors for setting the word weight.
Step p43. Assign a weight to each sentence. The sentence weight w is a non-negative real number in the interval [0, 1]; the larger the value, the more critical the sentence's status in the whole corpus, and the greater the impact on understanding of the whole corpus if an error occurs. Conversely, the smaller the value, the less important the sentence's position in the whole corpus, and the less noticeable the impact on understanding of the whole corpus if an error occurs. Assuming that the raw weights of the sentences in a corpus are ws1′, ws2′, …, wsn′, the normalized weights are calculated as follows:

wsi = wsi′ / (ws1′ + … + wsn′)
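Both normalization steps (word weights within a sentence in step p42, and sentence weights within the corpus in step p43) have the same form and can be sketched as follows. The function name is illustrative, and the sample raw weights reuse the values that appear in Example 2 below:

```python
def normalize(raw_weights):
    """Normalize raw weights w1', ..., wn' so that wi = wi' / (w1' + ... + wn')."""
    total = sum(raw_weights)
    return [w / total for w in raw_weights]

# Raw word weights of the first example sentence (see Example 2):
word_weights = normalize([0.9, 0.9, 0.9, 0.1, 0.9, 0.5])
print([round(w, 7) for w in word_weights])
# → [0.2142857, 0.2142857, 0.2142857, 0.0238095, 0.2142857, 0.1190476]
```

Because the normalized weights sum to 1, the per-sentence and total scores computed later remain within [0, 1].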
Step p5. Insert a separation signal at the end of each sentence of the original reference audio to obtain a test signal (TEST_SIGNAL). The separation signal may be a pause of a certain length, or a Dual Tone Multi-Frequency (DTMF) code. When DTMF is used, a different DTMF signal combination may be used at the end of each sentence.
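Where DTMF codes serve as the separation signal, each marker is simply two superimposed sine tones, so a marker can be synthesized directly. A minimal sketch, in which the sample rate, duration, and amplitudes are illustrative assumptions (the frequency pairs follow the standard DTMF keypad):

```python
import math

# Standard DTMF (low, high) frequency pairs for a few keys, in Hz.
DTMF_FREQS = {"1": (697, 1209), "2": (697, 1336), "3": (697, 1477)}

def dtmf_marker(key, duration_s=0.2, sample_rate=8000):
    """Generate PCM samples (floats in [-1, 1]) for one DTMF separation tone."""
    lo, hi = DTMF_FREQS[key]
    n = int(duration_s * sample_rate)
    return [0.5 * math.sin(2 * math.pi * lo * t / sample_rate)
            + 0.5 * math.sin(2 * math.pi * hi * t / sample_rate)
            for t in range(n)]

# As the text suggests, a different key can mark the end of each sentence,
# which lets the receiving side tell sentence boundaries apart.
marker = dtmf_marker("1")
```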
The test execution may include the steps of:
Step t1. The terminal under test on the calling side enters a call state, for example, a state of a call request being answered, or an answering state;
Step t2. Send the test signal (TEST_SIGNAL) from the other side of the call; the test signal here is a radio signal into which the reference audio has been converted.

Step t3. Record the call audio output by the terminal under test, that is, record the output signal of the terminal under test, which in this example may be denoted as the damaged signal (DAMAGED_SIGNAL). The damaged signal here may be the output signal of the received audio.

Step t4. End the call; the terminal under test exits the call state.
The quality assessment may comprise the steps of:
Step a1. Divide the damaged signal (DAMAGED_SIGNAL) into sentences according to the separation signals it contains;

Step a2. Call the speech recognition program for each sentence to recognize it as text, and determine the correspondence between each recognized sentence and each sentence of the original text according to the separation signals;

Step a3. Calculate a SCORE for each sentence recognized from the damaged signal. Denote a sentence recognized from the damaged signal as s′, and the sentence in the reference text corresponding to it as s.
For convenience of description, assume that the words into which sentence s was divided in step p4 are, in order, w1, w2, …, wn.

Compare sentences s′ and s to find the text segments that are completely identical between them. The SCORE′ of each such text segment is the sum of the weights of the words it contains. The maximum of these segment scores is the final SCORE of sentence s′.
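The segment-matching rule of step a3 can be sketched as follows: every contiguous run of reference words that also occurs contiguously in the recognized sentence is a candidate segment, each candidate scores the sum of its word weights, and the sentence score is the maximum candidate score. The word lists and weights below reuse the figures from Example 2 and are purely illustrative:

```python
def sentence_score(ref_words, weights, rec_words):
    """SCORE of s': maximum, over contiguous segments of s that also occur
    contiguously in s', of the summed word weights of the segment."""
    best = 0.0
    for i in range(len(ref_words)):
        for j in range(i + 1, len(ref_words) + 1):
            seg = ref_words[i:j]
            # Does this reference segment occur contiguously in the recognition?
            if any(rec_words[k:k + len(seg)] == seg
                   for k in range(len(rec_words) - len(seg) + 1)):
                best = max(best, sum(weights[i:j]))
    return best

ref = ["Nanjing", "Changjiang", "Bridge", "is", "landmark", "building"]
w = [0.2142857, 0.2142857, 0.2142857, 0.0238095, 0.2142857, 0.1190476]
# Recognition with "Nanjing" displaced, as in the worked example of Example 2:
rec = ["Changjiang", "Bridge", "is", "landmark", "building", "Nanjing"]
print(round(sentence_score(ref, w, rec), 6))  # → 0.785714
```

The two matching segments are "Nanjing" (0.2142857) and "Changjiang Bridge is landmark building" (0.785714); the larger one is taken as the sentence SCORE.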
Step a4. Calculate the total score TOTAL_SCORE of the whole corpus as follows:

TOTAL_SCORE = Σ_{i=1}^{n} wsi × SCOREi

where n is the total number of sentences in the whole corpus, SCOREi is the sentence score of the ith sentence, and wsi is the normalized weight of the ith sentence.
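The total is then just a weighted sum over sentences. A one-line sketch; the sentence weights 0.3 and 0.7 come from Example 2, while the second sentence score is an assumed value for illustration:

```python
def total_score(sentence_scores, sentence_weights):
    """TOTAL_SCORE = sum over i of ws_i * SCORE_i for the n corpus sentences."""
    return sum(ws * sc for sc, ws in zip(sentence_scores, sentence_weights))

# Sentence 1 scored 0.785714 in the worked example; assume sentence 2 scored 1.0.
print(round(total_score([0.785714, 1.0], [0.3, 0.7]), 6))  # → 0.935714
```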
Step a5. Convert the total score TOTAL_SCORE into a semantic evaluation score MSS through a function f:
MSS=f(TOTAL_SCORE)
The function f here has the following characteristics: the resulting MSS is no less than a minimum MSS_MIN and no greater than a maximum MSS_MAX, and the resulting MSS score increases monotonically with TOTAL_SCORE. MSS_MIN and MSS_MAX are predefined constants.
Different speech recognition programs yield different MSS scores. In this example, multiple speech recognition programs may be used, the MSS calculated for each, and the scores averaged; the average score is used as the final score of the voice call.
Example 2:
Fig. 5 shows a complete audio call evaluation system, which includes a reference audio transmitting end, a reference audio receiving end, and an evaluation end. The transmitting end transmits the prepared reference signal of the reference audio to the device under test. An acquisition module acquires, through an acquisition interface, the audio data that was received by the device under test and damaged by channel transmission; a recognition module recognizes the voice data and converts it into a recognition text; and an evaluation module takes the recognition text output by the recognition module and the reference text corresponding to the reference audio as inputs to calculate the MSS score.
Taking the test of handheld mobile stations as an example, a voice call is first established between two mobile stations (1 and 2). The reference signal transmitting end feeds the reference audio signal to the mobile station under test 1; the evaluation end collects data from the mobile station 2 and calculates the MSS score. The MSS score characterizes the quality of the end-to-end voice communication from the mobile station under test 1 to the mobile station under test 2.
A news broadcast of about two minutes is used as a reference audio that includes a predetermined number of words, for example, about 500 words.
Without loss of generality, two sentences ("The Nanjing Changjiang River Bridge is a landmark building" and "The bridge is currently under maintenance") are taken as the reference audio; the manual calibration of the reference text is detailed in steps p1041 to p1043.
Step p1041, performing word segmentation on the reference text, wherein words are separated by "/" as follows:
"Nanjing/Yangtze river/bridge/Ye/symbolic/architectural".
Current/bridge/ongoing/maintenance.
Here, "/" may correspond to an optional separation character corresponding to a separation signal.
Step p1042. Assign a word weight to each word. For simplicity, only three weight values, 0.1, 0.5, and 0.9, are used: if failing to hear the word, or mishearing it, has no influence on understanding of the whole sentence, its weight is set to 0.1; if failing to hear the word, or mishearing it, has a great influence on understanding of the whole sentence (the sentence cannot be understood, or its meaning is reversed), its weight is set to 0.9; all other cases take a weight of 0.5. By this method, the weight of each word can be chosen as follows:

Nanjing (0.9) / Changjiang River (0.9) / Bridge (0.9) / is (0.1) / landmark (0.9) / building (0.5).

Currently (0.9) / the bridge (0.9) / is undergoing (0.5) / maintenance (0.9).
The normalized weights are as follows:
Nanjing (0.2142857) / Changjiang River (0.2142857) / Bridge (0.2142857) / is (0.0238095) / landmark (0.2142857) / building (0.1190476).

Currently (0.28125) / the bridge (0.28125) / is undergoing (0.15625) / maintenance (0.28125).
Step p1043. Assign a weight to each sentence; the normalized weights are calculated as follows:
first sentence weight: 0.3
Second sentence weight: 0.7.
The calculation method of step a3 is illustrated as follows. Assume the reference text is:

Nanjing (0.2142857) / Changjiang River (0.2142857) / Bridge (0.2142857) / is (0.0238095) / landmark (0.2142857) / building (0.1190476).

The received signal is recognized as: "The Changjiang River Bridge in Nanjing is a landmark building."

The recognition text shares two identical text segments with the reference text: "Nanjing" and "Changjiang River Bridge is a landmark building", with scores of 0.2142857 and 0.785714, respectively. The final SCORE of this sentence is 0.785714.
The total score TOTAL_SCORE is converted into an MSS score by the function:

MSS = (MSS_MAX - MSS_MIN + 1.0) × TOTAL_SCORE + MSS_MIN - 1.0, where MSS_MIN = 1.0 and MSS_MAX = 5.0.
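As a sketch, with MSS_MIN = 1.0 and MSS_MAX = 5.0 the linear conversion above reduces to MSS = 5 × TOTAL_SCORE, which is monotonically increasing and maps a perfect corpus score to the maximum MSS:

```python
MSS_MIN, MSS_MAX = 1.0, 5.0

def mss(total_score):
    """Linear, monotonically increasing map from TOTAL_SCORE to an MSS score."""
    return (MSS_MAX - MSS_MIN + 1.0) * total_score + MSS_MIN - 1.0

print(mss(1.0))  # → 5.0
print(mss(0.5))  # → 2.5
```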
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may be separately used as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (9)

1. An audio call effect evaluation method, applied to an evaluation device, comprising:
acquiring an identification text which is corresponding to a receiving audio carrying a separation signal and contains separation characters, wherein the identification text is as follows: identifying the received audio after the reference audio is transmitted at the transmitting and receiving ends; the reference audio is a reference audio provided with the separation signal;
and combining the identification text and the reference text corresponding to the reference audio to obtain an effect evaluation value of the audio call.
2. The method of claim 1,
the obtaining an effect evaluation value of the audio call by combining the recognition text and the reference text corresponding to the reference audio comprises:
performing sentence segmentation on the recognition text based on the separation characters to obtain a recognition sentence;
combining the recognition sentences with the reference sentences in the reference text to obtain sentence scores;
and determining the effect evaluation value according to the sentence scores of the recognition sentences in the recognition text.
3. The method of claim 2,
the combining the recognition sentences with the reference sentences in the reference text to obtain sentence scores further comprises:
performing word segmentation processing on the recognition sentences to obtain recognition words;
and calculating the sentence score by combining the recognition words, the reference words in the reference text and the corresponding word weights.
4. The method of claim 1, 2 or 3,
acquiring the recognition text, including: acquiring identification texts of different versions of reference audio of the same voice content with different attributes;
the obtaining an effect evaluation value of the audio call by combining the recognition text and the reference text corresponding to the reference audio comprises:
respectively obtaining the evaluation value of the received audio of each version according to the received audio of each version and the reference text corresponding to the reference audio with different attributes;
the final effect evaluation value is obtained based on evaluation values of received audio of different versions of the same speech content.
5. A method according to claim 1, 2 or 3, characterized in that the method further comprises:
obtaining an evaluation parameter of the reference audio;
the obtaining an effect evaluation value of the audio call by combining the recognition text and the reference text corresponding to the reference audio comprises:
and obtaining the effect evaluation value by utilizing the evaluation parameter in combination with the identification text and the reference text corresponding to the reference audio.
6. A method according to claim 1, 2 or 3, characterized in that the method further comprises:
receiving a reference audio sent by a sending end to obtain a received audio;
the acquiring of the identification text containing the separation characters corresponding to the received audio carrying the separation signal includes:
and identifying the received audio carrying the separation signal and obtaining the identification text containing the separation characters.
7. An evaluation device, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an identification text which is corresponding to a receiving audio carrying a separation signal and contains separation characters, and the identification text is as follows: identifying that a receiving end receives a received audio corresponding to a reference audio sent by a sending end; the reference audio is a reference audio provided with the separation signal;
and the evaluation unit is used for combining the identification text and the reference text corresponding to the reference audio to obtain an effect evaluation value of the audio call.
8. An evaluation device, comprising: a transceiver, a memory, a processor, and a computer program stored on the memory and executed by the processor;
the processor, connected to the transceiver and the memory respectively, is configured to implement the method for evaluating an effect of an audio call provided in any one of claims 1 to 6 by executing the computer program.
9. A computer storage medium storing a computer program; the computer program, when executed, enables the method of assessing an effect of an audio call as provided in any one of claims 1 to 6.
CN201710792860.7A 2017-09-05 2017-09-05 Audio call evaluation method, evaluation device, and computer storage medium Active CN109427327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710792860.7A CN109427327B (en) 2017-09-05 2017-09-05 Audio call evaluation method, evaluation device, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710792860.7A CN109427327B (en) 2017-09-05 2017-09-05 Audio call evaluation method, evaluation device, and computer storage medium

Publications (2)

Publication Number Publication Date
CN109427327A CN109427327A (en) 2019-03-05
CN109427327B true CN109427327B (en) 2022-03-08

Family

ID=65514263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710792860.7A Active CN109427327B (en) 2017-09-05 2017-09-05 Audio call evaluation method, evaluation device, and computer storage medium

Country Status (1)

Country Link
CN (1) CN109427327B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114258069B (en) * 2021-12-28 2023-06-27 北京东土拓明科技有限公司 Voice call quality evaluation method, device, computing equipment and storage medium
CN114898733A (en) * 2022-05-06 2022-08-12 深圳妙月科技有限公司 AI voice data analysis processing method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102006017051A1 (en) * 2006-04-11 2007-10-18 Vodafone Holding Gmbh System and method for quality measurement of a conversation
CN101727903A (en) * 2008-10-29 2010-06-09 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN104361895A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN105261362A (en) * 2015-09-07 2016-01-20 科大讯飞股份有限公司 Conversation voice monitoring method and system
CN105282347A (en) * 2014-07-22 2016-01-27 中国移动通信集团公司 Method and device for evaluating voice quality


Also Published As

Publication number Publication date
CN109427327A (en) 2019-03-05

Similar Documents

Publication Publication Date Title
CN107818797B (en) Voice quality evaluation method, device and system
KR101683944B1 (en) Speech translation system, control apparatus and control method
CN104731767B (en) Exchange assisting system and exchange support method
US4707858A (en) Utilizing word-to-digital conversion
KR20190098110A (en) Intelligent Presentation Method
US20200012724A1 (en) Bidirectional speech translation system, bidirectional speech translation method and program
US8521525B2 (en) Communication control apparatus, communication control method, and non-transitory computer-readable medium storing a communication control program for converting sound data into text data
CN109902957B (en) Data processing method and device
CN109036372B (en) Voice broadcasting method, device and system
CN112466302B (en) Voice interaction method and device, electronic equipment and storage medium
CN109147765A (en) Audio quality comprehensive evaluating method and system
CN103886863A (en) Audio processing device and audio processing method
CN109326305B (en) Method and system for batch testing of speech recognition and text synthesis
CN109767763B (en) Method and device for determining user-defined awakening words
JPH0654364A (en) Apparatus for comparison of subjective dialogue in mobile telephone system
CN109427327B (en) Audio call evaluation method, evaluation device, and computer storage medium
CN110119514A (en) The instant translation method of information, device and system
CN110808050B (en) Speech recognition method and intelligent device
CN108595406A (en) User state reminding method and device, electronic equipment and storage medium
CN110853621A (en) Voice smoothing method and device, electronic equipment and computer storage medium
CN110970030A (en) Voice recognition conversion method and system
CN114283820A (en) Multi-character voice interaction method, electronic equipment and storage medium
CN107886940B (en) Voice translation processing method and device
CN111128127A (en) Voice recognition processing method and device
CN103474075A (en) Method and system for sending voice signals, and method and system for receiving voice signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant