WO2015169064A1 - Network voice quality assessment method, apparatus and system - Google Patents

Network voice quality assessment method, apparatus and system

Info

Publication number
WO2015169064A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
speech
lost
voice
frames
Prior art date
Application number
PCT/CN2014/089401
Other languages
English (en)
French (fr)
Inventor
杨付正
李学敏
江亮亮
肖玮
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP14891207.4A (EP3091720A4)
Publication of WO2015169064A1
Priority to US15/248,079 (US10284712B2)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/22 Arrangements for supervision, monitoring or testing
    • H04M 7/00 Arrangements for interconnection between switching centres
    • H04M 7/006 Networks other than PSTN/ISDN providing telephone service, e.g. Voice over Internet Protocol (VoIP), including next generation networks with a packet-switched transport layer
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/69 Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 Arrangements for monitoring or testing data switching networks
    • H04L 43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04M 3/2236 Quality of speech transmission monitoring
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/005 Correction of errors induced by the transmission channel, if related to the coding algorithm

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a network voice quality assessment method, apparatus, and system.
  • VoIP: Voice over Internet Protocol
  • IP: Internet Protocol
  • Voice quality assessment methods can be divided into the parametric planning model, the packet layer model, the bitstream layer model, the media layer model, and the hybrid model. The network voice quality assessment method based on the packet layer model evaluates voice quality only by analyzing the packet header information of the voice data packets; it has low computational complexity and is suitable for situations in which the packet payload information cannot be accessed.
  • The network voice quality assessment method based on the bitstream layer model not only analyzes the packet header information of the voice data packets, but also analyzes the voice payload information and may even perform speech decoding, for example analyzing the waveform of the voice signal to obtain more detailed frame loss and distortion information. It is therefore more accurate than the method based on the packet layer model, but its computational complexity is higher. The two methods each have their own advantages and are the two network voice quality assessment methods commonly used in the prior art.
  • In the prior art, the average compression code rate of the voice is generally used to evaluate compression distortion, and the average packet loss rate is used to estimate the distortion caused by packet loss; the network voice quality is then evaluated based on the compression distortion and the distortion caused by packet loss.
  • In carrying out the present invention, the inventors found that, because of the complexity of speech composition (for example, silence such as speaking gaps often occurs in speech), the existing solution, which measures speech quality only according to its average distortion, has low prediction accuracy and its evaluation results are not accurate enough.
  • The embodiments of the present invention provide a network voice quality assessment method, apparatus, and system, which can improve prediction accuracy and thereby the accuracy of the evaluation results.
  • According to a first aspect, an embodiment of the present invention provides a network voice quality assessment method, including: acquiring a data packet of the network voice, where the data packet of the network voice includes a voice sequence; parsing the data packet to obtain a parsing result; determining frame content characteristics of the data packet according to the parsing result; performing sentence division on the voice sequence according to the determined frame content characteristics, and dividing the divided sentences into multiple frame loss events; extracting non-speech parameters according to the frame loss events; evaluating the voice quality of each sentence according to a preset voice quality evaluation model and the non-speech parameters, to obtain the voice quality of each sentence; and
  • evaluating the voice quality of the voice sequence based on the voice quality of each of the sentences.
  • Optionally, the parsing the data packet to obtain a parsing result includes:
  • parsing the packet header of the data packet to obtain a parsing result, where the parsing result includes the duration of the voice sequence, the number of bits of the voice sequence, the frame loss positions, and the voice load;
  • and the determining frame content characteristics of the data packet according to the parsing result includes: determining, according to the frame loss positions, the lost frame portions that need to be detected in the data packet; determining, according to the duration of the voice sequence, the number of bits of the voice sequence, and the voice load, the frame content characteristic of the previous adjacent non-lost frame of each lost frame portion and the frame content characteristic of the subsequent adjacent non-lost frame; and determining the frame content characteristic of the lost frame portion according to the frame content characteristic of the previous adjacent non-lost frame, the frame content characteristic of the subsequent adjacent non-lost frame, and the mark of the subsequent adjacent non-lost frame.
  • Optionally, the determining, according to the duration of the voice sequence, the number of bits of the voice sequence, and the voice load, the frame content characteristic of a non-lost frame includes: obtaining the actual payload length of the non-lost frame; determining the code rate according to the voice load, the number of bits of the voice sequence, and the duration of the voice sequence; if the standard payload length corresponding to the code rate is consistent with the actual payload length, determining that the non-lost frame is a speech frame; and if the standard payload length corresponding to the code rate is inconsistent with the actual payload length, determining that the non-lost frame is a silence frame.
  • Optionally, the determining the frame content characteristic of the lost frame portion according to the frame content characteristics of the previous adjacent non-lost frame and the subsequent adjacent non-lost frame includes: if both the previous adjacent non-lost frame and the subsequent adjacent non-lost frame are silence frames, or the mark of the subsequent adjacent non-lost frame indicates that it is the first speech frame, determining that the lost frame portion is a silence frame; otherwise, determining that the lost frame portion consists of speech frames.
  • Optionally, the speech frames include key speech frames and non-key speech frames, and when it is determined that the lost frame portion consists of speech frames:
  • if both the previous adjacent non-lost frame and the subsequent adjacent non-lost frame are speech frames, the lost frame portion is determined to consist of key speech frames;
  • if the previous adjacent non-lost frame is a speech frame and the subsequent adjacent non-lost frame is a silence frame, the first half of the lost frame portion is determined to consist of key speech frames and the second half of the lost frame portion is determined to consist of non-key speech frames;
  • if the previous adjacent non-lost frame is a silence frame and the subsequent adjacent non-lost frame is a speech frame, the first half of the lost frame portion is determined to consist of non-key speech frames and the second half of the lost frame portion is determined to consist of key speech frames.
  • Optionally, in a bitstream layer implementation, the parsing the data packet to obtain a parsing result includes: parsing the packet header of the data packet to obtain the duration of the voice sequence, the number of bits of the voice sequence, the frame loss positions, and the voice load; performing adaptive multi-rate (AMR) decoding on the voice load to obtain an AMR-decoded speech signal; and calculating the frame energy of each frame and the average frame energy of the AMR-decoded speech signal according to the duration of the voice sequence and the number of bits of the voice sequence.
  • The determining frame content characteristics of the data packet according to the parsing result then includes: determining, according to the frame loss positions, the lost frame portions that need to be detected in the data packet; determining, according to the calculated frame energy and average frame energy, the frame content characteristic of the previous adjacent non-lost frame of each lost frame portion and the frame content characteristic of the subsequent adjacent non-lost frame; and determining the frame content characteristic of the lost frame portion according to the frame content characteristics of the previous and subsequent adjacent non-lost frames.
  • Optionally, the determining the frame content characteristic of a non-lost frame according to the calculated frame energy and average frame energy includes: if the frame energy of the non-lost frame is less than or equal to 0, determining that the non-lost frame is a silence frame; if the frame energy of the non-lost frame is greater than 0 and less than the average frame energy, determining that the non-lost frame is a non-key speech frame; and if the frame energy of the non-lost frame is greater than the average frame energy, determining that the non-lost frame is a key speech frame.
  • Optionally, the determining the frame content characteristic of the lost frame portion according to the frame content characteristics of the previous and subsequent adjacent non-lost frames includes:
  • determining that the lost frame portion consists of non-key speech frames if both the previous adjacent non-lost frame and the subsequent adjacent non-lost frame are non-key speech frames; and
  • determining, for the other combinations in which the previous or the subsequent adjacent non-lost frame is a silence frame, a non-key speech frame, or a key speech frame, whether the lost frame portion consists of key speech frames or of non-key speech frames accordingly.
  • Optionally, the performing sentence division on the voice sequence according to the determined frame content characteristics and dividing the divided sentences into multiple frame loss events includes: when the number of consecutive silence frames exceeds a preset number, dividing the voice sequence before those silence frames into a sentence; when the distance between two adjacent lost frame portions in a sentence is less than or equal to a preset distance, determining the two adjacent lost frame portions as one frame loss event; and when the distance between two adjacent lost frame portions in a sentence is greater than the preset distance, determining the two adjacent lost frame portions as two frame loss events.
  • Optionally, the evaluating the voice quality of each sentence according to the preset voice quality evaluation model and the non-speech parameters, to obtain the voice quality of each sentence, includes: performing distortion mapping on the frame loss events according to the preset voice quality evaluation model and the non-speech parameters to obtain the total number of lost speech frames, and calculating the voice quality of the sentence according to the total number of lost speech frames.
  • Optionally, the non-speech parameters include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the loss length of the speech frames, and the damage length; and the performing distortion mapping on the frame loss event according to the preset voice quality evaluation model to obtain the total number of lost speech frames includes:
  • mapping the non-key speech frames in the frame loss event to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, determining the number of actually lost key speech frames according to the number of lost speech frames, and mapping the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping; or
  • mapping the damaged frames in the frame loss event to a number of lost speech frames according to the number of lost speech frames, the loss length of the speech frames, and the damage length, determining the number of actually lost key speech frames according to the number of lost speech frames, and mapping the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost speech frames obtained by mapping; or
  • mapping the damaged frames in the frame loss event to a number of lost speech frames according to the loss length and the damage length of the speech frames, determining the number of actually lost key speech frames according to the number of lost speech frames, and mapping the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost speech frames obtained by mapping.
  • Optionally, the non-speech parameters include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the average loss length, and the average damage length; and the mapping of the frames lost at different positions and with different discrete distributions in the frame loss event to the total number of lost speech frames according to the non-speech parameters includes:
  • mapping the non-key speech frames in the frame loss event to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, determining the number of actually lost key speech frames according to the number of lost speech frames, and mapping the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping; or
  • mapping the frame loss event to the total number of lost speech frames according to the average loss length and the average damage length.
  • the embodiment of the present invention further provides a network voice quality evaluation apparatus, including an acquiring unit, a parsing unit, a determining unit, a dividing unit, an extracting unit, and an evaluating unit, where:
  • an acquiring unit, configured to acquire a data packet of the network voice, where the data packet of the network voice includes a voice sequence;
  • a parsing unit, configured to parse the data packet acquired by the acquiring unit to obtain a parsing result;
  • a determining unit, configured to determine the frame content characteristics of the data packet according to the parsing result obtained by the parsing unit, where the frame content characteristics include silence frames and speech frames;
  • a dividing unit, configured to perform sentence division on the voice sequence according to the frame content characteristics determined by the determining unit, and to divide the divided sentences into multiple frame loss events;
  • an extracting unit, configured to extract non-speech parameters according to the frame loss events divided by the dividing unit, where the non-speech parameters include position parameters and discrete distribution parameters; and
  • an evaluating unit, configured to evaluate the voice quality of each sentence according to a preset voice quality evaluation model and the non-speech parameters extracted by the extracting unit, to obtain the voice quality of each sentence, and to evaluate the voice quality of the voice sequence according to the voice quality of each sentence.
  • Optionally, the parsing unit is configured to parse the packet header of the data packet to obtain a parsing result, where the parsing result includes the duration of the voice sequence, the number of bits of the voice sequence, the frame loss positions, and the voice load;
  • the determining unit is specifically configured to determine, according to the frame loss positions, the lost frame portions that need to be detected in the data packet, to determine, according to the duration of the voice sequence, the number of bits of the voice sequence, and the voice load, the frame content characteristic of the previous adjacent non-lost frame of each lost frame portion and the frame content characteristic of the subsequent adjacent non-lost frame, and to determine the frame content characteristic of the lost frame portion according to the frame content characteristic of the previous adjacent non-lost frame, the frame content characteristic of the subsequent adjacent non-lost frame, and the mark of the subsequent adjacent non-lost frame.
  • Optionally, the determining unit is specifically configured to obtain the actual payload length of a non-lost frame; determine the code rate according to the voice load, the number of bits of the voice sequence, and the duration of the voice sequence; if the standard payload length corresponding to the code rate is consistent with the actual payload length, determine that the non-lost frame is a speech frame; and if the standard payload length corresponding to the code rate is inconsistent with the actual payload length, determine that the non-lost frame is a silence frame.
  • Optionally, the determining unit is specifically configured to: if both the previous adjacent non-lost frame and the subsequent adjacent non-lost frame are silence frames, or the mark of the subsequent adjacent non-lost frame indicates that it is the first speech frame, determine that the lost frame portion is a silence frame; otherwise, determine that the lost frame portion consists of speech frames.
  • Optionally, the speech frames include key speech frames and non-key speech frames, and the determining unit is specifically configured to: when both the previous adjacent non-lost frame and the subsequent adjacent non-lost frame are speech frames, determine that the lost frame portion consists of key speech frames; when the previous adjacent non-lost frame is a speech frame and the subsequent adjacent non-lost frame is a silence frame, determine that the first half of the lost frame portion consists of key speech frames and the second half consists of non-key speech frames; and when the previous adjacent non-lost frame is a silence frame and the subsequent adjacent non-lost frame is a speech frame, determine that the first half of the lost frame portion consists of non-key speech frames and the second half consists of key speech frames.
  • Optionally, the parsing unit is configured to parse the packet header of the data packet to obtain a parsing result, where the parsing result includes the duration of the voice sequence, the number of bits of the voice sequence, the frame loss positions, and the voice load; to perform adaptive multi-rate (AMR) decoding on the voice load to obtain an AMR-decoded speech signal; and to calculate the frame energy of each frame and the average frame energy of the AMR-decoded speech signal according to the duration of the voice sequence and the number of bits of the voice sequence;
  • the determining unit is specifically configured to determine, according to the frame loss positions, the lost frame portions that need to be detected in the data packet, to determine, according to the calculated frame energy and average frame energy, the frame content characteristic of the previous adjacent non-lost frame of each lost frame portion and the frame content characteristic of the subsequent adjacent non-lost frame, and to determine the frame content characteristic of the lost frame portion according to the frame content characteristics of the previous and subsequent adjacent non-lost frames.
  • Optionally, the determining unit is specifically configured to: if the frame energy of a non-lost frame is less than or equal to 0, determine that the non-lost frame is a silence frame; if the frame energy of the non-lost frame is greater than 0 and less than the average frame energy, determine that the non-lost frame is a non-key speech frame; and if the frame energy of the non-lost frame is greater than the average frame energy, determine that the non-lost frame is a key speech frame.
  • In a sixth possible implementation manner of the second aspect, the determining unit is specifically configured to:
  • determine that the lost frame portion consists of non-key speech frames if both the previous adjacent non-lost frame and the subsequent adjacent non-lost frame are non-key speech frames; and
  • determine, for the other combinations in which the previous or the subsequent adjacent non-lost frame is a silence frame, a non-key speech frame, or a key speech frame, whether the lost frame portion consists of key speech frames or of non-key speech frames accordingly.
  • Optionally, the dividing unit is configured to: when the number of consecutive silence frames exceeds a preset number, divide the voice sequence before those silence frames into a sentence; when the distance between two adjacent lost frame portions in a sentence is less than or equal to a preset distance, determine the two adjacent lost frame portions as one frame loss event; and when the distance between two adjacent lost frame portions in a sentence is greater than the preset distance, determine the two adjacent lost frame portions as two frame loss events.
  • Optionally, the evaluating unit is configured to perform distortion mapping on the frame loss events according to the preset voice quality evaluation model and the non-speech parameters to obtain the total number of lost speech frames, and to calculate the voice quality of the sentence according to the total number of lost speech frames.
  • In a ninth possible implementation manner of the second aspect, the non-speech parameters include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the loss length of the speech frames, and the damage length; and the evaluating unit is specifically configured to:
  • map the non-key speech frames in the frame loss event to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, determine the number of actually lost key speech frames according to the number of lost speech frames, and map the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping; or
  • map the damaged frames in the frame loss event to a number of lost speech frames according to the number of lost speech frames, the loss length of the speech frames, and the damage length, determine the number of actually lost key speech frames according to the number of lost speech frames, and map the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost speech frames obtained by mapping; or
  • map the damaged frames in the frame loss event to a number of lost speech frames according to the loss length and the damage length of the speech frames, determine the number of actually lost key speech frames according to the number of lost speech frames, and map the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost speech frames obtained by mapping.
  • In another possible implementation manner of the second aspect, the non-speech parameters include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the average loss length, and the average damage length; and the evaluating unit is specifically configured to:
  • map the non-key speech frames in the frame loss event to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, determine the number of actually lost key speech frames according to the number of lost speech frames, and map the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping; or
  • map the frame loss event to the total number of lost speech frames according to the average loss length and the average damage length.
  • the embodiment of the present invention further provides a communication system, including any network voice quality evaluation apparatus provided by the embodiment of the present invention.
  • In the embodiments of the present invention, the acquired data packet of the network voice is parsed, and the frame content characteristics of the data packet are determined according to the parsing result, for example, whether each frame is a silence frame or a speech frame. The voice sequence is then divided into sentences according to the determined frame content characteristics, and each sentence is divided into multiple frame loss events.
  • Non-speech parameters, including position parameters and discrete distribution parameters, are extracted according to the frame loss events, and the voice quality of each sentence is evaluated according to a preset voice quality evaluation model and these parameters.
  • Finally, the voice quality of the entire voice sequence is evaluated according to the voice quality of each sentence. Because the voice sequence is divided into sentences and frame loss events in this scheme, the frame loss mode within a single frame loss event is relatively simple, which makes it easy to study the distortion effect caused by each frame loss event. Moreover, because the frame content characteristics (for example, whether a frame is a silence frame or a speech frame) and the frame loss positions are also taken into account when evaluating the network voice quality, this scheme can effectively improve the accuracy of network voice quality assessment compared with the prior art, which measures network voice quality only according to the average distortion. That is to say, with this scheme, the prediction accuracy can be greatly improved, thereby improving the accuracy of the evaluation results.
  • FIG. 1a is a flowchart of a network voice quality assessment method according to an embodiment of the present invention;
  • FIG. 1b is a schematic diagram of dividing a voice sequence in a network voice quality assessment method according to an embodiment of the present invention;
  • FIG. 1c is a schematic diagram of the analysis of a frame loss event in a network voice quality assessment method according to an embodiment of the present invention;
  • FIG. 2a is a schematic structural diagram of a network voice quality assessment apparatus according to an embodiment of the present invention;
  • FIG. 2b is a diagram showing an example of the sound regions of a word in an embodiment of the present invention;
  • FIG. 2c is another flowchart of a network voice quality assessment method according to an embodiment of the present invention;
  • FIG. 3 is still another flowchart of a network voice quality assessment method according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of a network voice quality evaluation apparatus according to an embodiment of the present invention;
  • FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.
  • Embodiments of the present invention provide a network voice quality assessment method, apparatus, and system. The details are described below separately.
  • the present embodiment will be described from the perspective of a network voice quality assessment device, which may be integrated into a network side device such as a server.
  • A network voice quality assessment method includes: acquiring a data packet of the network voice, where the data packet of the network voice includes a voice sequence; parsing the data packet to obtain a parsing result; determining the frame content characteristics of the data packet according to the parsing result; performing sentence division on the voice sequence according to the determined frame content characteristics, and dividing the divided sentences into multiple frame loss events; extracting non-speech parameters according to the frame loss events; evaluating the voice quality of each sentence according to a preset voice quality evaluation model and the non-speech parameters, to obtain the voice quality of each sentence; and evaluating the voice quality of the voice sequence based on the voice quality of each sentence.
  • the process of the network voice quality assessment method may be specifically as follows:
  • the data packet of the network voice may include a packet header and a voice payload, where the packet header may include a Real-time Transport Protocol (RTP) header, a User Datagram Protocol (UDP) header, and an IP header.
  • The voice payload may include the voice sequence and the like.
  • Depending on the evaluation model used (for example, the packet layer model or the bitstream layer model), the method of parsing the data packet also differs. The details may be as follows:
  • (1) For the packet layer model, the packet header of the data packet can be parsed to obtain a parsing result.
  • The parsing result may include the duration of the voice sequence, the number of bits of the voice sequence, the frame loss positions, and the voice load.
  • The methods for obtaining these parameters may specifically be as follows:
  • Duration_i = Timestamp_(i+1) - Timestamp_i,
  • where Timestamp_i is the timestamp of the i-th data packet and Timestamp_(i+1) is the timestamp of the (i+1)-th data packet, both of which can be read from the RTP header of each data packet.
  • LIP_i is the number of bits of the i-th data packet, which can be obtained directly from the IP header; HIP_i is the IP protocol header length of the i-th data packet; HUDP_i is the UDP protocol header length of the i-th data packet; and HRTP_i is the RTP protocol header length of the i-th data packet.
  • The voice load and the voice duration Duration_max of the i-th data packet are recorded, where the voice load refers to the number of RTP payload bits when the packet load is at its maximum, denoted B_max. Such a packet is generally considered to be non-silent, and the non-silence code rate of the i-th data packet can be calculated from B_max and the corresponding duration.
  • The sequence number field in the RTP header indicates the order of the packets, so the frame loss positions and the number of lost frames can be determined from the RTP sequence number of each packet.
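  • As an illustration of the packet-header analysis described above, the following Python sketch derives the per-packet duration, payload bits, an estimated code rate, and the frame loss positions from RTP header fields. It is a minimal sketch under assumed inputs (a list of already-parsed header fields); the field names and the payload-bit formula (total packet bits minus the IP, UDP, and RTP header lengths) are illustrative assumptions, not the patent's exact implementation.

      # Minimal sketch: deriving Duration_i, payload bits, code rate, and loss
      # positions from parsed RTP/UDP/IP header fields (assumed input format).

      def analyze_headers(packets, timestamp_rate=8000):
          """packets: list of dicts with 'seq', 'timestamp', 'total_bits',
          'ip_hdr_bits', 'udp_hdr_bits', 'rtp_hdr_bits' (assumed field names)."""
          results = []
          for i in range(len(packets) - 1):
              p, nxt = packets[i], packets[i + 1]
              # Duration_i = Timestamp_(i+1) - Timestamp_i, converted to seconds
              duration = (nxt['timestamp'] - p['timestamp']) / timestamp_rate
              # Payload bits: assumed as total packet bits minus the three headers
              payload_bits = (p['total_bits'] - p['ip_hdr_bits']
                              - p['udp_hdr_bits'] - p['rtp_hdr_bits'])
              results.append({'seq': p['seq'], 'duration': duration,
                              'payload_bits': payload_bits})
          if not results:
              return [], 0.0, []
          # The packet with the largest payload (B_max) is taken as non-silent,
          # and its bits/duration ratio is used as the code rate estimate.
          largest = max(results, key=lambda r: r['payload_bits'])
          code_rate_bps = largest['payload_bits'] / largest['duration']
          # Frame loss positions: gaps in the RTP sequence numbers
          lost_positions = []
          for prev, cur in zip(packets, packets[1:]):
              gap = cur['seq'] - prev['seq'] - 1
              if gap > 0:
                  lost_positions.append((prev['seq'] + 1, gap))  # (first lost seq, count)
          return results, code_rate_bps, lost_positions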
  • (2) For the bitstream layer model, in addition to the packet header of the data packet, the voice payload part also needs to be parsed, as follows:
  • adaptive multi-rate (AMR) decoding is performed on the voice load to obtain the AMR-decoded speech signal, and the frame energy of each frame is calculated;
  • the frame energy of each frame can be quantized according to the auditory characteristics and the subjective experience of the human ear: if the frame energy of a frame is greater than 0, it is a speech frame, and the average frame energy is calculated over the speech frames.
  • The frame content characteristics may include silence frames and speech frames; that is, this step (step 103) may detect the frame content of the data packet to determine whether each frame is a speech frame or a silence frame, where the speech frames may be further divided into key speech frames and non-key speech frames.
  • Depending on the evaluation model used, the method for determining the frame content characteristics of the data packet also differs. The packet layer model and the bitstream layer model are taken as examples, as follows:
  • (1) Packet layer model: A. Determine, according to the frame loss positions, the lost frame portions that need to be detected in the data packet.
  • One lost frame portion may include multiple consecutive lost frames.
  • Determining the frame content characteristic of a non-lost frame according to the duration of the voice sequence, the number of bits of the voice sequence, and the voice load may include: obtaining the actual payload length of the non-lost frame; determining the code rate (i.e., the coding rate) according to the voice load, the number of bits of the voice sequence, and the duration of the voice sequence; if the standard payload length corresponding to the code rate is consistent with the actual payload length, determining that the non-lost frame is a speech frame; and if the standard payload length corresponding to the code rate is inconsistent with the actual payload length, determining that the non-lost frame is a silence frame.
  • A correspondence table may be set up to record the correspondence between code rates and standard payload lengths, so that the corresponding standard payload length can be obtained by looking up the table according to the code rate. For example, see Table 1 for details.
  • Table 1 (AMR mode; code rate, kb/s; standard payload length, bytes): AMR475, 4.75, 14; AMR515, 5.15, 15; AMR59, 5.9, 17; AMR67, 6.7, 19; AMR74, 7.4, 21; AMR795, 7.95, 22; AMR102, 10.2, 28; AMR122, 12.2, 33.
  • Hence, the standard payload length corresponding to a code rate of 4.75 kb/s is 14 bytes, so if the actual payload length of a non-lost frame is 14 bytes, it is a speech frame; otherwise, if the actual payload length of the non-lost frame is not 14 bytes, it is a silence frame, and so on.
  • The frame content characteristic of the lost frame portion is then determined: if both the previous adjacent non-lost frame and the subsequent adjacent non-lost frame are silence frames, or the mark of the subsequent adjacent non-lost frame indicates that it is the first speech frame (for example, marked as 1), the lost frame portion is determined to be a silence frame; otherwise, the lost frame portion is determined to consist of speech frames.
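  • The table-lookup decision described above can be sketched as follows; the AMR mode-to-payload-length mapping follows Table 1, while the function name and the nearest-rate matching are illustrative assumptions rather than the patent's exact implementation.

      # Sketch: classify a non-lost frame as speech or silence by comparing its
      # actual payload length with the standard payload length for the code rate.

      # Standard payload lengths (bytes) per AMR code rate (kb/s), per Table 1.
      AMR_STANDARD_PAYLOAD = {
          4.75: 14, 5.15: 15, 5.9: 17, 6.7: 19,
          7.4: 21, 7.95: 22, 10.2: 28, 12.2: 33,
      }

      def classify_non_lost_frame(actual_payload_bytes, code_rate_kbps):
          """Return 'speech' if the actual payload matches the standard payload
          length for the (nearest) AMR code rate, otherwise 'silence'."""
          nearest_rate = min(AMR_STANDARD_PAYLOAD, key=lambda r: abs(r - code_rate_kbps))
          standard_len = AMR_STANDARD_PAYLOAD[nearest_rate]
          return 'speech' if actual_payload_bytes == standard_len else 'silence'

      # Example: at 4.75 kb/s a 14-byte payload is a speech frame
      assert classify_non_lost_frame(14, 4.75) == 'speech'
      assert classify_non_lost_frame(6, 4.75) == 'silence'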
  • Optionally, the speech frames can also be divided into key speech frames and non-key speech frames, so that different processing can subsequently be performed for key speech frames and non-key speech frames.
  • A key speech frame refers to a frame that has a great influence on the voice quality, while a non-key speech frame refers to a frame that has less influence on the voice quality.
  • If the speech frames are divided into key speech frames and non-key speech frames, then after it is determined that the lost frame portion consists of speech frames, the lost frame portion can be further classified as key or non-key speech frames according to the frame content characteristics of the previous and subsequent adjacent non-lost frames, as described above.
  • (2) Bitstream layer model: the frame content detection of the bitstream layer model is more elaborate than that of the packet layer model; for example, the speech frames here can include key speech frames and non-key speech frames, and so on.
  • At this time, the step of "determining the frame content characteristics of the data packet according to the obtained parsing result" may specifically be as follows:
  • A. Determine, according to the frame loss positions, the lost frame portions that need to be detected in the data packet.
  • One lost frame portion may include multiple consecutive lost frames.
  • The frame content characteristic of a non-lost frame is determined according to the calculated frame energy and average frame energy: if the frame energy of the non-lost frame is less than or equal to 0, the non-lost frame is a silence frame; if the frame energy is greater than 0 and less than the average frame energy, the non-lost frame is a non-key speech frame; and if the frame energy is greater than the average frame energy, the non-lost frame is a key speech frame.
  • The frame content characteristic of the lost frame portion is then determined according to the frame content characteristics of the previous and subsequent adjacent non-lost frames; for example, depending on the combination, the lost frame portion may be determined to consist of key speech frames or of non-key speech frames, or its first half and second half may be classified differently (for instance, the second half of the lost frame portion being key speech frames).
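  • For the bitstream layer model, the frame-energy thresholds described above (at most 0 for silence, between 0 and the average frame energy for non-key speech, above the average for key speech) can be sketched as follows; the energy formula itself is an assumed placeholder, since the text only requires that frame energy be quantified according to auditory characteristics.

      import math

      def frame_energy(samples):
          # Assumed placeholder energy measure for one decoded frame
          # (log of mean squared amplitude); the patent text does not fix
          # this exact formula.
          mean_sq = sum(s * s for s in samples) / max(len(samples), 1)
          return 10 * math.log10(mean_sq) if mean_sq > 0 else 0.0

      def classify_by_energy(frame_energies):
          """Classify each frame as silence / non-key speech / key speech."""
          speech = [e for e in frame_energies if e > 0]          # speech frames
          avg = sum(speech) / len(speech) if speech else 0.0     # average frame energy
          labels = []
          for e in frame_energies:
              if e <= 0:
                  labels.append('silence')
              elif e < avg:
                  labels.append('non-key speech')
              else:
                  labels.append('key speech')
          return labels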
  • For example, when the number of consecutive silence frames exceeds a preset number Ns, the voice sequence before those silence frames can be divided into a sentence, where Ns can be set according to actual application requirements; for example, Ns can be set to 6.
  • Likewise, the preset number and the preset distance used for dividing frame loss events may be set according to actual application requirements; for example, the preset number can be set to 6 and the preset distance can be set to 10.
  • The non-speech parameters may include position parameters and discrete distribution parameters.
  • The extracted non-speech parameters may differ depending on the speech quality assessment method used subsequently.
  • For example, the non-speech parameters may include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the loss length of the speech frames, and the damage length; or the non-speech parameters may include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the average loss length, and the average damage length. These parameters are briefly described below:
  • the distance L_j between a non-key speech frame and a key speech frame: according to the auditory perception characteristics of the human ear, the farther a lost non-key speech frame is from the adjacent key speech frame, the smaller the distortion it causes.
  • The voice quality of each sentence is evaluated according to the preset voice quality evaluation model and the non-speech parameters, to obtain the voice quality of each sentence.
  • Specifically, this may be as follows:
  • distortion mapping is performed on the frame loss events according to the preset voice quality evaluation model and the non-speech parameters to obtain the total number of lost speech frames, and the voice quality of the sentence is calculated according to the total number of lost speech frames.
  • The step of "performing distortion mapping on the frame loss events according to the preset voice quality evaluation model and the non-speech parameters to obtain the total number of lost speech frames" may be implemented in any one of the following ways:
  • In one implementation, the non-speech parameters may include position parameters and discrete distribution parameters; specifically, the non-speech parameters may include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the loss length of the speech frames, and the damage length.
  • In this case, the step of "performing distortion mapping on the frame loss events according to the preset voice quality evaluation model and the non-speech parameters to obtain the total number of lost speech frames" may include:
  • mapping the non-key speech frames in the frame loss event to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, determining the number of actually lost key speech frames according to the number of lost speech frames, and mapping the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping, as follows:
  • FLN_{i,j} is the number of key speech frames to which the j-th lost non-key speech frame in the i-th frame loss event is mapped, and L_j is the distance between the j-th non-key speech frame and the key speech frame.
  • The total number of lost speech frames can then be obtained, where FLN_i is the total number of lost speech frames (i.e., the total number of lost key speech frames) obtained by mapping the i-th frame loss event and n_i denotes the number of actually lost key speech frames.
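  • Because the concrete mapping formulas appear only in the original filing, the following sketch illustrates the idea with assumed functional forms: each lost non-key speech frame contributes a fraction of a key speech frame that decays with its distance L_j from the key speech frame, and the event total combines the actually lost key speech frames n_i with the mapped contributions. The exponential decay and the additive combination are assumptions for illustration only.

      # Sketch of the distance-based distortion mapping for one frame loss event.
      # ASSUMED forms: FLN_{i,j} = exp(-L_j / decay) and FLN_i = n_i + sum_j FLN_{i,j}.

      import math

      def map_non_key_frame(distance_lj, decay=4.0):
          # Assumed decaying map: the farther a lost non-key frame is from the
          # key speech frame, the smaller its contribution (per the
          # auditory-perception argument in the text).
          return math.exp(-distance_lj / decay)

      def total_lost_speech_frames(n_lost_key, non_key_distances):
          """n_lost_key: key speech frames actually lost in the event (n_i);
          non_key_distances: list of L_j for each lost non-key speech frame."""
          mapped = sum(map_non_key_frame(lj) for lj in non_key_distances)
          return n_lost_key + mapped   # FLN_i (assumed additive combination)

      # Example: 3 lost key frames plus non-key losses at distances 1 and 5
      print(total_lost_speech_frames(3, [1, 5]))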
  • Alternatively, the damaged frames in the frame loss event are mapped to a number of lost speech frames according to the number of lost speech frames, the loss length of the speech frames, and the damage length; the number of actually lost key speech frames is determined according to the number of lost speech frames; and the frame loss event is mapped to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost speech frames obtained by mapping. For example, this may specifically be as follows:
  • N_0k is the loss length of the k-th loss of speech frames in the frame loss event;
  • a_0k is the damage impact, on a single non-lost speech frame, of the number of lost speech frames and of a single loss length;
  • L_k is the damage length of the k-th occurrence of frame loss in the frame loss event;
  • FLN_{i,k} is the number of speech frames to which a single frame among the L_k damaged speech frames in the i-th frame loss event is mapped.
  • The parameters a_1, b_1, c_1, a_2, b_2, c_2, a_3, b_3 and c_3 can be obtained by training.
  • The frame loss event is then mapped to the total number of lost speech frames, where FLN_i is the total number of lost speech frames obtained by mapping the frame loss event and n_i denotes the number of actually lost key speech frames.
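  • The damaged-frame variant can be organized as below. The exact expressions for a_0k and FLN_{i,k} are not reproduced in this text, so the sketch uses stand-in functional forms with a few illustrative trained parameters; only the overall structure (per-loss damage impact, per-damaged-frame mapping over the damage length L_k, and an additive event total) follows the description above, and the formulas themselves are placeholders.

      # Structural sketch of the damaged-frame distortion mapping (placeholder math).

      import math

      def damage_impact(n_lost, loss_len, params):
          # a_0k: assumed saturating impact of the loss count and one loss length
          # on a single non-lost speech frame (placeholder form).
          return params['a1'] * (1 - math.exp(-(n_lost + loss_len) / params['b1'])) + params['c1']

      def mapped_damaged_frame(a0k, position_in_damage, params):
          # FLN_{i,k}: assumed decaying contribution of one damaged frame,
          # weaker the farther it lies inside the damage length L_k.
          return a0k * math.exp(-position_in_damage / params['b2'])

      def event_total(n_lost_key, losses, params):
          """losses: list of (loss_length N_0k, damage_length L_k) pairs for the event."""
          n_lost = sum(n0k for n0k, _ in losses)
          total = 0.0
          for n0k, lk in losses:
              a0k = damage_impact(n_lost, n0k, params)
              total += sum(mapped_damaged_frame(a0k, pos, params) for pos in range(1, lk + 1))
          return n_lost_key + total   # FLN_i (assumed additive combination)

      params = {'a1': 0.8, 'b1': 5.0, 'c1': 0.1, 'b2': 6.0}   # illustrative values only
      print(event_total(3, [(2, 4), (1, 3)], params))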
  • In another implementation, the non-speech parameters may include position parameters and discrete distribution parameters; specifically, the non-speech parameters include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the loss length of the speech frames, and the damage length. In this case:
  • the non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, the number of actually lost key speech frames is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping; or
  • the damaged frames in the frame loss event are mapped to a number of lost speech frames according to the loss length and the damage length of the speech frames, the number of actually lost key speech frames is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost speech frames obtained by mapping. For example, this may specifically be as follows:
  • FLN_{i,k} is the number of speech frames to which a frame among the L_k damaged speech frames in the i-th frame loss event is mapped;
  • a_0k is the damage impact of a single loss length on a single non-lost speech frame in the frame loss event;
  • the parameters a_1, b_1, c_1, a_2, b_2 and c_2 can be obtained by training.
  • The frame loss event is then mapped to the total number of lost speech frames, where FLN_i is the total number of lost speech frames obtained by mapping the frame loss event and n_i denotes the number of actually lost key speech frames.
  • The third method no longer calculates the distortion of single frames, but directly calculates the distortion of the entire frame loss event.
  • In this case, the non-speech parameters may include position parameters and discrete distribution parameters; specifically, the non-speech parameters may include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the average loss length, and the average damage length. The mapping of the frames lost at different positions and with different discrete distributions in the frame loss event to the total number of lost speech frames according to the non-speech parameters may then include:
  • mapping the non-key speech frames in the frame loss event to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, determining the number of actually lost key speech frames according to the number of lost speech frames, and mapping the frame loss event to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping; or
  • mapping the frame loss event to the total number of lost speech frames according to the average loss length and the average damage length, where FLN_i is the total number of lost speech frames to which the frame loss event is mapped, N_0 is the average loss length of the speech frames, and L is the damage length; the parameters a_1, b_1, c_1, a_2, b_2, a_3 and b_3 can be obtained by training.
  • After the distortion mapping, the voice quality of the sentence can be calculated according to the total number of lost speech frames, as follows:
  • the number of lost speech frames for a sentence is:
  • FLN = f(FLN_1, FLN_2, ..., FLN_M);
  • where M is the number of frame loss events in the sentence and FLN_i is the total number of lost speech frames obtained by mapping each frame loss event.
  • MOS_0 = f(R);
  • where MOS_0 is the quality of the sentence without considering packet loss, which can be obtained directly by table lookup for the code rate R.
  • Q_n is the quality of the sentence when packet loss is taken into account, and N is the number of sentences in the voice sequence.
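  • A compact sketch of this per-sentence step follows. The aggregation function f over the per-event totals is not specified in this text, so a simple sum is used as a stand-in; the MOS_0 lookup uses the two code-rate entries quoted later in this description (4.75 kb/s giving 3.465 and 5.15 kb/s giving 3.502) and would be extended with the full table in practice.

      # Sketch: per-sentence lost-frame total and compression-only quality MOS_0.

      # Assumed aggregation f: sum of the per-event totals FLN_1..FLN_M.
      def sentence_lost_frames(event_totals):
          return sum(event_totals)            # FLN = f(FLN_1, ..., FLN_M)

      # MOS_0 = f(R): direct table lookup by code rate (partial table shown).
      MOS0_TABLE = {4.75: 3.465, 5.15: 3.502}    # kb/s -> MOS_0

      def mos0_for_rate(code_rate_kbps):
          nearest = min(MOS0_TABLE, key=lambda r: abs(r - code_rate_kbps))
          return MOS0_TABLE[nearest]

      print(sentence_lost_frames([2.4, 1.0, 0.6]), mos0_for_rate(4.75))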
  • As can be seen from the above, in this embodiment, the acquired data packet of the network voice is parsed, and the frame content characteristics of the data packet are determined according to the parsing result, for example, whether each frame is a silence frame or a speech frame. The voice sequence is then divided into sentences according to the determined frame content characteristics, and each sentence is divided into multiple frame loss events.
  • After the non-speech parameters, including position parameters and discrete distribution parameters, are extracted according to the frame loss events, the voice quality of each sentence is evaluated according to the preset voice quality evaluation model and the non-speech parameters, and finally the voice quality of the entire voice sequence is evaluated according to the voice quality of each sentence.
  • Because the sentence division and the frame loss event division make the frame loss mode within a single frame loss event relatively simple, it is easy to study the distortion effect caused by each frame loss event. Moreover, because the frame content characteristics (for example, whether a frame is a silence frame or a speech frame) and the frame loss positions are also taken into account when evaluating the network voice quality, this scheme can effectively improve the accuracy of network voice quality assessment compared with the prior art, which measures network voice quality only according to the average distortion; that is to say, this scheme can greatly improve the prediction accuracy and thereby the accuracy of the evaluation results.
  • In this embodiment, the packet layer model will be described as an example.
  • The network voice quality evaluation apparatus may include a parsing module, a detecting module, a dividing module, a non-speech parameter extracting module, and a voice quality evaluating module, where the functions of each module may specifically be as follows:
  • The parsing module is configured to obtain the data packet of the network voice and to parse the obtained data packet to obtain a parsing result, where the parsing result may include the duration of the voice sequence, the number of bits of the voice sequence, the frame loss positions, the voice load, and so on.
  • The detecting module is mainly configured to determine the frame content characteristics of the data packet according to the obtained parsing result, that is, to determine whether each data frame is a silence frame or a speech frame.
  • Because lost frames carry no data to analyze, the frame content characteristics of the non-lost frames may be analyzed first, and then, according to the short-term correlation of the speech signal, the frame content characteristics of the adjacent non-lost frames are used to determine the frame content characteristic of the currently lost frames.
  • Frame loss occurring at different positions of a word has a different influence. As shown in FIG. 2b, A represents the middle region (or key region) of the word, B and C respectively represent the beginning and the end of the word (generally referred to as non-key regions), and D represents a silence region.
  • Accordingly, the detecting module can further analyze the lost frame portion to determine whether the currently lost frames are key speech frames or non-key speech frames.
  • The dividing module is configured to perform sentence division on the voice sequence according to the determined frame content characteristics, and to divide the divided sentences into multiple frame loss events.
  • The non-speech parameter extraction module is configured to extract non-speech parameters according to the frame loss events, where the non-speech parameters may include position parameters and discrete distribution parameters.
  • The voice quality evaluation module is configured to evaluate the voice quality of each sentence according to the preset voice quality evaluation model and the non-speech parameters, to obtain the voice quality of each sentence, and then to evaluate the voice quality of the voice sequence according to the voice quality of each sentence.
  • As shown in FIG. 2c, the specific process of the network voice quality assessment method can be as follows:
  • the parsing module acquires a data packet of the network voice.
  • the data packet of the network voice may include a packet header and a voice payload, where the packet header may include an RTP header, a UDP header, an IP header, and the like, and the voice payload may include a voice sequence or the like.
  • the parsing module parses the packet header of the data packet to obtain an analysis result.
  • the parsing result may include the duration of the speech sequence, the number of bits of the speech sequence, the frame loss position, and the voice load.
  • the method for obtaining the above parameters may be specifically as follows:
  • Duration_i = Timestamp_(i+1) - Timestamp_i,
  • where Timestamp_i is the timestamp of the i-th data packet and Timestamp_(i+1) is the timestamp of the (i+1)-th data packet, both of which can be read from the RTP header of each data packet.
  • LIP_i is the number of bits of the i-th data packet, which can be obtained directly from the IP header; HIP_i is the IP protocol header length of the i-th data packet; HUDP_i is the UDP protocol header length of the i-th data packet; and HRTP_i is the RTP protocol header length of the i-th data packet.
  • The voice load and the voice duration Duration_max of the i-th data packet are recorded, where the voice load refers to the number of RTP payload bits when the packet load is at its maximum, denoted B_max. Such a packet is generally considered to be non-silent, and the non-silence code rate of the i-th data packet can be calculated from B_max and the corresponding duration.
  • The sequence number field in the RTP header indicates the order of the packets, so the frame loss positions and the number of lost frames can be determined from the RTP sequence number of each packet.
  • The detecting module determines, according to the frame loss positions, the lost frame portions that need to be detected in the data packet.
  • One lost frame portion may include multiple consecutive lost frames.
  • The detecting module determines, according to the duration of the voice sequence, the number of bits of the voice sequence, and the voice load, the frame content characteristic of the previous adjacent non-lost frame of the lost frame portion and the frame content characteristic of the subsequent adjacent non-lost frame, as well as the mark of the subsequent adjacent non-lost frame.
  • For example, if the currently lost frames are the n-th frame to the (n+m-1)-th frame, the detecting module can determine the frame content characteristics of the (n-1)-th frame and of the (n+m)-th frame, and the mark of the (n+m)-th frame, according to the duration of the voice sequence, the number of bits of the voice sequence, and the voice load.
  • The frame content characteristics may include silence frames and speech frames. Determining the frame content characteristic of a non-lost frame according to the duration of the voice sequence, the number of bits of the voice sequence, and the voice load may include: obtaining the actual payload length of the non-lost frame; determining the code rate (i.e., the coding rate) according to the voice load, the number of bits of the voice sequence, and the duration of the voice sequence; if the standard payload length corresponding to the code rate is consistent with the actual payload length, determining that the non-lost frame is a speech frame; and if the standard payload length corresponding to the code rate is inconsistent with the actual payload length, determining that the non-lost frame is a silence frame.
  • A correspondence table may be set up to record the correspondence between code rates and standard payload lengths, so that the corresponding standard payload length can be obtained by looking up the table according to the code rate; see Table 1 for details.
  • The detecting module then determines the frame content characteristic of the lost frame portion according to the frame content characteristic of the previous adjacent non-lost frame, the frame content characteristic of the subsequent adjacent non-lost frame, and the mark of the subsequent adjacent non-lost frame.
  • For example, suppose the currently lost frames are the n-th frame to the (n+m-1)-th frame (that is, the lost frame portion is the n-th frame to the (n+m-1)-th frame), so that the previous adjacent non-lost frame is the (n-1)-th frame and the subsequent adjacent non-lost frame is the (n+m)-th frame.
  • If both the (n-1)-th frame and the (n+m)-th frame are silence frames, or the mark of the (n+m)-th frame indicates that the (n+m)-th frame is the first speech frame (for example, marked as 1), it is determined that the lost frame portion is a silence frame; otherwise, it is determined that the lost frame portion consists of speech frames.
  • Optionally, the speech frames may also be divided into key speech frames and non-key speech frames. If the speech frames are divided into key speech frames and non-key speech frames, then after it is determined that the lost frame portion consists of speech frames, the specific cases may include the following:
  • if both the (n-1)-th frame and the (n+m)-th frame are speech frames, it is determined that the lost frame portion consists of key speech frames;
  • if the (n-1)-th frame is a speech frame and the (n+m)-th frame is a silence frame, it is determined that the first half of the lost frame portion consists of key speech frames and the second half of the lost frame portion consists of non-key speech frames;
  • if the (n-1)-th frame is a silence frame and the (n+m)-th frame is a speech frame, it is determined that the first half of the lost frame portion consists of non-key speech frames and the second half of the lost frame portion consists of key speech frames.
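  • The decision logic just described can be summarized in the following sketch, which labels a run of m lost frames from the types of its two neighbouring non-lost frames; the frame-type strings, the flag name, and the handling of an odd-length split are illustrative assumptions.

      # Sketch: packet-layer classification of a lost frame portion of length m
      # from the previous (n-1) and subsequent (n+m) non-lost frames.

      def classify_lost_portion(prev_type, next_type, next_is_first_speech, m):
          """prev_type / next_type: 'silence' or 'speech';
          next_is_first_speech: mark of the (n+m)-th frame; m: lost frame count."""
          if (prev_type == 'silence' and next_type == 'silence') or next_is_first_speech:
              return ['silence'] * m
          if prev_type == 'speech' and next_type == 'speech':
              return ['key speech'] * m
          half = m // 2
          if prev_type == 'speech' and next_type == 'silence':
              # first half key, second half non-key
              return ['key speech'] * half + ['non-key speech'] * (m - half)
          # prev silence, next speech: first half non-key, second half key
          return ['non-key speech'] * half + ['key speech'] * (m - half)

      print(classify_lost_portion('speech', 'silence', False, 5))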
  • The dividing module performs sentence division on the voice sequence according to the determined frame content characteristics, and divides the divided sentences into multiple frame loss events.
  • For example, when the number of consecutive silence frames exceeds a preset number Ns, the voice sequence before those silence frames can be divided into a sentence, where Ns can be set according to actual application requirements; for example, Ns can be set to 6.
  • Within a sentence, when the distance between two adjacent lost frame portions is less than or equal to a preset distance, the two adjacent lost frame portions are determined as one frame loss event; otherwise they are determined as two frame loss events. The preset number and the preset distance may be set according to actual application requirements; for example, the preset number can be set to 6 and the preset distance can be set to 10.
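  • As a sketch of this division step under the rules above, the following code splits a per-frame label sequence into sentences at runs of more than Ns consecutive silence frames and then groups lost frame portions into frame loss events using the preset distance; the list-of-labels input format and the exact boundary handling are assumptions for illustration.

      # Sketch: sentence division (runs of silence > Ns) and frame loss event
      # grouping (adjacent lost portions within max_gap belong to one event).

      def split_sentences(frame_labels, ns=6):
          """frame_labels: per-frame labels such as 'silence', 'speech', 'lost'."""
          sentences, current, silence_run = [], [], 0
          for label in frame_labels:
              current.append(label)
              silence_run = silence_run + 1 if label == 'silence' else 0
              if silence_run > ns:               # more than Ns consecutive silence frames
                  sentences.append(current)
                  current, silence_run = [], 0
          if current:
              sentences.append(current)
          return sentences

      def group_loss_events(sentence, max_gap=10):
          """Group indices of lost frames into frame loss events."""
          lost = [i for i, label in enumerate(sentence) if label == 'lost']
          events = []
          for i in lost:
              if events and i - events[-1][-1] <= max_gap:
                  events[-1].append(i)           # same frame loss event
              else:
                  events.append([i])             # start a new frame loss event
          return events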
  • The non-speech parameter extraction module extracts non-speech parameters according to the frame loss events.
  • The non-speech parameters may include position parameters and discrete distribution parameters; for example, the non-speech parameters may include the distance between a non-key speech frame and a key speech frame, the number of lost speech frames, the loss length of the speech frames, and the damage length, as follows:
  • the distance L_j between a non-key speech frame and a key speech frame: according to the auditory perception characteristics of the human ear, the farther a lost non-key speech frame is from the adjacent key speech frame, the smaller the distortion it causes;
  • the number of lost speech frames N_1: the number of lost speech frames in the frame loss event;
  • the loss length N_0k: the number of consecutively lost speech frames in the k-th loss within the frame loss event;
  • the damage length L_k: the number of non-lost speech frames between two consecutive losses within the frame loss event.
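  • The parameters listed above can be extracted from a labelled frame sequence roughly as follows; the label strings and the nearest-key-frame convention used for L_j are illustrative assumptions.

      # Sketch: extracting non-speech parameters for one frame loss event from a
      # per-frame label list ('key', 'non-key', 'silence') plus a lost-frame mask.

      def extract_parameters(labels, lost_flags):
          """labels[i]: content label of frame i (for lost frames, the label inferred
          as described above); lost_flags[i]: True if frame i was lost."""
          # N_1: number of lost speech frames in the event
          n1 = sum(1 for lab, lost in zip(labels, lost_flags)
                   if lost and lab in ('key', 'non-key'))
          # N_0k: lengths of consecutive lost runs; L_k: non-lost gaps between them
          loss_lengths, damage_lengths, run, gap = [], [], 0, 0
          for lost in lost_flags:
              if lost:
                  if run == 0 and loss_lengths:
                      damage_lengths.append(gap)
                  run, gap = run + 1, 0
              else:
                  if run:
                      loss_lengths.append(run)
                  run, gap = 0, gap + 1
          if run:
              loss_lengths.append(run)
          # L_j: distance from each lost non-key frame to the nearest key speech frame
          key_positions = [i for i, lab in enumerate(labels) if lab == 'key']
          distances = [min(abs(i - k) for k in key_positions) if key_positions else 0
                       for i, lab in enumerate(labels)
                       if lost_flags[i] and lab == 'non-key']
          return {'N_1': n1, 'loss_lengths': loss_lengths,
                  'damage_lengths': damage_lengths, 'L_j': distances}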
  • The voice quality evaluation module performs distortion mapping on the frame loss events according to the preset voice quality evaluation model and the obtained non-speech parameters, to obtain the total number of lost speech frames.
  • For example, this may specifically be as follows:
  • the non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between the non-key speech frames and the key speech frames, the number of actually lost key speech frames is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of actually lost key speech frames and the number of lost key speech frames obtained by mapping, as follows:
  • FLN_{i,j} is the number of key speech frames to which the j-th lost non-key speech frame in the i-th frame loss event is mapped, and L_j is the distance between the j-th non-key speech frame and the key speech frame.
  • The total number of lost speech frames can then be obtained, where FLN_i is the total number of lost speech frames (i.e., the total number of lost key speech frames) obtained by mapping the i-th frame loss event and n_i denotes the number of actually lost key speech frames.
  • the damage frame in the frame loss event is mapped to the number of lost voice frames according to the number of frame loss of the voice frame, the length of the lost frame of the voice frame, and the length of the damage, and the actual number of lost frames is determined according to the number of frames lost in the voice frame.
  • the number of lost key speech frames is mapped to the total number of lost speech frames according to the number of the lost key speech frames and the number of lost speech frames obtained by the mapping; for example, the following may be specifically as follows:
  • the specific can be:
  • N 0k is the length of the lost time of the speech frame
  • a 0k is the damage effect of the number of lost frames of the speech frame and the length of one loss on the loss of a single unmissed speech frame
  • L k is the damage length of the kth occurrence of the frame loss event
  • FLN i, k is the number of single-frame mapped speech frames in the L k speech impairment frames in the i-th frame loss event.
  • the parameters a 1 , b 1 , c 1 , a 2 , b 2 , c 2 , a 3 , b 3 and c 3 can be obtained by training.
  • The frame loss event is then mapped to the total number of lost speech frames, which can be such that:
  • FLN_i is the total number of lost speech frames obtained by mapping the frame loss event, and n_i represents the number of key speech frames actually lost (an illustrative mapping sketch is given below).
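  • The exact mapping formulas and their trained parameters are given by the equations referenced above and are not reproduced in this text. Purely to illustrate how the pieces combine, the sketch below uses assumed exponential forms with placeholder constants; the function names, the event representation, and the numerical values are all assumptions made for this example.

```python
import math

# Hedged sketch of the distortion-mapping structure described above; the
# exponential forms and constants are placeholders, not the patent's formulas.
def map_nonkey_frame(distance_lj, a1=1.0, b1=0.5):
    # A lost non-key frame far from the key speech frame contributes less distortion.
    return a1 * math.exp(-b1 * distance_lj)

def map_damaged_frames(n1, n0k, lk, a2=1.0, b2=0.2, c2=0.1):
    # Un-lost frames between loss occurrences are "damaged" by error propagation;
    # their contribution grows with the amount of loss and shrinks with the damage length.
    a0k = 1.0 - math.exp(-b2 * (n1 + n0k))   # impairment factor of one occurrence
    return a2 * a0k * math.exp(-c2 * lk)

def total_lost_frames(event):
    """event: dict with n_key_lost, nonkey_distances, n1 and occurrences
    (a list of (N0k, Lk) pairs) -- an assumed representation."""
    fln = float(event["n_key_lost"])                       # key frames actually lost
    fln += sum(map_nonkey_frame(lj) for lj in event["nonkey_distances"])
    fln += sum(map_damaged_frames(event["n1"], n0k, lk)
               for n0k, lk in event["occurrences"])
    return fln
```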
  • the voice quality assessment module calculates a voice quality of the statement according to the total number of the lost voice frames, as follows:
  • the number of lost speech frames for a statement is:
  • FLN = f(FLN_1, FLN_2, ..., FLN_M);
  • where M is the number of frame loss events in the statement,
  • and FLN_i is the total number of lost speech frames obtained by mapping the i-th frame loss event.
  • The function can, for example, be as follows:
  • MOS_0 = f(R);
  • and the relationship between the statement distortion and the statement quality can be as follows:
  • where D is the statement distortion,
  • MOS_0 is the quality of the statement without packet loss (i.e., reflecting only the compression distortion of the statement),
  • Q_n is the quality of the statement considering packet loss,
  • and a and b are fixed model parameters, which can be obtained through training.
  • Since the code rate R (i.e., the coding rate) and MOS_0 have a fixed correspondence, the quality evaluation can directly look up a table to obtain MOS_0.
  • For example, the MOS_0 corresponding to a code rate of 4.75 kb/s is 3.465,
  • and the MOS_0 corresponding to a code rate of 5.15 kb/s is 3.502, and so on (a lookup sketch follows below).
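  • A minimal sketch of such a table lookup is shown below; only the two entries quoted in the text are real values, and any other rates would have to come from the full table.

```python
# Hedged sketch of the R -> MOS_0 table lookup; the table is deliberately partial.
MOS0_TABLE = {4.75: 3.465, 5.15: 3.502}   # code rate in kb/s -> MOS_0

def mos0_from_rate(rate_kbps):
    try:
        return MOS0_TABLE[rate_kbps]
    except KeyError:
        raise ValueError(f"rate {rate_kbps} kb/s is not in this partial table")
```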
  • The voice quality evaluation module evaluates the voice quality of the speech sequence according to the voice quality of each statement, that is, it combines the voice quality of every statement in the speech sequence to obtain the voice quality Q of the speech sequence (a placeholder combination sketch is given after the definitions below), where:
  • Q is the speech quality of the speech sequence
  • Q n is the quality of the statement considering the loss of the packet
  • N is the number of statements in the speech sequence.
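  • The combination formula for Q is given by the equation referenced above and is not reproduced in this text; the placeholder below simply averages the per-statement qualities to make the data flow concrete, which is an assumption for illustration only.

```python
# Hedged placeholder: combine the per-statement qualities Q_n into the sequence
# quality Q. The arithmetic mean stands in for the patent's own combination formula.
def sequence_quality(statement_qualities):
    if not statement_qualities:
        raise ValueError("no statements in the speech sequence")
    return sum(statement_qualities) / len(statement_qualities)
```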
  • In summary, this embodiment uses a packet-layer-model-based method: the acquired network voice data packets are parsed, the frame content characteristics of the packets are determined according to the parsing result (for example, distinguishing silence frames from speech frames), the speech sequence is then divided into statements according to the determined frame content characteristics, and each statement is divided into multiple frame loss events; after non-speech parameters (including position parameters and discrete distribution parameters) are extracted according to the frame loss events, the speech quality of each statement is evaluated according to these non-speech parameters using the preset speech quality assessment model, and finally the speech quality of the entire speech sequence is evaluated according to the speech quality of each statement.
  • In this scheme, the statement division of the speech sequence and the division into frame loss events make the frame loss mode within a single frame loss event relatively simple, so that the distortion effect caused by each frame loss event is easy to study. Moreover, because the frame content characteristics (such as whether a frame is a silence frame or a speech frame) and the frame loss positions are also taken into account when evaluating network voice quality, the accuracy of the network voice quality assessment can be effectively improved compared with prior-art schemes that measure it only according to an average distortion condition; that is, the scheme can greatly improve the prediction precision and thus the accuracy of the evaluation results.
  • In this embodiment, the bitstream layer model will be described as an example.
  • the network voice quality evaluation apparatus used in this embodiment is the same as that in the second embodiment, as shown in FIG. 2a and the description in the second embodiment.
  • the difference between this embodiment and the second embodiment mainly lies in the analysis of the data packet and the detection of the frame content characteristics, which will be described in detail below.
  • A network voice quality assessment method is shown in Figure 3; the specific process can be as follows:
  • the parsing module acquires a data packet of the network voice.
  • the data packet of the network voice may include a packet header and a voice payload, where the packet header may include an RTP header, a UDP header, an IP header, and the like, and the voice payload may include a voice sequence or the like.
  • the parsing module parses the packet header of the data packet to obtain an analysis result.
  • the parsing result may include the duration of the speech sequence, the number of bits of the speech sequence, the frame loss position, and the voice load.
  • the method for obtaining the above parameters may be specifically as follows:
  • Duration_i = Timestamp_{i+1} - Timestamp_i;
  • where Timestamp_i is the timestamp of the i-th data packet,
  • and Timestamp_{i+1} is the timestamp of the (i+1)-th data packet, both of which can be read from the RTP header of the data packet.
  • B_i = LIP_i - HIP_i - HUDP_i - HRTP_i;
  • where LIP_i is the number of bits of the i-th data packet, which can be obtained directly from the IP header; HIP_i is the IP protocol header length of the i-th data packet; HUDP_i is the UDP protocol header length of the i-th data packet; and HRTP_i is the RTP protocol header length of the i-th IP data packet.
  • The voice load and the voice duration Duration_max of the i-th data packet are recorded, where the voice load refers to the number of RTP payload bits when the packet load is maximum, denoted B_max. The i-th data packet is then generally considered non-silent, and the non-silence code rate of the i-th data packet is obtained from B_max and Duration_max.
  • The sequence number field in the RTP header indicates the order of the data packets, so the frame loss positions and the number of lost frames can be determined from the RTP sequence number of each data packet (a parsing sketch is given below).
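  • A hedged sketch of this header parsing is shown below: per-packet duration from consecutive RTP timestamps, payload bits from the IP/UDP/RTP header lengths, and frame-loss detection from gaps in the RTP sequence numbers. The Packet structure is an assumption made for the example; real RTP parsing would read these fields from the captured packets.

```python
from dataclasses import dataclass

# Hedged sketch of the packet-header parsing described above.
@dataclass
class Packet:
    seq: int          # RTP sequence number
    timestamp: int    # RTP timestamp
    ip_bits: int      # LIP_i: total bits of the IP packet
    hip_bits: int     # HIP_i: IP header bits
    hudp_bits: int    # HUDP_i: UDP header bits
    hrtp_bits: int    # HRTP_i: RTP header bits

def parse(packets):
    packets = sorted(packets, key=lambda p: p.seq)
    durations = [b.timestamp - a.timestamp for a, b in zip(packets, packets[1:])]
    payload_bits = [p.ip_bits - p.hip_bits - p.hudp_bits - p.hrtp_bits
                    for p in packets]
    lost_seqs = [seq for a, b in zip(packets, packets[1:])
                 for seq in range(a.seq + 1, b.seq)]   # missing sequence numbers
    return durations, payload_bits, lost_seqs
```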
  • The parsing module performs adaptive multi-rate (AMR) decoding according to the voice load to obtain the AMR-decoded speech signal.
  • The parsing module then calculates the frame energy of each frame in the AMR-decoded speech signal and the average frame energy, according to the duration of the speech sequence and the number of bits of the speech sequence.
  • The frame energy of each frame can be quantized according to the auditory characteristics and subjective experience of the human ear; if the frame energy of a frame is greater than 0, it is a speech frame, and the average energy of the speech frames is calculated accordingly to obtain the average frame energy (an energy-computation sketch is given below).
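  • The sketch below illustrates one possible frame-energy computation on the decoded signal; the 20 ms frame length, the 8 kHz sampling rate, and the crude log-energy quantization threshold are assumptions for the example, not values taken from the patent.

```python
import numpy as np

# Hedged sketch: frame energy of the AMR-decoded signal and the average energy
# of the frames whose quantized energy is greater than 0 (treated as speech).
def frame_energies(samples, sample_rate=8000, frame_ms=20):
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = np.reshape(np.asarray(samples[:n_frames * frame_len], dtype=np.float64),
                        (n_frames, frame_len))
    raw = np.sum(frames ** 2, axis=1)
    energies = np.maximum(0.0, 10.0 * np.log10(raw + 1e-12) - 30.0)  # assumed quantization
    speech = energies[energies > 0]
    avg_speech_energy = float(speech.mean()) if speech.size else 0.0
    return energies, avg_speech_energy
```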
  • The detecting module determines, in the data packets, the lost frame portion that needs to be detected according to the frame loss position.
  • One lost frame portion may include multiple consecutive lost frames.
  • The detecting module determines, according to the calculated frame energy and the average frame energy, the frame content characteristics of the previous adjacent un-lost frame of the lost frame portion and of the next adjacent un-lost frame.
  • For example, if the currently lost frames are the n-th to (n+m-1)-th frames, the detecting module can determine, according to the duration of the speech sequence, the number of bits of the speech sequence, and the voice load, the frame content characteristics of the (n-1)-th frame and of the (n+m)-th frame, respectively.
  • The frame content characteristics may include silence frames and speech frames. Determining the frame content characteristics of an un-lost frame according to the calculated frame energy and the average frame energy includes: if the frame energy of the un-lost frame is less than or equal to 0, determining that it is a silence frame; if the frame energy is greater than 0 but less than the average frame energy, determining that it is a non-key speech frame; and if the frame energy is greater than the average frame energy, determining that the un-lost frame is a key speech frame.
  • The detecting module determines the frame content characteristics of the lost frame portion according to the frame content characteristics of the previous adjacent un-lost frame and of the next adjacent un-lost frame, which may be as follows:
  • Suppose the currently lost frames are the n-th to (n+m-1)-th frames (that is, the lost frame portion is the n-th to (n+m-1)-th frames), the previous adjacent un-lost frame is the (n-1)-th frame, and the next adjacent un-lost frame is the (n+m)-th frame.
  • If the (n-1)-th frame and the (n+m)-th frame are both silence frames, the lost frame portion is determined to be silence frames.
  • If the (n-1)-th frame and the (n+m)-th frame are both non-key speech frames, the lost frame portion is determined to be non-key speech frames.
  • If the (n-1)-th frame is a key speech frame and the (n+m)-th frame is a silence frame, the first half of the lost frame portion is determined to be key speech frames and the second half non-key speech frames.
  • If the (n-1)-th frame is a silence frame and the (n+m)-th frame is a key speech frame, the first half of the lost frame portion is determined to be non-key speech frames and the second half key speech frames.
  • For the remaining combinations (the (n-1)-th frame a key speech frame and the (n+m)-th frame a non-key speech frame, and vice versa; the (n-1)-th frame a non-key speech frame and the (n+m)-th frame a silence frame, and vice versa), the lost frame portion is determined in a corresponding manner according to the two neighbouring frame types; a decision sketch covering the explicitly stated cases is given below.
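  • The sketch below implements the decision just described. Only the four neighbour combinations spelled out explicitly in the text are coded literally; the remaining combinations are grouped under a simplified default that stands in for the patent's full case list.

```python
# Hedged sketch of the neighbour-based classification of a lost frame run.
KEY, NONKEY, SILENCE = "key", "nonkey", "silence"

def classify_lost_run(prev_label, next_label, run_length):
    half = run_length // 2
    if prev_label == SILENCE and next_label == SILENCE:
        return [SILENCE] * run_length
    if prev_label == NONKEY and next_label == NONKEY:
        return [NONKEY] * run_length
    if prev_label == KEY and next_label == SILENCE:
        return [KEY] * half + [NONKEY] * (run_length - half)
    if prev_label == SILENCE and next_label == KEY:
        return [NONKEY] * half + [KEY] * (run_length - half)
    # Remaining combinations: simplified default for illustration only.
    return [KEY if KEY in (prev_label, next_label) else NONKEY] * run_length
```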
  • the dividing unit divides the speech sequence according to the determined frame content characteristic, and divides the divided statement into multiple frame dropping events.
  • The preset number Ns of consecutive silence frames can be set according to actual application requirements; for example, Ns can be set to 6.
  • The preset number of times and the preset distance may likewise be set according to actual application requirements; for example, the preset number of times can be set to 6 and the preset distance to 10 (a division sketch is given below).
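  • The sketch below shows one way the two division rules could be applied: a run of Ns consecutive silence frames closes a statement, and loss positions within the preset distance of each other are merged into one frame loss event. The list-based representation is an assumption for this example.

```python
# Hedged sketch of the statement and frame-loss-event division rules.
def split_statements(labels, ns=6):
    statements, start, silence_run = [], 0, 0
    for i, lab in enumerate(labels):
        silence_run = silence_run + 1 if lab == "silence" else 0
        if silence_run == ns:              # Ns consecutive silence frames reached
            statements.append((start, i - ns + 1))
            start = i + 1
    statements.append((start, len(labels)))
    return statements

def split_loss_events(loss_positions, preset_distance=10):
    events, current = [], []
    for pos in sorted(loss_positions):
        if current and pos - current[-1] > preset_distance:
            events.append(current)
            current = []
        current.append(pos)
    if current:
        events.append(current)
    return events
```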
  • the non-speech parameter extraction module extracts non-speech parameters according to the frame loss event.
  • The non-speech parameters may include position parameters and discrete distribution parameters; specifically, they may include the distance between a non-key speech frame and the key speech frame, the number of lost speech frames, the one-time loss length of the speech frames, and the damage length, as follows:
  • The distance L_j between a non-key speech frame and the key speech frame: according to the auditory perception characteristics of the human ear, the farther a lost non-key speech frame is from the adjacent key speech frame, the smaller the distortion it causes.
  • The number of lost speech frames N_1: the number of speech frames lost in the frame loss event.
  • The one-time loss length N_0k: the number of consecutive speech frames lost each time a frame loss occurs.
  • The damage length L_k: the number of speech frames that are not lost between two adjacent frame loss occurrences within the event.
  • The voice quality assessment module performs distortion mapping on the frame loss events, based on the obtained non-speech parameters and according to the preset voice quality assessment model, to obtain the total number of lost speech frames.
  • For example, this may specifically be as follows:
  • The non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames, where:
  • FLN_{i,j} is the number of key speech frames to which the j-th non-key speech frame in the i-th frame loss event is mapped,
  • and L_j is the distance between the j-th non-key speech frame and the key speech frame.
  • The total number of lost speech frames can then be obtained, where:
  • FLN_i is the total number of lost speech frames (i.e., the equivalent number of lost key speech frames) obtained by mapping the i-th frame loss event, and n_i is the number of key speech frames actually lost.
  • The damaged frames in the frame loss event (the un-lost speech frames between loss occurrences) are mapped to a number of lost speech frames according to the number of lost speech frames, the one-time loss length, and the damage length, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost speech frames; for example, specifically:
  • N_0k is the one-time loss length of the speech frames,
  • a_0k represents the impairment effect of the number of lost speech frames and the one-time loss length on a single un-lost (damaged) speech frame,
  • L_k is the damage length at the k-th occurrence of frame loss in the frame loss event,
  • and FLN_{i,k} is the number of lost speech frames to which a single damaged frame among the L_k damaged speech frames in the i-th frame loss event is mapped.
  • the frame loss event is mapped to the total number of lost voice frames, which can be:
  • FLN_i is the total number of lost speech frames (i.e., the total number of speech frame losses) obtained by mapping the frame loss event, and n_i represents the number of key speech frames actually lost.
  • the voice quality assessment module calculates a voice quality of the statement according to the total number of lost voice frames, as follows:
  • the number of lost speech frames for a statement is:
  • FLN = f(FLN_1, FLN_2, ..., FLN_M);
  • where M is the number of frame loss events in the statement,
  • and FLN_i is the total number of lost speech frames obtained by mapping the i-th frame loss event.
  • The function can, for example, be as follows:
  • MOS_0 = f(R);
  • and the relationship between the statement distortion and the statement quality can be as follows:
  • where D is the statement distortion,
  • MOS_0 is the quality of the statement without packet loss (i.e., reflecting only the compression distortion of the statement),
  • Q_n is the quality of the statement considering packet loss,
  • and a and b are fixed model parameters, which can be obtained through training.
  • The voice quality evaluation module evaluates the voice quality of the speech sequence according to the voice quality of each statement, that is, it combines the voice quality of every statement in the speech sequence to obtain the voice quality Q of the speech sequence, where:
  • Q is the speech quality of the speech sequence
  • Q n is the quality of the statement considering the loss of the packet
  • N is the number of statements in the speech sequence.
  • In summary, this embodiment uses a bitstream-layer-model-based method: the acquired network voice data packets are parsed, the frame content characteristics of the packets are determined according to the parsing result (for example, distinguishing silence frames from speech frames), the speech sequence is then divided into statements according to the determined frame content characteristics, and each statement is divided into multiple frame loss events; after non-speech parameters are extracted according to the frame loss events, the speech quality of each statement is evaluated according to these non-speech parameters using the preset speech quality assessment model, and finally the speech quality of the entire speech sequence is evaluated according to the speech quality of each statement.
  • In this scheme, the statement division of the speech sequence and the division into frame loss events make the frame loss mode within a single frame loss event relatively simple, so that the distortion effect caused by each frame loss event is easy to study. Moreover, because the frame content characteristics (such as whether a frame is a silence frame or a speech frame) and the frame loss positions are also taken into account when evaluating network voice quality, the accuracy of the network voice quality assessment can be effectively improved compared with prior-art schemes that measure it only according to an average distortion condition; that is, the scheme can greatly improve the prediction precision and thus the accuracy of the evaluation results.
  • The voice quality evaluation module performs distortion mapping on the frame loss events, based on the obtained non-speech parameters and according to the preset voice quality evaluation model, to obtain the total number of lost speech frames.
  • When mapping non-key speech frames, the non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames, as follows:
  • Specifically, this can be such that:
  • FLN i,j is the number of key speech frames mapped by the jth non-key speech frame in the i-th frame loss event
  • L j is the distance between the j-th non-key speech frame and the key speech frame.
  • the total number of lost speech frames can be:
  • the FLN i is the total number of lost speech frames (ie, the total number of key speech frame losses) obtained by mapping the i-th frame loss event, and n i represents the number of key speech frames actually lost.
  • The damaged frames in the frame loss event are mapped to a number of lost speech frames according to the one-time loss length and the damage length, the number of key speech frames actually lost is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost speech frames; for example, specifically:
  • FLN_{i,k} is the number of lost speech frames to which a single damaged frame among the L_k damaged speech frames in the i-th frame loss event is mapped,
  • a_0k is the impairment effect of the one-time loss length on a single un-lost speech frame in the frame loss event,
  • and the parameters a_1, b_1, c_1, a_2, b_2 and c_2 can be obtained by training.
  • the frame loss event is mapped to the total number of lost voice frames, which can be:
  • FLN i is the total number of lost speech frames (ie, the total number of lost speech frames) obtained by mapping the frame loss event, and n i represents the number of key speech frames actually lost.
  • The distortion-mapping processing of this embodiment is largely consistent with that of Embodiments 2 and 3; the difference is that, when mapping the damaged frames, the solution adopted in this embodiment only needs to consider the one-time loss length of the speech frames and the damage length, whereas the number of lost speech frames must additionally be considered in Embodiments 2 and 3. In actual application, the variant can be chosen according to the requirements.
  • Apart from the slightly different distortion-mapping method, the implementation of the other steps is the same as in Embodiments 2 and 3; for details, refer to Embodiments 2 and 3, which are not repeated here.
  • This embodiment can also achieve the same beneficial effects as Embodiments 2 and 3.
  • In the foregoing embodiments, the extracted non-speech parameters mainly include the distance between the non-key speech frame and the key speech frame, the number of lost speech frames, the one-time loss length of the speech frames, the damage length, and the like.
  • The non-speech parameters extracted in this embodiment may instead include the distance between the non-key speech frame and the key speech frame, the number of lost speech frames, the average loss length, and the average damage length, as follows:
  • The distance L_j between a non-key speech frame and the key speech frame: according to the auditory perception characteristics of the human ear, the farther a lost non-key speech frame is from the adjacent key speech frame, the smaller the distortion it causes.
  • The number of lost speech frames N_1: the number of speech frames lost in the frame loss event.
  • The damage length L_k: the number of speech frames that are not lost between two adjacent frame loss occurrences within the event.
  • Since the extracted non-speech parameters are different, the subsequent distortion mapping of the frame loss events is also different.
  • In the preceding embodiments, the distortion of single frames needs to be calculated first;
  • in this embodiment, the distortion of the entire frame loss event can be calculated directly, as follows:
  • In the case of continuous frame loss, the non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames.
  • Specifically, this can be such that:
  • FLN_{i,j} is the number of key speech frames to which the j-th non-key speech frame in the i-th frame loss event is mapped,
  • and L_j is the distance between the j-th non-key speech frame and the key speech frame.
  • The total number of lost speech frames can then be obtained, where:
  • FLN_i is the total number of lost speech frames (i.e., the equivalent number of lost key speech frames) obtained by mapping the i-th frame loss event, and n_i is the number of key speech frames actually lost.
  • In the case of discrete frame loss, the frame loss event is mapped to the total number of lost speech frames according to the average loss length and the average damage length; the formula can be such that:
  • FLN_i is the total number of lost speech frames to which the frame loss event is mapped, N_0 is the average loss length of the speech frames, and L is the average damage length; the parameters a_1, b_1, c_1, a_2, b_2, a_3 and b_3 can be obtained through training (an illustrative event-level mapping sketch is given below).
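  • As a purely illustrative sketch of such an event-level mapping, the function below combines the number of key speech frames actually lost with penalties derived from the average loss length and the average damage length; the functional form and the default parameter values are assumptions standing in for the trained formula above.

```python
import math

# Hedged sketch of an event-level distortion mapping driven by the average
# loss length N0 and the average damage length L; placeholders, not the patent's formula.
def event_to_total_lost(n_key_lost, avg_loss_len_n0, avg_damage_len_l,
                        a1=1.0, b1=0.3, a2=1.0, b2=0.1):
    spread_penalty = a1 * (1.0 - math.exp(-b1 * avg_loss_len_n0))
    damage_penalty = a2 * math.exp(-b2 * avg_damage_len_l)
    return n_key_lost * (1.0 + spread_penalty * damage_penalty)
```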
  • That is, in the case of continuous frame loss, this embodiment is consistent with the distortion-mapping processing of Embodiments 2, 3 and 4; but in the case of discrete frame loss, the solution adopted in this embodiment differs from those of Embodiments 2, 3 and 4.
  • In Embodiments 2, 3 and 4, the distortion of each single frame needs to be calculated first, and the single-frame distortions are then combined to obtain the distortion of the entire frame loss event;
  • in this embodiment, the distortion of the entire frame loss event is calculated directly according to the average loss length of the speech frames, the average damage length, and so on.
  • This embodiment can also achieve the same beneficial effects as the second, third, and fourth embodiments.
  • the embodiment of the present invention further provides a network voice quality evaluation apparatus.
  • The network voice quality evaluation apparatus includes an obtaining unit 401, a parsing unit 402, a determining unit 403, a dividing unit 404, an extracting unit 405, and an evaluation unit 406.
  • the obtaining unit 401 is configured to acquire a data packet of the network voice, where the data packet of the network voice includes a voice sequence;
  • the parsing unit 402 is configured to parse the data packet acquired by the obtaining unit to obtain an analysis result.
  • The determining unit 403 is configured to determine the frame content characteristics of the data packet according to the parsing result obtained by the parsing unit, where the frame content characteristics may include silence frames and speech frames.
  • the dividing unit 404 is configured to perform statement segmentation on the speech sequence according to the frame content characteristic determined by the determining unit, and divide the segmented statement into a plurality of frame dropping events.
  • the extracting unit 405 is configured to extract a non-speech parameter according to the frame dropping event divided by the dividing unit, where the non-speech parameter includes a position parameter and a discrete distribution parameter.
  • The evaluation unit 406 is configured to evaluate the voice quality of each statement according to the preset voice quality evaluation model based on the non-speech parameters extracted by the extracting unit, obtain the voice quality of each statement, and evaluate the voice quality of the speech sequence according to the voice quality of each statement.
  • Depending on the network voice evaluation model used, the method of parsing the data packets also differs. For example, taking the packet layer model and the bitstream layer model as examples, the details may be as follows:
  • the parsing unit 402 may be configured to parse the packet header of the data packet to obtain an analysis result, where the parsing result may include a duration of the speech sequence, a bit number of the speech sequence, a frame loss position, and a voice load.
  • the method for obtaining the above parameters may be specifically as follows:
  • Duration_i = Timestamp_{i+1} - Timestamp_i;
  • where Timestamp_i is the timestamp of the i-th data packet and Timestamp_{i+1} is the timestamp of the (i+1)-th data packet, both of which can be read from the RTP header of the data packet.
  • B_i = LIP_i - HIP_i - HUDP_i - HRTP_i;
  • where LIP_i is the number of bits of the i-th data packet, which can be obtained directly from the IP header; HIP_i is the IP protocol header length of the i-th data packet; HUDP_i is the UDP protocol header length of the i-th data packet; and HRTP_i is the RTP protocol header length of the i-th IP data packet.
  • The voice load and the voice duration Duration_max of the i-th data packet are recorded, where the voice load refers to the number of RTP payload bits when the packet load is maximum, denoted B_max. The i-th data packet is then generally considered non-silent, and the non-silence code rate of the i-th data packet is obtained from B_max and Duration_max.
  • The sequence number field in the RTP header indicates the order of the data packets, so the frame loss positions and the number of lost frames can be determined from the RTP sequence number of each data packet.
  • The determining unit 403 is specifically configured to determine, in the data packets, the lost frame portion that needs to be detected according to the frame loss position; to determine, according to the duration of the speech sequence, the number of bits of the speech sequence, and the voice load, the frame content characteristics of the previous adjacent un-lost frame and of the next adjacent un-lost frame of the lost frame portion, as well as the mark of the next adjacent un-lost frame; and to determine the frame content characteristics of the lost frame portion according to the frame content characteristics of the previous adjacent un-lost frame, the frame content characteristics of the next adjacent un-lost frame, and the mark of the next adjacent un-lost frame.
  • Determining the frame content characteristics of an un-lost frame according to the duration of the speech sequence, the number of bits of the speech sequence, and the voice load may specifically include: obtaining the actual payload length of the un-lost frame; determining the code rate (i.e., the coding rate) according to the voice load, the number of bits of the speech sequence, and the duration of the speech sequence; if the standard payload length corresponding to the code rate is consistent with the actual payload length, determining that the un-lost frame is a speech frame; and if the standard payload length corresponding to the code rate is inconsistent with the actual payload length, determining that the un-lost frame is a silence frame. That is:
  • The determining unit 403 may be specifically configured to obtain the actual payload length of the un-lost frame; determine the code rate according to the voice load, the number of bits of the speech sequence, and the duration of the speech sequence; if the standard payload length corresponding to the code rate is consistent with the actual payload length, determine that the un-lost frame is a speech frame; and if the standard payload length corresponding to the code rate is inconsistent with the actual payload length, determine that the un-lost frame is a silence frame (a payload-length check sketch is given below).
  • Determining the frame content characteristics of the lost frame portion according to the frame content characteristics of the previous adjacent un-lost frame, the frame content characteristics of the next adjacent un-lost frame, and the mark of the next adjacent un-lost frame may specifically be as follows: if the previous adjacent un-lost frame and the next adjacent un-lost frame are both determined to be silence frames, or the mark of the next adjacent un-lost frame indicates that it is the first speech frame (for example, marked as 1), the lost frame portion is determined to be silence frames; otherwise, the lost frame portion is determined to be speech frames. That is:
  • The determining unit 403 may be specifically configured to: if the previous adjacent un-lost frame and the next adjacent un-lost frame are both determined to be silence frames, or the mark of the next adjacent un-lost frame indicates that it is the first speech frame, determine that the lost frame portion is silence frames; otherwise, determine that the lost frame portion is speech frames.
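  • The sketch below illustrates the payload-length check just described: the standard AMR payload size for the estimated code rate is compared with the actual payload length of the un-lost frame. The partial size table is given only as an example.

```python
# Hedged sketch of the packet-layer speech/silence decision for an un-lost frame.
STANDARD_PAYLOAD_BITS = {4.75: 95, 5.15: 103, 12.2: 244}  # kb/s -> AMR frame bits (partial, illustrative)

def classify_unlost_frame(actual_payload_bits, rate_kbps):
    standard = STANDARD_PAYLOAD_BITS.get(rate_kbps)
    if standard is None:
        raise ValueError("code rate not present in this illustrative table")
    return "speech" if actual_payload_bits == standard else "silence"
```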
  • the speech frame can also be divided into key speech frames and non-critical speech frames, so that different processing can be subsequently performed for these key speech frames and non-critical speech frames.
  • The key speech frame refers to a frame that has a great influence on the voice quality,
  • while the non-key speech frame refers to a frame that has less influence on the voice quality.
  • the “determining the lost frame part is a voice frame” may specifically include the following situations:
  • the determining unit 403 can be specifically used to perform the operations of the above a to c.
  • The bitstream layer model needs to parse the voice payload part in addition to the packet header of the data packet, as follows:
  • the parsing unit 402 may be configured to parse the packet header of the data packet to obtain an analysis result, where the parsing result includes a duration of the speech sequence, a bit number of the speech sequence, a frame loss position, a voice load, and the like;
  • the AMR decoding is performed on the voice load to obtain the AMR decoded speech signal; and the frame energy and the average frame energy of each frame in the AMR decoded speech signal are calculated according to the duration of the speech sequence and the number of bits of the speech sequence;
  • The determining unit 403 may be specifically configured to determine, in the data packets, the lost frame portion that needs to be detected according to the frame loss position; to determine, according to the calculated frame energy and average frame energy, the frame content characteristics of the previous adjacent un-lost frame and of the next adjacent un-lost frame of the lost frame portion; and to determine the frame content characteristics of the lost frame portion according to the frame content characteristics of the previous adjacent un-lost frame and of the next adjacent un-lost frame.
  • The frame content characteristics of an un-lost frame are determined according to the calculated frame energy and average frame energy as follows: if the frame energy of the un-lost frame is less than or equal to 0, the un-lost frame is determined to be a silence frame; if the frame energy of the un-lost frame is greater than 0 but less than the average frame energy, the un-lost frame is determined to be a non-key speech frame; and if the frame energy of the un-lost frame is greater than the average frame energy, the un-lost frame is determined to be a key speech frame. That is:
  • The determining unit 403 is specifically configured to: if the frame energy of the un-lost frame is less than or equal to 0, determine that the un-lost frame is a silence frame; if the frame energy of the un-lost frame is greater than 0 but less than the average frame energy, determine that the un-lost frame is a non-key speech frame; and if the frame energy of the un-lost frame is greater than the average frame energy, determine that the un-lost frame is a key speech frame.
  • Determining the frame content characteristics of the lost frame portion according to the frame content characteristics of the previous adjacent un-lost frame and of the next adjacent un-lost frame may specifically follow the cases a to i described above (for example, in some of these cases the second half of the lost frame portion is determined to be key speech frames).
  • The determining unit 403 can be specifically used to perform the operations of the above cases a to i.
  • The dividing unit 404 may be specifically configured to: when it is determined that the number of consecutive silence frames exceeds the preset number, divide the speech sequence before the silence frames into a statement; when the distance between two adjacent frame loss portions in a statement is determined to be less than or equal to the preset distance, determine the two adjacent frame loss portions as one frame loss event; and when the distance between two adjacent frame loss portions in a statement is greater than the preset distance, determine the two adjacent frame loss portions as two frame loss events.
  • the preset number of times and the preset distance may be set according to actual application requirements, for example, the preset number of times may be set to 6, the preset distance may be set to 10, and the like.
  • The evaluation unit 406 may be specifically configured to perform distortion mapping on the frame loss events according to the preset voice quality evaluation model based on the non-speech parameters extracted by the extracting unit 405, to obtain the total number of lost speech frames, and then calculate the voice quality of the statement according to the total number of lost speech frames.
  • The step of "performing distortion mapping on the frame loss events according to the preset voice quality evaluation model based on the non-speech parameters to obtain the total number of lost speech frames" may be implemented by any one of the following methods:
  • the non-speech parameters may include a position parameter and a discrete distribution parameter, etc., wherein the non-speech parameter may include a distance between the non-critical speech frame and the key speech frame, a number of lost frames of the speech frame, a length of the lost speech frame, and a length of the damage.
  • the evaluation unit 406 can be specifically configured to perform the following operations:
  • The non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames, where:
  • FLN_{i,j} is the number of key speech frames to which the j-th non-key speech frame in the i-th frame loss event is mapped,
  • and L_j is the distance between the j-th non-key speech frame and the key speech frame.
  • The total number of lost speech frames can then be obtained, where:
  • FLN_i is the total number of lost speech frames (i.e., the equivalent number of lost key speech frames) obtained by mapping the i-th frame loss event, and n_i is the number of key speech frames actually lost.
  • The damaged frames in the frame loss event are mapped to a number of lost speech frames according to the number of lost speech frames, the one-time loss length, and the damage length, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost speech frames; for example, specifically:
  • N_0k is the one-time loss length of the speech frames,
  • a_0k represents the impairment effect of the number of lost speech frames and the one-time loss length on a single un-lost (damaged) speech frame,
  • L_k is the damage length at the k-th occurrence of frame loss in the frame loss event,
  • and FLN_{i,k} is the number of lost speech frames to which a single damaged frame among the L_k damaged speech frames in the i-th frame loss event is mapped.
  • the parameters a 1 , b 1 , c 1 , a 2 , b 2 , c 2 , a 3 , b 3 and c 3 can be obtained by training.
  • the frame loss event is mapped to the total number of lost voice frames, which can be:
  • FLN i is the total number of lost speech frames (ie, the total number of lost speech frames) obtained by mapping the frame loss event, and n i represents the number of key speech frames actually lost.
  • the non-speech parameters may include a position parameter and a discrete distribution parameter, etc., wherein the non-speech parameter includes a distance between the non-critical speech frame and the key speech frame, a number of frames lost by the speech frame, a length of the lost speech frame, and a length of the damage.
  • the evaluation unit 406 can be specifically configured to perform the following operations:
  • The non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames.
  • The damaged frames in the frame loss event are mapped to a number of lost speech frames according to the one-time loss length and the damage length, the number of key speech frames actually lost is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost speech frames; for example, specifically:
  • FLN_{i,k} is the number of lost speech frames to which a single damaged frame among the L_k damaged speech frames in the i-th frame loss event is mapped,
  • a_0k is the impairment effect of the one-time loss length on a single un-lost speech frame in the frame loss event,
  • and the parameters a_1, b_1, c_1, a_2, b_2 and c_2 can be obtained by training.
  • the frame loss event is mapped to the total number of lost voice frames, which can be:
  • FLN i is the total number of lost speech frames (ie, the total number of lost speech frames) obtained by mapping the frame loss event, and n i represents the number of key speech frames actually lost.
  • the third method no longer calculates the distortion of a single frame, but directly calculates the distortion of the entire frame loss event.
  • the non-speech parameter may include a position parameter and a discrete distribution parameter, and the non-speech parameter may include a distance between the non-key speech frame and the key speech frame, a number of lost frames of the speech frame, an average loss length, and an average damage length.
  • the evaluation unit 406 can be specifically configured to perform the following operations:
  • In the case of continuous frame loss, the non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, the number of key speech frames actually lost is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames.
  • In the case of discrete frame loss, the frame loss event is mapped to the total number of lost speech frames according to the average loss length and the average damage length; the formula can be such that:
  • FLN_i is the total number of lost speech frames to which the frame loss event is mapped, N_0 is the average loss length of the speech frames, and L is the average damage length; the parameters a_1, b_1, c_1, a_2, b_2, a_3 and b_3 can be obtained through training.
  • After the total number of lost speech frames is obtained, the voice quality of the statement can be calculated according to the total number of lost speech frames, as follows:
  • The number of lost speech frames for a statement is:
  • FLN = f(FLN_1, FLN_2, ..., FLN_M);
  • where M is the number of frame loss events in the statement,
  • and FLN_i is the total number of lost speech frames obtained by mapping the i-th frame loss event.
  • MOS_0 = f(R);
  • since the code rate R and MOS_0 have a fixed correspondence, the quality evaluation can directly look up a table to obtain MOS_0 (a hedged statement-quality sketch is given below).
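  • As a hedged placeholder for this statement-quality step, the sketch below turns the total number of lost frames into a distortion value and then into the statement quality Q_n, starting from the compression-only quality MOS_0; the saturating form and the parameters a and b are stand-ins for the trained model parameters, which are not reproduced in this text.

```python
import math

# Hedged placeholder for the FLN -> D -> Q_n step described above.
def statement_quality(fln_total, mos0, a=0.1, b=1.0):
    distortion = 1.0 - math.exp(-a * fln_total)          # assumed saturating distortion D
    return 1.0 + (mos0 - 1.0) * (1.0 - b * distortion)   # Q_n equals MOS_0 when D is 0
```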
  • In a specific implementation, the foregoing units may each be implemented as separate entities, or may be combined arbitrarily and implemented as one or several entities.
  • For the specific implementation of the foregoing units, refer to the foregoing embodiments; details are not described here again.
  • the network voice quality assessment device may be specifically integrated in a network side device such as a server.
  • As can be seen from the above, in the network voice quality evaluation apparatus of this embodiment, the parsing unit 402 parses the network voice data packets acquired by the obtaining unit 401, and the determining unit 403 determines the frame content characteristics of the data packets according to the parsing result, for example, distinguishing silence frames from speech frames; the dividing unit 404 then divides the speech sequence into statements according to the determined frame content characteristics and divides each statement into multiple frame loss events; the extracting unit 405 extracts non-speech parameters according to the frame loss events; and the evaluation unit 406 evaluates the voice quality of each statement according to the preset voice quality evaluation model based on the non-speech parameters, and finally evaluates the voice quality of the entire speech sequence according to the voice quality of each statement.
  • In this scheme, the statement division of the speech sequence and the division into frame loss events make the frame loss mode within a single frame loss event relatively simple, so that the distortion effect caused by each frame loss event is easy to study; this can greatly improve the prediction precision and thus the accuracy of the evaluation results.
  • Correspondingly, an embodiment of the present invention further provides a communication system, which includes any of the network voice quality evaluation apparatuses provided by the embodiments of the present invention; for details of the network voice quality evaluation apparatus, refer to the foregoing embodiments.
  • Since the communication system can include any of the network voice quality evaluation apparatuses provided by the embodiments of the present invention, it can achieve the beneficial effects of any of those apparatuses; details are not described here again.
  • an embodiment of the present invention further provides a network side device, including a memory 501 for storing data, a transceiver interface 502 for transmitting and receiving data, and a processor 503; wherein:
  • The processor 503 is configured to: obtain network voice data packets through the transceiver interface 502, where the network voice data packets carry a speech sequence; parse the data packets to obtain a parsing result; determine the frame content characteristics of the data packets according to the parsing result; divide the speech sequence into statements according to the determined frame content characteristics, and divide each statement into multiple frame loss events; extract non-speech parameters according to the frame loss events; evaluate the voice quality of each statement according to the preset voice quality evaluation model based on the non-speech parameters, to obtain the voice quality of each statement; and evaluate the voice quality of the speech sequence according to the voice quality of each statement.
  • The processor 503 parses the data packets differently depending on the network voice evaluation model used. For example, taking the packet layer model and the bitstream layer model as examples, the processor 503 can parse the data packets as follows:
  • the packet header of the data packet can be parsed to obtain an analysis result.
  • the parsing result may include the duration of the speech sequence, the number of bits of the speech sequence, the frame loss position, and the voice load.
  • the method for obtaining the above parameters may be specifically as follows:
  • Duration_i = Timestamp_{i+1} - Timestamp_i;
  • where Timestamp_i is the timestamp of the i-th data packet and Timestamp_{i+1} is the timestamp of the (i+1)-th data packet, both of which can be read from the Real-time Transport Protocol (RTP) header of the data packet.
  • B_i = LIP_i - HIP_i - HUDP_i - HRTP_i;
  • where LIP_i is the number of bits of the i-th data packet, which can be obtained directly from the IP header; HIP_i is the IP protocol header length of the i-th data packet; HUDP_i is the User Datagram Protocol (UDP) header length of the i-th data packet; and HRTP_i is the RTP protocol header length of the i-th IP data packet.
  • The voice load and the voice duration Duration_max of the i-th data packet are recorded, where the voice load refers to the number of RTP payload bits when the packet load is maximum, denoted B_max. The i-th data packet is then generally considered non-silent, and the non-silence code rate of the i-th data packet is obtained from B_max and Duration_max.
  • The sequence number field in the RTP header indicates the order of the data packets, so the frame loss positions and the number of lost frames can be determined from the RTP sequence number of each data packet.
  • The bitstream layer model needs to parse the voice payload part in addition to the packet header of the data packet, as follows:
  • AMR (adaptive multi-rate) decoding is performed on the voice payload to obtain the AMR-decoded speech signal, and the frame energy and average frame energy of each frame in the AMR-decoded speech signal are calculated according to the duration of the speech sequence and the number of bits of the speech sequence.
  • The frame energy of each frame can be quantized according to the auditory characteristics and subjective experience of the human ear; if the frame energy of a frame is greater than 0, it is a speech frame, and the average energy of the speech frames is calculated accordingly to obtain the average frame energy.
  • determining the frame content characteristics of the data packet can be specifically as follows:
  • A. Determine, in the data packet, the portion of the lost frame that needs to be detected according to the frame loss position.
  • B. Determine, according to the duration of the speech sequence, the number of bits of the speech sequence, and the voice load, the frame content characteristics of the un-lost frames adjacent to the lost frame portion, which may include: obtaining the actual payload length of the un-lost frame; determining the code rate (i.e., the coding rate) according to the voice load, the number of bits of the speech sequence, and the duration of the speech sequence; if the standard payload length corresponding to the code rate is consistent with the actual payload length, determining that the un-lost frame is a speech frame; otherwise, determining that the un-lost frame is a silence frame.
  • C. If the previous adjacent un-lost frame and the next adjacent un-lost frame are both silence frames, or the mark of the next adjacent un-lost frame indicates that it is the first speech frame (for example, marked as 1), determine that the lost frame portion is silence frames; otherwise, determine that the lost frame portion is speech frames.
  • the speech frame can also be divided into key speech frames and non-critical speech frames, so that different processing can be subsequently performed for these key speech frames and non-critical speech frames.
  • The key speech frame refers to a frame that has a great influence on the voice quality, while the non-key speech frame refers to a frame that has less influence on the voice quality.
  • If the speech frames are further divided into key speech frames and non-key speech frames, the step of "determining that the lost frame portion is speech frames" may include the following cases:
  • The frame content detection of the bitstream layer model is more elaborate than that of the packet layer model; for example, here the speech frames can further include key speech frames and non-key speech frames, and the like.
  • the operation of "determining the frame content characteristics of the packet according to the obtained parsing result" may specifically be as follows:
  • A. Determine, in the data packet, a portion of the lost frame that needs to be detected according to the frame loss position.
  • B. Determine the frame content characteristics of the un-lost frames adjacent to the lost frame portion according to the calculated frame energy and average frame energy: if the frame energy of the un-lost frame is less than or equal to 0, it is a silence frame; if the frame energy is greater than 0 but less than the average frame energy, it is a non-key speech frame; and if the frame energy is greater than the average frame energy, it is a key speech frame.
  • C. Determine the frame content characteristics of the lost frame portion according to the frame content characteristics of the previous adjacent un-lost frame and of the next adjacent un-lost frame, following the cases a to i described above (for example, in some of these cases the second half of the lost frame portion is determined to be key speech frames).
  • the processor 503 may perform the following operations when dividing the statement and the frame dropping event:
  • When the number of consecutive silence frames exceeds the preset number, the speech sequence before the silence frames is divided into a statement;
  • when the distance between two adjacent frame loss portions in a statement is less than or equal to the preset distance, the two adjacent frame loss portions are determined as one frame loss event;
  • and when the distance between two adjacent frame loss portions in a statement is greater than the preset distance, the two adjacent frame loss portions are determined as two frame loss events.
  • the preset number of times and the preset distance may be set according to actual application requirements, for example, the preset number of times may be set to 6, the preset distance may be set to 10, and the like.
  • When evaluating the voice quality of each statement according to the preset voice quality evaluation model based on the non-speech parameters, the processor 503 may perform distortion mapping on the frame loss events according to the preset voice quality evaluation model to obtain the total number of lost speech frames, and then calculate the voice quality of the statement according to the total number of lost speech frames.
  • the "distortion mapping of the frame loss event according to the preset voice quality evaluation model according to the non-speech parameter to obtain the total number of lost voice frames” may be implemented by any one of the following methods:
  • the non-speech parameters may include a position parameter and a discrete distribution parameter, etc., wherein the non-speech parameter may include a distance between the non-critical speech frame and the key speech frame, a number of lost frames of the speech frame, a length of the lost speech frame, and a length of the damage.
  • the step of “distorting the frame loss event according to the preset voice quality evaluation model according to the non-speech parameter to obtain the total number of lost voice frames” may include:
  • The non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames, where:
  • FLN_{i,j} is the number of key speech frames to which the j-th non-key speech frame in the i-th frame loss event is mapped,
  • and L_j is the distance between the j-th non-key speech frame and the key speech frame.
  • The total number of lost speech frames can then be obtained, where:
  • FLN_i is the total number of lost speech frames (i.e., the equivalent number of lost key speech frames) obtained by mapping the i-th frame loss event, and n_i is the number of key speech frames actually lost.
  • The damaged frames in the frame loss event are mapped to a number of lost speech frames according to the number of lost speech frames, the one-time loss length, and the damage length; the number of key speech frames actually lost is determined according to the number of lost speech frames;
  • and the frame loss event is mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost speech frames. For example, the details may be as follows:
  • N_0k is the one-time loss length of the speech frames,
  • a_0k represents the impairment effect of the number of lost speech frames and the one-time loss length on a single un-lost (damaged) speech frame,
  • L_k is the damage length at the k-th occurrence of frame loss in the frame loss event,
  • and FLN_{i,k} is the number of lost speech frames to which a single damaged frame among the L_k damaged speech frames in the i-th frame loss event is mapped.
  • the parameters a 1 , b 1 , c 1 , a 2 , b 2 , c 2 , a 3 , b 3 and c 3 can be obtained by training.
  • the frame loss event is mapped to the total number of lost voice frames, which can be:
  • FLN i is the total number of lost speech frames (ie, the total number of lost speech frames) obtained by mapping the frame loss event, and n i represents the number of key speech frames actually lost.
  • the non-speech parameters may include a position parameter and a discrete distribution parameter, etc., wherein the non-speech parameter includes a distance between the non-critical speech frame and the key speech frame, a number of frames lost by the speech frame, a length of the lost speech frame, and a length of the damage.
  • The non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, and the number of key speech frames actually lost is determined according to the number of lost speech frames.
  • The frame loss event is then mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames.
  • The damaged frames in the frame loss event are mapped to a number of lost speech frames according to the one-time loss length and the damage length, the number of key speech frames actually lost is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost speech frames; for example, specifically:
  • FLN_{i,k} is the number of lost speech frames to which a single damaged frame among the L_k damaged speech frames in the i-th frame loss event is mapped,
  • a_0k is the impairment effect of the one-time loss length on a single un-lost speech frame in the frame loss event,
  • and the parameters a_1, b_1, c_1, a_2, b_2 and c_2 can be obtained by training.
  • the frame loss event is mapped to the total number of lost voice frames, which can be:
  • FLN i is the total number of lost speech frames (ie, the total number of lost speech frames) obtained by mapping the frame loss event, and n i represents the number of key speech frames actually lost.
  • the third method no longer calculates the distortion of a single frame, but directly calculates the distortion of the entire frame loss event.
  • the non-speech parameter may include a position parameter and a discrete distribution parameter, and the non-speech parameter may include a distance between the non-key speech frame and the key speech frame, a number of lost frames of the speech frame, an average loss length, and an average damage length.
  • The step of "mapping the frames lost at different positions in the frame loss event and the frames lost in different discrete distributions to the total number of lost speech frames according to the non-speech parameters" may include the following.
  • In the case of continuous frame loss, the non-key speech frames in the frame loss event are mapped to a number of lost key speech frames according to the distance between each non-key speech frame and the key speech frame, the number of key speech frames actually lost is determined according to the number of lost speech frames, and the frame loss event is mapped to the total number of lost speech frames according to the number of key speech frames actually lost and the mapped number of lost key speech frames.
  • the frame loss event is mapped to the total number of lost voice frames according to the average loss length and the average damage length, and the formula can be:
  • the FLN i maps the frame loss event to the total number of lost speech frames, N 0 is the average loss length of the speech frame, and L is the damage length; wherein the parameters a 1 , b 1 , c 1 , a 2 , b 2 , a 3 and b 3 can be obtained through training.
  • the speech quality of a statement can then be calculated from the total number of lost speech frames, as follows:
  • the number of lost speech frames of a statement is:
  • FLN = f(FLN1, FLN2, ..., FLNM);
  • where M is the number of frame loss events in the statement and FLNi is the total number of lost speech frames mapped from each frame loss event.
  • the statement quality without packet loss is MOS0 = f(R), and the statement quality under packet loss is Qn = f(MOS0, FLN);
  • if a table matching R to MOS0 is built through subjective experiments, MOS0 can be obtained directly by table lookup during quality evaluation.
  • the network-side device in this embodiment parses the acquired network voice data packets, determines the frame content characteristic of each packet from the parsing result (for example, whether a frame is a silence frame or a speech frame), divides the speech sequence into statements according to the determined frame content characteristics, and divides each statement into multiple frame loss events; after extracting non-speech parameters (including position parameters and discrete distribution parameters) from the frame loss events, it evaluates the speech quality of each statement with a preset speech quality evaluation model according to those parameters, and finally evaluates the speech quality of the whole speech sequence from the per-statement results.
  • because the speech sequence is divided into statements and frame loss events, the loss pattern within a single frame loss event is relatively simple, which makes it easy to study the distortion caused by each event; and because the frame content characteristic (silence frame or speech frame) and the frame loss position are also taken into account, the scheme can effectively improve the accuracy of network voice quality evaluation compared with schemes that measure quality only from the average distortion; that is, the scheme can greatly improve prediction precision and therefore the accuracy of the evaluation result.
  • the program may be stored in a computer readable storage medium, and the storage medium may include: Read Only Memory (ROM), Random Access Memory (RAM), disk or optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiments of the present invention disclose a network voice quality evaluation method, apparatus and system. In the embodiments, a data packet of network voice is acquired and parsed, and the frame content characteristic of the data packet, for example whether a frame is a silence frame or a speech frame, is determined according to the parsing result; the speech sequence is then divided into statements according to the determined frame content characteristics, and each statement is divided into multiple frame loss events; after non-speech parameters are extracted from the frame loss events, the speech quality of each statement is evaluated according to the non-speech parameters using a preset speech quality evaluation model; finally, the speech quality of the whole speech sequence is evaluated according to the speech quality of each statement. This solution can greatly improve prediction precision and the accuracy of the evaluation result.

Description

一种网络语音质量评估方法、装置和系统
本申请要求于2014年5月5日提交中国专利局、申请号为201410186706.1、发明名称为“一种网络语音质量评估方法、装置和系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及通信技术领域,具体涉及一种网络语音质量评估方法、装置和系统。
背景技术
近年来,网络电话(VoIP,Voice over Internet Protocol)发展迅速,与传统电话相比,VoIP具有占用网络资源少、成本低等优势。然而网际协议(IP,Internet Protocol)网络只提供尽力而为的服务,语音在传输的过程中会受到丢包等多种网络因素的影响,造成网络语音质量的下降,通过对网络语音质量的监控和反馈,可以调整压缩或传输参数,改善网络语音质量。因此,如何对网络语音质量进行实时和准确可靠的测量与评估是网络测量和网络规划设计中相当关键的问题。
根据模型输入信息类型以及对码流的介入程度,语音质量评估方法可以分为:参数规划模型、包层模型、比特流层模型、媒体层模型以及混合模型等;其中,基于包层模型的网络语音质量评估方法只允许通过分析语音数据包的包头信息评估语音的质量,其计算复杂度低,且适用于无法介入数据包载荷信息的情形。而基于比特流层模型的网络语音质量评估方法则不仅允许分析语音数据包的包头信息,而且还可以对语音负载信息进行分析,甚至可以进行语音解码,比如分析语音信号的波形,以获得更加详细的丢帧和失真信息,从而获得比基于包层模型的网络语音质量评估方法更加精确的预测质量,但是其计算复杂度交包层模型要高。因此,这两种方法各有其优势,是现有技术较为常用的两种网络语音质量评估方法。但是,不管是基于包层模型的网络语音质量评估方法,还是基于比特流层模型的网络语音质量评估方法,在现有技术中,一般都会利用语音的平均压缩码率来评估压缩失真,以及利用平均丢包率来评估丢包引起的失真,然后再根据压缩失真和丢包引起的失真来评估网络语音质量。 在对现有技术的研究和实践过程中,本发明的发明人发现,由于语音的组成较为复杂,比如,语音中经常会出现静音(如说话的间隙等)等情况,而现有方案仅仅只根据其平均失真情况来衡量其语音质量,因此,其预测精度不高,评估结果并不够准确。
发明内容
本发明实施例提供一种网络语音质量评估方法、装置和系统,可以提高预测精度,从而提高评估结果的准确性。
第一方面,本发明实施例提供一种网络语音质量评估方法,包括:
获取网络语音的数据包,所述网络语音的数据包包括语音序列;
对所述数据包进行解析,得到解析结果;
根据所述解析结果确定所述数据包的帧内容特性,所述帧内容特性包括静音帧和语音帧;
根据确定的帧内容特性对所述语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件;
根据所述丢帧事件提取非语音参数,所述非语音参数包括位置参数和离散分布参数;
根据所述非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量;
根据所述每条语句的语音质量评估所述语音序列的语音质量。
在第一种可能的实施方式中,结合第一方面,所述对所述数据包进行解析,得到解析结果,包括:
对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载;
所述根据所述解析结果确定所述数据包的帧内容特性,包括:根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,分别根据所述语音序列的时长、语音序列的比特数和语音负载确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,根据所述前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定所述丢失帧部分的帧内容特性。
在第二种可能的实施方式中,结合第一方面的第一种可能的实施方式,根据所述语音序列的时长、语音序列的比特数和语音负载确定未丢失帧的帧内容特性,包括:
获取未丢失帧的实际有效载荷长度;
根据所述语音负载、语音序列的比特数和语音序列的时长确定码率;
若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度一致,则确定所述未丢失帧为语音帧;
若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度不一致,则确定所述未丢失帧为静音帧。
在第三种可能的实施方式中,结合第一方面的第二种可能的实施方式,所述根据所述前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定所述丢失帧部分的帧内容特性,包括:
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示所述后一个相邻的未丢失帧为第一个语音帧,则确定所述丢失帧部分为静音帧,否则,确定所述丢失帧部分为语音帧。
在第四种可能的实施方式中,结合第一方面的第三种可能的实施方式,所述语音帧包括关键语音帧和非关键语音帧,则在所述确定所述丢失帧部分为语音帧包括:
在确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为语音帧时,确定所述丢失帧部分为关键语音帧;
在确定所述前一个相邻的未丢失帧为语音帧,以及所述后一个相邻的未丢失帧为静音帧时,确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;
在确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为语音帧时,确定所述丢失帧部分的前一半部分为非关键语音帧,所述丢失帧部分的后一半部分为关键语音帧。
在第五种可能的实施方式中,结合第一方面,所述对所述数据包进行解析,得到解析结果,包括:
对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序 列的时长、语音序列的比特数、丢帧位置和语音负载;
根据所述语音负载进行自适应多速率AMR解码,得到AMR解码后语音信号;
根据所述语音序列的时长和语音序列的比特数计算所述AMR解码后语音信号中每一帧的帧能量和平均帧能量;
所述根据所述解析结果确定所述数据包的帧内容特性,包括:根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,根据计算出的帧能量和平均帧能量确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性。
在第六种可能的实施方式中,结合第一方面的第五种可能的实施方式,根据计算出的帧能量和平均帧能量确定未丢失帧的帧内容特性,包括:
若所述未丢失帧的帧能量小于等于0,则确定所述未丢失帧为静音帧;
若所述未丢失帧的帧能量大于0小于平均帧能量,则确定所述未丢失帧为非关键语音帧;
若所述未丢失帧的帧能量大于平均帧能量,则确定所述未丢失帧为关键语音帧。
在第七种可能的实施方式中,结合第一方面的第六种可能的实施方式,所述根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性,包括:
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,则确定所述丢失帧部分为静音帧;
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为关键语音帧,则确定所述丢失帧部分为关键语音帧;
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为非关键语音帧,则确定所述丢失帧部分为非关键语音帧;
若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;
若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为关键语音帧,则确定所述丢失帧部分的前一半部分为非关键语音帧,所述丢失帧部分的后一半部分为关键语音帧;
若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为关键语音帧;
若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻的未丢失帧为关键语音帧,则确定所述丢失帧部分为关键语音帧;
若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分为非关键语音帧;
若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为非关键语音帧。
在第八种可能的实施方式中,结合第一方面的第三、四或七种可能的实施方式,所述根据确定的帧内容特性对所述语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件,包括:
确定静音帧连续出现的帧数超过预置次数时,将所述静音帧之前的语音序列划分为语句;
确定所述语句中相邻两次丢帧部分的距离小于等于预置距离时,将所述相邻两次丢帧部分确定为一次丢帧事件;
确定所述语句中相邻两次丢帧部分的距离大于预置距离时,将所述相邻两次丢帧部分确定为两次丢帧事件。
在第九种可能的实施方式中,结合第一方面的第三、四或七种可能的实施方式,所述根据所述非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量,包括:
根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数;
根据所述丢失的语音帧总数计算语句的语音质量。
在第十种可能的实施方式中,结合第一方面的第九种可能的实施方式,所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则所述根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数,包 括:
确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;或者,
确定离散丢帧时,根据所述语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数。
在第十一种可能的实施方式中,结合第一方面的第九种可能的实施方式,所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度,则所述根据所述非语音参数将所述丢帧事件中不同位置下的丢失帧和不同离散分布下的丢失帧映射成丢失的语音帧总数,包括:
确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
确定离散丢帧时,根据所述平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数。
第二方面,本发明实施例还提供一种网络语音质量评估装置,包括获取单元、解析单元、确定单元、划分单元、提取单元和评估单元,其中:
获取单元,用于获取网络语音的数据包,所述网络语音的数据包包括语音序列;
解析单元,用于对获取单元获取到的数据包进行解析,得到解析结果;
确定单元,用于根据解析单元得到的解析结果确定所述数据包的帧内容特 性,所述帧内容特性包括静音帧和语音帧;
划分单元,用于根据确定单元确定的帧内容特性对所述语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件;
提取单元,用于根据划分单元划分的丢帧事件提取非语音参数,所述非语音参数包括位置参数和离散分布参数;
评估单元,用于根据提取单元提取的非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量,根据所述每条语句的语音质量评估所述语音序列的语音质量。
在第一种可能的实施方式中,结合第二方面,其中:
所述解析单元,具体用于对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载;
所述确定单元,具体用于根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,分别根据所述语音序列的时长、语音序列的比特数和语音负载确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,以及确定后一个相邻的未丢失帧的标记,根据所述前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定所述丢失帧部分的帧内容特性。
在第二种可能的实施方式中,结合第二方面的第一种可能的实施方式,其中:
所述确定单元,具体用于获取未丢失帧的实际有效载荷长度;根据所述语音负载、语音序列的比特数和语音序列的时长确定码率;若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度一致,则确定所述未丢失帧为语音帧;若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度不一致,则确定所述未丢失帧为静音帧。
在第三种可能的实施方式中,结合第二方面的第二种可能的实施方式,其中:
所述确定单元,具体用于若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示所述后一个相邻的未丢失帧为第一个语音帧,则确定所述丢失帧部分为静音帧,否则,确定所述丢失帧部分为语音帧。
在第四种可能的实施方式中,结合第二方面的第三种可能的实施方式,其中:所述语音帧包括关键语音帧和非关键语音帧,则:
所述确定单元,具体用于在确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为语音帧时,确定所述丢失帧部分为关键语音帧;在确定所述前一个相邻的未丢失帧为语音帧,以及所述后一个相邻的未丢失帧为静音帧时,确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;在确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为语音帧时,确定所述丢失帧部分的前一半部分为非关键语音帧,所述丢失帧部分的后一半部分为关键语音帧。
在第五种可能的实施方式中,结合第二方面,其中:
所述解析单元,具体用于对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载;根据所述语音负载进行自适应多速率AMR解码,得到AMR解码后语音信号;根据所述语音序列的时长和语音序列的比特数计算所述AMR解码后语音信号中每一帧的帧能量和平均帧能量;
所述确定单元,具体用于根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,根据计算出的帧能量和平均帧能量确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性。
在第六种可能的实施方式中,结合第二方面的第五种可能的实施方式,其中:
所述确定单元,具体用于若所述未丢失帧的帧能量小于等于0,则确定所述未丢失帧为静音帧;若所述未丢失帧的帧能量大于0小于平均帧能量,则确定所述未丢失帧为非关键语音帧;若所述未丢失帧的帧能量大于平均帧能量,则确定所述未丢失帧为关键语音帧。
在第七种可能的实施方式中,结合第二方面的第六种可能的实施方式,其中:所述确定单元,具体用于:
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,则确定所述丢失帧部分为静音帧;
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为关键语音帧,则确定所述丢失帧部分为关键语音帧;
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为非关键语音帧,则确定所述丢失帧部分为非关键语音帧;
若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;
若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为关键语音帧,则确定所述丢失帧部分的前一半部分为非关键语音帧,所述丢失帧部分的后一半部分为关键语音帧;
若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为关键语音帧;
若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻的未丢失帧为关键语音帧,则确定所述丢失帧部分为关键语音帧;
若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分为非关键语音帧;
若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为非关键语音帧。
在第八种可能的实施方式中,结合第二方面的第三、四或七种可能的实施方式,其中:
所述划分单元,具体用于确定静音帧连续出现的帧数超过预置次数时,将所述静音帧之前的语音序列划分为语句;确定所述语句中相邻两次丢帧部分的距离小于等于预置距离时,将所述相邻两次丢帧部分确定为一次丢帧事件;确定所述语句中相邻两次丢帧部分的距离大于预置距离时,将所述相邻两次丢帧部分确定为两次丢帧事件。
在第九种可能的实施方式中,结合第二方面的第三、四或七种可能的实施方式,其中:
所述评估单元,具体用于根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数;根据所述丢失的语音帧总数计算语句的语音质量。
在第十种可能的实施方式中,结合第二方面的第九种可能的实施方式,其中:所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度;则所述评估单元,具体用于:
确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;或者,
确定离散丢帧时,根据所述语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数。
在第十一种可能的实施方式中,结合第二方面的第九种可能的实施方式,其中:所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度,则所述评估单元,具体用于:
确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
确定离散丢帧时,根据所述平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数。
第三方面,本发明实施例还提供一种通信系统,包括本发明实施例提供的任一种网络语音质量评估装置。
本发明实施例采用对获取到的网络语音的数据包进行解析,并根据解析结果确定该数据包的帧内容特性,比如确定是静音帧和语音帧,然后根据确定的帧内容特性对语音序列进行语句划分,并将语句划分为多个丢帧事件,在根据丢帧事件提取非语音参数(包括位置参数和离散分布参数)后,根据该非语音 参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,最后,根据每条语句的语音质量评估整个语音序列的语音质量;由于在该方案中,可以通过对语音序列进行语句划分和丢帧事件划分,使得单个丢帧事件中的丢帧模式相对比较简单,从而易于研究每个丢帧事件所带来的失真影响;而且,由于该方案在评估网络语音质量的过程中,将帧内容特性(比如确定静音帧还是语音帧)和丢帧位置也作为考虑的因素,因此,相对于现有技术只根据其平均失真情况来衡量网络语音质量的方案而言,可以有效地提高网络语音质量评估的精度;也就是说,采用该方案,可以大大地提高预测精度,从而提高评估结果的准确性。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1a是本发明实施例提供的网络语音质量评估方法的流程图;
图1b是本发明实施例提供的网络语音质量评估方法中语音序列的划分示意图;
图1c是本发明实施例提供的网络语音质量评估方法中一次丢帧事件的分析示意图;
图2a是本发明实施例提供的网络语音质量评估装置的结构示意图;
图2b是本发明实施例中单词的音区示例图;
图2c是本发明实施例提供的网络语音质量评估方法的另一流程图;
图3是本发明实施例提供的网络语音质量评估方法的又一流程图;
图4是本发明实施例提供的网络语音质量评估装置的结构示意图;
图5是本发明实施例提供的服务器的结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是 全部的实施例。基于本发明中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
本发明实施例提供一种网络语音质量评估方法、装置和系统。以下分别进行详细说明。
实施例一、
本实施例将从网络语音质量评估装置的角度进行描述,该网络语音质量评估装置具体可以集成在服务器等网络侧设备中。
一种网络语音质量评估方法,包括:获取网络语音的数据包,其中,该网络语音的数据包包括语音序列;对该数据包进行解析,得到解析结果;根据该解析结果确定该数据包的帧内容特性,根据确定的帧内容特性对该语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件;根据丢帧事件提取非语音参数;根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量;根据该每条语句的语音质量评估语音序列的语音质量。
如图1a所示,该网络语音质量评估方法的流程具体可以如下:
101、获取网络语音的数据包。
其中,该网络语音的数据包可以包括包头和语音载荷,其中,包头可以包括实时传输协议(RTP,Real-time Transport Protocol)头、用户数据包协议(UDP,User Datagram Protocol)头和IP头等,语音载荷可以包括语音序列等。
102、对获取到的数据包进行解析,得到解析结果。
其中,根据不同的网络语音评估模型,其对数据包的解析方法也有所不同,比如,以包层模型和比特流层模型为例,具体可以如下:
(1)包层模型;
具体可以对数据包的包头进行解析,得到解析结果。其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等。例如,以第i个数据包为例,则上述参数的获取方法具体可以如下:
第i个数据包包含的语音序列的时长Durationi
Durationi=Timestampi+1-Timestampi
其中,Timestampi为第i个数据包的时间戳,Timestampi+1为第i+1个数据包的时间戳,可以从数据包的RTP头中读取。
第i个数据包包含的语音序列的比特数Bi
Bi=LIPi-HIPi-HUDPi-HRTPi
其中,LIPi为第i个数据包的比特数,可以直接由IP头得到;HIPi为第i个数据包的IP协议头长度,HUDPi为第i个数据包的UDP头长度,HRTPi为第i个IP数据包的RTP协议头长度。
记录语音负载和该第i个数据包的语音时长Durationmax,其中,语音负载指的是数据包负载最大时的RTP负载比特数,记为Bmax。一般认为该第i个数据包是非静音,则该第i个数据包的非静音的码率为:
Figure PCTCN2014089401-appb-000001
此外,RTP头中的序列号域标示了数据包的顺序,根据每个数据包的RTP序列号就可以确定丢失帧的位置(就丢帧位置)和数量。
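As an illustration of the packet-layer parsing described above, a minimal Python sketch follows that derives the per-packet duration Duration_i, payload bit count B_i, the non-silence bit rate and the lost-frame positions from RTP/UDP/IP header fields. The Packet structure, its field names and the bits-per-millisecond rate convention are assumptions made for this sketch only; the exact rate formula is published as an equation image in the original text.

```python
from collections import namedtuple

# Hypothetical flattened view of one packet's header fields (all sizes in bits).
Packet = namedtuple("Packet", "seq timestamp ip_len ip_hdr udp_hdr rtp_hdr")

def parse_stream(packets):
    """Packet-layer parsing: per-packet speech duration Duration_i, payload
    bit count B_i, the non-silence bit rate R, and the lost RTP sequence
    numbers derived from gaps in the sequence-number field."""
    packets = sorted(packets, key=lambda p: p.seq)
    payload_bits = {p.seq: p.ip_len - p.ip_hdr - p.udp_hdr - p.rtp_hdr
                    for p in packets}                      # B_i
    durations = {cur.seq: nxt.timestamp - cur.timestamp    # Duration_i
                 for cur, nxt in zip(packets, packets[1:])}

    # Speech load: payload bits of the packet with the largest payload and
    # that packet's duration.  Treating bits/ms as kb/s is an assumption.
    max_seq = max(durations, key=lambda s: payload_bits[s])
    bitrate_kbps = payload_bits[max_seq] / durations[max_seq]

    # Gaps in the RTP sequence numbers give the lost-frame positions.
    present = {p.seq for p in packets}
    lost = [s for s in range(packets[0].seq, packets[-1].seq) if s not in present]
    return durations, payload_bits, bitrate_kbps, lost
```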
(2)比特流层模型;
与包层模型不同的是,比特流程模型除了需要对数据包的包头进行解析之外,还需要对语音负载部分也进行解析,如下:
A、对数据包的包头进行解析,得到解析结果,其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等信息,这些信息的具体获取与包层模型相同,在此不再赘述。
B、根据语音负载进行自适应多速率(AMR,Adaptive Multi-Rate)解码,得到AMR解码后语音信号。
C、根据语音序列的时长和语音序列的比特数计算该AMR解码后语音信号中每一帧的帧能量和平均帧能量。
其中,每一帧的帧能量可以根据人耳的听觉特性和主观体验进行量化而得,若该帧能量大于0,则为语音帧,并依此计算语音帧的平均能力,得到平均帧能量。
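A minimal sketch of this frame-energy step, assuming the AMR-decoded signal is available as 8 kHz PCM with 20 ms (160-sample) frames. The patent quantises energy according to auditory perception and subjective experience, so the plain log-energy measure and the silence floor used below are stand-in assumptions.

```python
import numpy as np

def frame_energies(pcm, frame_len=160):
    """Split an AMR-decoded PCM signal into frames and return a per-frame
    energy value plus the average energy over speech frames (energy > 0)."""
    n_frames = len(pcm) // frame_len
    frames = np.reshape(pcm[:n_frames * frame_len], (n_frames, frame_len)).astype(np.float64)
    power = np.mean(frames ** 2, axis=1)

    # Assumed perceptual quantisation: energies at or below a fixed floor are
    # treated as 0 (silence candidates), the rest become a log-energy value.
    floor = 1e-4
    energy = np.zeros_like(power)
    mask = power > floor
    energy[mask] = 10.0 * np.log10(power[mask] / floor)

    avg_speech_energy = energy[mask].mean() if np.any(mask) else 0.0
    return energy, avg_speech_energy
```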
103、根据得到的解析结果确定该数据包的帧内容特性。
其中,帧内容特性可以包括静音帧和语音帧,即该步骤(步骤103)可以对该数据包的帧内容进行检测,以确定是语音帧还是静音帧,其中,语音帧还可以进一步分为关键语音帧和非关键语音帧。
对于不同的网络语音评估模型,由于其获取到的解析结果不同,因此,其确定数据包的帧内容特性的方法也有所不同,比如,还是以包层模型和比特流层模型为例,具体可以如下:
(1)包层模型;
A、根据该丢帧位置在该数据包中确定当前需要检测的丢失帧部分。
其中,一个丢失帧部分可以包括连续的多个丢失帧。
B、分别根据该语音序列的时长、语音序列的比特数和语音负载确定该丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,以及确定该后一个相邻的未丢失帧的标记(mark),该标记为帧的序号。
其中,根据所述语音序列的时长、语音序列的比特数和语音负载确定未丢失帧的帧内容特性,具体可以包括:
获取未丢失帧的实际有效载荷长度;
根据该语音负载、语音序列的比特数和语音序列的时长确定码率(即编码速率);
若该码率所对应的标准有效载荷长度与该实际有效载荷长度一致,则确定该未丢失帧为语音帧;
若该码率所对应的标准有效载荷长度与实际有效载荷长度不一致,则确定该未丢失帧为静音帧。
其中,具体可以设置一张对应表,用于记录码率和标准有效载荷长度的对应关系,这样,根据码率查找该对应表便可得到相应的标准有效载荷长度。例如,具体可参见表一。
表一:
编码模式 码率(kb/s) 标准有效载荷长度(Byte)
AMR475 4.75 14
AMR515 5.15 15
AMR59 5.9 17
AMR67 6.7 19
AMR74 7.4 21
AMR795 7.95 22
AMR102 10.2 28
AMR122 12.2 33
…… …… ……
根据表一可知,在AMR475编码模式下,码率4.75kb/s对应的标准有效载荷长度为14字节(Byte),因此,如果未丢失帧的实际有效载荷为14字节,则为语音帧,否则,若未丢失帧的实际有效载荷不是14字节,则为静音帧,以此类推,等等。
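A sketch of this payload-length check based on Table 1; the table contents are taken from the text, while returning "silence" for an unlisted coding rate is an assumption.

```python
# Standard RTP payload length (bytes) per AMR mode, taken from Table 1.
STANDARD_PAYLOAD = {4.75: 14, 5.15: 15, 5.9: 17, 6.7: 19,
                    7.4: 21, 7.95: 22, 10.2: 28, 12.2: 33}

def classify_received_frame(bitrate_kbps, actual_payload_bytes):
    """Packet-layer content check for a frame that was not lost: it is a
    speech frame only if its actual payload matches the standard payload
    length of the detected coding rate, otherwise it is a silence frame."""
    standard = STANDARD_PAYLOAD.get(bitrate_kbps)
    if standard is not None and actual_payload_bytes == standard:
        return "speech"
    return "silence"

# Example: at 4.75 kb/s a 14-byte payload is speech, anything else silence.
assert classify_received_frame(4.75, 14) == "speech"
assert classify_received_frame(4.75, 7) == "silence"
```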
C、根据该前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定该丢失帧部分的帧内容特性,例如,具体可以如下:
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示所述后一个相邻的未丢失帧为第一个语音帧(比如标记为1),则确定所述丢失帧部分为静音帧,否则,均确定该丢失帧部分为语音帧。
此外,为了进一步提高预测精度,还可以将语音帧划分为关键语音帧和非关键语音帧,以便后续可以针对这些关键语音帧和非关键语音帧作出不同的处理。其中,关键语音帧指的是对语音质量的影响较大的帧,而关键语音帧指的是对语音质量的影响较小的帧。
若将语音帧划分为关键语音帧和非关键语音帧,则在步骤“确定该丢失帧部分为语音帧”具体可以包括如下情况:
a、在确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为语音帧 时,确定该丢失帧部分为关键语音帧;
b、在确定该前一个相邻的未丢失帧为语音帧,以及该后一个相邻的未丢失帧为静音帧时,确定该丢失帧部分的前一半部分为关键语音帧,以及确定该丢失帧部分的后一半部分为非关键语音帧;
c、在确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为语音帧时,确定该丢失帧部分的前一半部分为非关键语音帧,以及确定丢失帧部分的后一半部分为关键语音帧。
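The packet-layer decision rules above can be summarised in a short sketch. The even split of an odd-length lost run between its "first half" and "second half" is an assumption, since the text does not say which half receives the extra frame.

```python
def classify_lost_part(prev_type, next_type, next_is_first_speech, n_lost):
    """Packet-layer rules for labelling a run of n_lost consecutive lost
    frames from its two neighbouring received frames.

    prev_type / next_type are 'speech' or 'silence'; next_is_first_speech is
    True when the mark of the following received frame indicates it is the
    first speech frame after a silence period.
    """
    if (prev_type == "silence" and next_type == "silence") or next_is_first_speech:
        return ["silence"] * n_lost
    half = n_lost // 2                      # split point (assumption for odd runs)
    if prev_type == "speech" and next_type == "speech":
        return ["key"] * n_lost
    if prev_type == "speech" and next_type == "silence":
        return ["key"] * half + ["non-key"] * (n_lost - half)
    # previous neighbour silence, next neighbour speech
    return ["non-key"] * half + ["key"] * (n_lost - half)
```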
(2)比特流层模型;
比特流层模型的帧内容检测较包层模型更为精细,比如,其中语音帧可以包括关键语音帧和非关键语音帧,等等。
对于比特流层模型来说,其“根据得到的解析结果确定该数据包的帧内容特性”的步骤具体可以如下:
A、根据该丢帧位置在所述数据包中确定当前需要检测的丢失帧部分。
其中,一个丢失帧部分可以包括连续的多个丢失帧。
B、根据计算出的帧能量和平均帧能量确定该丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性。
其中,根据计算出的帧能量和平均帧能量确定未丢失帧的帧内容特性,包括:
若该未丢失帧的帧能量小于等于0,则确定该未丢失帧为静音帧;
若该未丢失帧的帧能量大于0小于平均帧能量,则确定该未丢失帧为非关键语音帧;
若该未丢失帧的帧能量大于平均帧能量,则确定该未丢失帧为关键语音帧。
C、根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性,具体可以如下:
a、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,则确定该丢失帧部分为静音帧;
b、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为关键语音帧,则确定该丢失帧部分为关键语音帧;
c、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为非关键语音帧,则确定该丢失帧部分为非关键语音帧;
d、若确定该前一个相邻的未丢失帧为关键语音帧,以及该后一个相邻的未丢失帧为静音帧,则确定该丢失帧部分的前一半部分为关键语音帧,该丢失帧部分的后一半部分为非关键语音帧;
e、若确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为关键语音帧,则确定该丢失帧部分的前一半部分为非关键语音帧,该丢失帧部分的后一半部分为关键语音帧;
f、若确定该前一个相邻的未丢失帧为关键语音帧,以及该后一个相邻的未丢失帧为非关键语音帧,则确定该丢失帧部分为关键语音帧;
g、若确定该前一个相邻的未丢失帧为非关键语音帧,以及该后一个相邻的未丢失帧为关键语音帧,则确定该丢失帧部分为关键语音帧;
h、若确定该前一个相邻的未丢失帧为非关键语音帧,以及该后一个相邻的未丢失帧为静音帧,则确定该丢失帧部分为非关键语音帧;
i、若确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为非关键语音帧,则确定该丢失帧部分为非关键语音帧。
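A compact sketch of the bitstream-layer rules a to i; as above, how an odd-length lost run is split between the two halves is an assumption.

```python
def classify_lost_part_bitstream(prev_type, next_type, n_lost):
    """Bitstream-layer rules (a)-(i) for a run of n_lost lost frames, using
    the neighbouring received frames classified as 'silence', 'non-key' or
    'key' from their frame energies."""
    half = n_lost // 2                              # split point for the mixed cases
    if prev_type == next_type:                      # rules a, b, c
        return [prev_type] * n_lost
    if {prev_type, next_type} == {"key", "silence"}:        # rules d, e
        first, second = ("key", "non-key") if prev_type == "key" else ("non-key", "key")
        return [first] * half + [second] * (n_lost - half)
    if {prev_type, next_type} == {"key", "non-key"}:        # rules f, g
        return ["key"] * n_lost
    return ["non-key"] * n_lost                     # rules h, i

# Example for rule d: key frame before, silence frame after a 5-frame loss.
print(classify_lost_part_bitstream("key", "silence", 5))
```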
104、根据确定的帧内容特性对该语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件,其中,语音序列、语句和丢帧事件三者之间的关系具体可参见图1b。
例如,具体可以采用如下的方法进行语句和丢帧事件的划分:
(1)确定静音帧连续出现的帧数超过预置次数时,将该静音帧之前的语音序列划分为语句;
即出现连续Ns帧以上的静音帧时,将该静音帧之前的语音序列划分为语句,其中,该Ns可以根据实际应用的需求进行设置,比如,可以设置Ns为6等。
(2)确定该语句中相邻两次丢帧部分的距离小于等于预置距离时,将该相邻两次丢帧部分确定为一次丢帧事件。
(3)确定该语句中相邻两次丢帧部分的距离大于预置距离时,将该相邻两次丢帧部分确定为两次丢帧事件。
其中,预置次数和预置距离可以根据实际应用的需求进行设置,比如,预 置次数可以设置为6,预置距离可以设置为10,等等。
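A sketch of the statement and frame-loss-event division with the example presets Ns = 6 and distance 10. Whether the terminating silence run itself belongs to the statement it closes is not specified in the text and is handled here by a simple convention (assumption).

```python
def split_into_statements_and_events(frame_types, lost_flags, n_s=6, max_gap=10):
    """Cut a labelled frame sequence into statements and frame-loss events.

    frame_types[i] is 'silence', 'key' or 'non-key'; lost_flags[i] is True
    for lost frames.  More than n_s consecutive silence frames closes the
    current statement; inside a statement, two loss runs separated by at
    most max_gap frames form one frame-loss event.
    """
    statements, current, silence_run = [], [], 0
    for i, kind in enumerate(frame_types):
        current.append(i)
        silence_run = silence_run + 1 if kind == "silence" else 0
        if silence_run > n_s:
            statements.append(current)
            current, silence_run = [], 0
    if current:
        statements.append(current)

    events_per_statement = []
    for stmt in statements:
        events, last_lost = [], None
        for i in stmt:
            if not lost_flags[i]:
                continue
            if last_lost is not None and i - last_lost <= max_gap:
                events[-1].append(i)          # same frame-loss event
            else:
                events.append([i])            # a new frame-loss event
            last_lost = i
        events_per_statement.append(events)
    return statements, events_per_statement
```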
105、根据该丢帧事件提取非语音参数。
其中,该非语音参数可以包括位置参数和离散分布参数,根据不同的后续语音质量评估方法的不同,此时提取的非语音参数也会有所不同,比如,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度等;或者,该非语音参数也可以非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度等,以下对这些参数进行简略说明:
非关键语音帧和关键语音帧之间的距离Lj:根据人耳的听觉感知特性,丢失的非关键语音帧距离相邻关键语音帧越远,引起的失真越小。
语音帧丢帧次数N1:指的是丢帧事件中丢失语音帧的次数,如图1c所示的丢帧事件,则其丢失语音帧的次数N1=4。
语音帧的一次丢失长度N0k:指的是每次发生丢帧时,连续丢失的语音帧数,如图1c所示的丢帧事件,N01=2,N02=1,N03=2,N04=1。
损伤长度Lk:指的是相邻两次丢帧事件未丢失语音帧的数目,如图1c所示的丢帧事件,L1=2,L2=3,L3=2。
语音帧的平均丢失长度N0
Figure PCTCN2014089401-appb-000002
如图1c所示的丢帧事件,其中,N0=6/4=1.5。
损伤长度L:
Figure PCTCN2014089401-appb-000003
如图1c所示的丢帧事件,其中,L=(L1+L2+L3)/N1=7/3。
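The parameters of one frame-loss event can be extracted as in the following sketch, which reproduces the Fig. 1c example (N1 = 4, N0 = 1.5, average impairment length 7/3); averaging the impairment lengths over the number of gaps between loss runs is inferred from that worked example.

```python
def event_parameters(lost_flags):
    """Extract the discrete-distribution parameters of one frame-loss event.

    lost_flags is the loss pattern of the speech frames spanned by the event
    (it starts and ends with a lost frame by construction), e.g. the Fig. 1c
    event [1,1,0,0,1,0,0,0,1,1,0,0,1]: N1 = 4 loss runs of lengths
    N0k = [2,1,2,1] separated by impairment lengths Lk = [2,3,2].
    """
    n0k, lk, i = [], [], 0
    while i < len(lost_flags):
        j = i
        while j < len(lost_flags) and lost_flags[j] == lost_flags[i]:
            j += 1
        (n0k if lost_flags[i] else lk).append(j - i)   # run of lost / received frames
        i = j
    n1 = len(n0k)                        # number of loss occurrences N1
    n0 = sum(n0k) / n1                   # average single-loss length N0 (6/4 = 1.5)
    # Average impairment length, averaged over the gaps between loss runs
    # (this denominator is inferred from the 7/3 example, an assumption).
    l_avg = sum(lk) / len(lk) if lk else 0.0
    return n1, n0k, lk, n0, l_avg

n1, n0k, lk, n0, l_avg = event_parameters([1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1])
assert (n1, n0k, lk) == (4, [2, 1, 2, 1], [2, 3, 2])
assert abs(n0 - 1.5) < 1e-9 and abs(l_avg - 7 / 3) < 1e-9
```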
106、根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量,例如,具体可以如下:
根据该非语音参数按照预置的语音质量评估模型对该丢帧事件进行失真映射,得到丢失的语音帧总数,根据该丢失的语音帧总数计算语句的语音质量。
其中,步骤“根据该非语音参数按照预置的语音质量评估模型对该丢帧事件进行失真映射,得到丢失的语音帧总数”具体可以采用如下任意一种方式来实现:
(1)第一种方式;
非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则此时,步骤“根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数”可以包括:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数,如下:
将非关键语音帧映射成丢失的关键语音帧数,用公式表示可以为:
FLNi,j=f(Lj);
比如,具体可以为:FLNi,j=exp(-0.033*Lj);
其中,FLNi,j为第i个丢帧事件中第j个非关键语音帧映射的关键语音帧数,Lj为第j个非关键语音帧与关键语音帧之间的距离。
丢失的语音帧总数可以为:
Figure PCTCN2014089401-appb-000004
其中,FLNi为第i个丢帧事件映射得到的丢失的语音帧总数(即关键语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
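A sketch of the consecutive-loss mapping: the per-frame mapping FLNi,j = exp(-0.033*Lj) is stated explicitly in the text, while adding the mapped values to the actually lost key frames ni to obtain FLNi follows the surrounding description of the total and is an assumed reading of the equation image.

```python
import math

def map_consecutive_loss(actual_lost_key, nonkey_distances):
    """Distortion mapping for a consecutive-loss event.

    actual_lost_key  : n_i, the number of key speech frames actually lost.
    nonkey_distances : L_j for every lost non-key frame, i.e. its distance
                       to the adjacent key speech frames.
    """
    mapped = [math.exp(-0.033 * lj) for lj in nonkey_distances]   # FLN_{i,j}
    return actual_lost_key + sum(mapped)                          # assumed total FLN_i

# A lost non-key frame far away from the key region contributes little:
print(map_consecutive_loss(3, [1, 5, 20]))   # roughly 3 + 0.97 + 0.85 + 0.52
```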
B、离散丢帧的情况;
确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧 次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
FLNi,k=f(N1,N0k,Lk);
比如,具体可以为:
Figure PCTCN2014089401-appb-000005
其中,N0k为语音帧一次丢失长度,A0k为语音帧丢帧次数和一次丢失长度对单个未丢失语音帧的损伤影响,Lk为丢帧事件第k次出现的损伤长度,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中单帧映射的语音帧数。其中,参数a1,b1,c1,a2,b2,c2,a3,b3和c3可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000006
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
(2)第二种方式;
非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则所述根据该非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数,包括:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将该丢帧事件映射成丢失的语音帧总数。
与第一种方式中的连续丢帧的情况的处理方式相同,详见前面的描述,在此不再赘述。
B、离散丢帧的情况;
确定离散丢帧时,根据该语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的语音帧数将该丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
FLNi,k=f(N0k,Lk);
比如,具体可以为:
Figure PCTCN2014089401-appb-000007
其中,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中映射的语音帧数,A0k为丢帧事件中一次丢失长度对单个未丢失语音帧的损伤影响,参数a1,b1,c1,a2,b2和c2可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000008
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
(3)第三种方式;
与第一和第二种方式不同,第三种方式不再计算单帧的失真,而是直接计算整个丢帧事件的失真。
其中,非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度,则所述根据所述非语音参数将该丢帧事件中不同位 置下的丢失帧和不同离散分布下的丢失帧映射成丢失的语音帧总数,具体可以包括:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将该丢帧事件映射成丢失的语音帧总数。
与第一种方式中的连续丢帧的情况的处理方式相同,详见前面的描述,在此不再赘述。
B、离散丢帧的情况;
确定离散丢帧时,根据该平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数,用公式表示可以为:
Figure PCTCN2014089401-appb-000009
其中,FLNi为将该丢帧事件映射成丢失的语音帧总数,N0为语音帧的平均丢失长度,L为损伤长度;其中,参数a1,b1,c1,a2,b2,a3和b3可以通过训练得到。
在得到丢失的语音帧总数之后,就可以根据该丢失的语音帧总数计算语句的语音质量,如下:
一个语句的丢失语音帧数为:
FLN=f(FLN1,FLN2,...,FLNM);
其中,M为每个语句中丢帧事件个数,FLNi为每个丢帧事件的映射得到的丢失的语音帧总数。
由于不考虑遭受数据包丢失的语句质量MOS0为:
MOS0=f(R);
因此,考虑遭受数据包丢失的语句质量Qn可以为:
Qn=f(MOS0,FLN);
其中,如果通过主观实验建立R与MOS0的匹配数据表格,则质量评估时可以直接查表得到MOS0
107、根据每条语句的语音质量评估该语音序列的语音质量,即将该语音序列中的每条语句的语音质量进行综合,得到该语音序列的语音质量Q,如下:
Q=f(Q1,Q2,...,QN);
其中,Qn为考虑遭受数据包丢失的语句质量,N为语音序列中语句的个数。
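The concrete forms of Qn = f(MOS0, FLN) and Q = f(Q1, ..., QN) are published as equation images with trained parameters; the sketch below therefore uses illustrative stand-in forms (a saturating distortion term and a plain mean) that are assumptions, not the patent's actual model.

```python
def sentence_quality(mos0, fln, a=0.06, b=1.0):
    """Illustrative sentence-quality mapping Qn = f(MOS0, FLN).

    The assumed distortion term D grows with the mapped number of lost
    speech frames and shrinks the headroom above a MOS of 1; a and b stand
    in for the trained model parameters mentioned in the text.
    """
    d = (a * fln) ** b / (1.0 + (a * fln) ** b)      # assumed distortion in [0, 1)
    return 1.0 + (mos0 - 1.0) * (1.0 - d)

def sequence_quality(sentence_scores):
    """Illustrative aggregation Q = f(Q1, ..., QN); a plain mean is used as
    a stand-in for the unspecified aggregation function."""
    return sum(sentence_scores) / len(sentence_scores)

print(sentence_quality(3.465, 0))    # no loss -> the compression-only MOS0
print(sentence_quality(3.465, 30))   # heavy loss -> the score drops toward 1
```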
由上可知,本实施例采用对获取到的网络语音的数据包进行解析,并根据解析结果确定该数据包的帧内容特性,比如确定是静音帧和语音帧,然后根据确定的帧内容特性对语音序列进行语句划分,并将语句划分为多个丢帧事件,在根据丢帧事件提取非语音参数(包括位置参数和离散分布参数)后,根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,最后,根据每条语句的语音质量评估整个语音序列的语音质量;由于在该方案中,可以通过对语音序列进行语句划分和丢帧事件划分,使得单个丢帧事件中的丢帧模式相对比较简单,从而易于研究每个丢帧事件所带来的失真影响;而且,由于该方案在评估网络语音质量的过程中,将帧内容特性(比如确定静音帧还是语音帧)和丢帧位置也作为考虑的因素,因此,相对于现有技术只根据其平均失真情况来衡量网络语音质量的方案而言,可以有效地提高网络语音质量评估的精度;也就是说,采用该方案,可以大大地提高预测精度,从而提高评估结果的准确性。
根据实施例一所描述的方法,以下将在实施例二、三、四和五中举例作进一步详细说明。
实施例二、
在本实施例中,将以包层模型为例进行说明。
如图2a所示,该网络语音质量评估装置可以包括解析模块、检测模块、划分模块、非语音参数提取模块和语音质量评估模块,其中,各个模块的功能具 体可以如下:
(1)解析模块;
该解析模块,用于获取网络语音的数据包,并对获取到的数据包进行解析,得到解析结果,其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等。
(2)检测模块;
丢帧对语音质量的影响与丢帧的内容密切相关,当丢帧内容为静音时,对语音质量损伤的程度较小;当丢失帧内容为语音时,会导致重要语音信息的丢失,对语音质量的影响较大。因此,在进行语音质量评估时,需要对丢帧内容进行检测。所以,该检测模块,主要用于根据得到的解析结果确定该数据包的帧内容特性,即判断每个数据帧是静音帧还是语音帧。例如,具体可以分析未丢失帧的帧内容特性,然后根据语音信号短时相关性的性质,利用相邻未丢失帧的帧内容特性去判断当前丢失帧的帧内容特性。
此外,丢帧发生在单词/汉字的不同位置,其影响也不相同,如图2b所示,A表示单词的中间区(或叫关键区域),B和C分别表示单词的词首和词尾(统称为非关键区),D表示静音区。根据人耳的听觉感知特性,相邻非关键区的丢帧位置距离A区域越远,引起的失真越小。因此在帧内容检测的基础上,该检测模块还可以对丢失部分的帧进行进一步的判断,以确定当前丢失帧是关键语音帧,还是非关键语音帧。
(3)划分模块;
该划分模块,用于根据确定的帧内容特性对该语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件。
(4)非语音参数提取模块;
该非语音参数提取模块,用于根据该丢帧事件提取非语音参数。其中,该非语音参数可以包括位置参数和离散分布参数。
(5)语音质量评估模块;
该语音质量评估模块,用于根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量,然后根据每条语句的语音质量评估该语音序列的语音质量。
一种网络语音质量评估方法,如图2c所示,具体流程可以如下:
201、解析模块获取网络语音的数据包。
其中,该网络语音的数据包可以包括包头和语音载荷,其中,包头可以包括RTP头、UDP头和IP头等,语音载荷可以包括语音序列等。
202、解析模块对数据包的包头进行解析,得到解析结果。其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等。
例如,以第i个数据包为例,则上述参数的获取方法具体可以如下:
第i个数据包包含的语音序列的时长Durationi
Durationi=Timestampi+1-Timestampi
其中,Timestampi为第i个数据包的时间戳,Timestampi+1为第i+1个数据包的时间戳,可以从数据包的RTP头中读取。
第i个数据包包含的语音序列的比特数Bi
Bi=LIPi-HIPi-HUDPi-HRTPi
其中,LIPi为第i个数据包的比特数,可以直接由IP头得到;HIPi为第i个数据包的IP协议头长度,HUDPi为第i个数据包的UDP协议头长度,HRTPi为第i个IP数据包的RTP协议头长度。
记录语音负载和该第i个数据包的语音时长Durationmax,其中,语音负载指的是数据包负载最大时的RTP负载比特数,记为Bmax。一般认为该第i个数据包是非静音,则该第i个数据包的非静音的码率为:
Figure PCTCN2014089401-appb-000010
此外,RTP头中的序列号域标示了数据包的顺序,根据每个数据包的RTP序列号就可以确定丢失帧的位置(就丢帧位置)和数量。
203、检测模块根据丢帧位置在该数据包中确定当前需要检测的丢失帧部分。
其中,一个丢失帧部分可以包括连续的多个丢失帧。
204、检测模块分别根据该语音序列的时长、语音序列的比特数和语音负载确定该丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的 未丢失帧的帧内容特性,以及确定该后一个相邻的未丢失帧的标记(mark)。
例如,若当前丢失帧为第n帧到第n+m-1帧(即该丢失帧部分为第n帧到第n+m-1帧),前一个相邻的未丢失帧为第n-1帧,后一个相邻的未丢失帧为第n+m帧,则此时,检测模块可以分别根据该语音序列的时长、语音序列的比特数和语音负载确定第n-1帧的帧内容特性和第n+m帧的帧内容特性,以及确定第n+m帧的标记。
其中,帧内容特性可以包括静音帧和语音帧,则根据该语音序列的时长、语音序列的比特数和语音负载确定未丢失帧的帧内容特性,具体可以包括:
获取未丢失帧的实际有效载荷长度;
根据该语音负载、语音序列的比特数和语音序列的时长确定码率(即编码速率);
若该码率所对应的标准有效载荷长度与该实际有效载荷长度一致,则确定该未丢失帧为语音帧;
若该码率所对应的标准有效载荷长度与实际有效载荷长度不一致,则确定该未丢失帧为静音帧。
其中,具体可以设置一张对应表,用于记录码率和标准有效载荷长度的对应关系,这样,根据码率查找该对应表便可得到相应的标准有效载荷长度,例如,具体可以参见表一。
205、检测模块根据该前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定该丢失帧部分的帧内容特性。
例如,若当前丢失帧为第n帧到第n+m-1帧(即该丢失帧部分为第n帧到第n+m-1帧),前一个相邻的未丢失帧为第n-1帧,后一个相邻的未丢失帧为第n+m帧,则此时具体可以如下:
若确定所述第n-1帧和第n+m帧均为静音帧,或第n+m帧的标记指示所述第n+m帧为第一个语音帧(比如标记为1),则确定所述丢失帧部分为静音帧,否则,均确定该丢失帧部分为语音帧。
此外,为了进一步提高预测精度,还可以将语音帧划分为关键语音帧和非关键语音帧,若将语音帧划分为关键语音帧和非关键语音帧,则在步骤“确定 该丢失帧部分为语音帧”具体可以包括如下情况:
a、在确定第n-1帧和第n+m帧均为语音帧时,确定该丢失帧部分为关键语音帧;
b、在确定第n-1帧为语音帧,以及第n+m帧为静音帧时,确定该丢失帧部分的前一半部分为关键语音帧,以及确定该丢失帧部分的后一半部分为非关键语音帧;
c、在确定第n-1帧为静音帧,以及第n+m帧为语音帧时,确定该丢失帧部分的前一半部分为非关键语音帧,以及确定丢失帧部分的后一半部分为关键语音帧。
206、划分单元根据确定的帧内容特性对该语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件。
例如,具体可以采用如下的方法进行语句和丢帧事件的划分:
(1)确定静音帧连续出现的帧数超过预置次数时,将该静音帧之前的语音序列划分为语句;
即出现连续Ns帧以上的静音帧时,将该静音帧之前的语音序列划分为语句,其中,该Ns可以根据实际应用的需求进行设置,比如,可以设置Ns为6等。
(2)确定该语句中相邻两次丢帧部分的距离小于等于预置距离时,将该相邻两次丢帧部分确定为一次丢帧事件。
(3)确定该语句中相邻两次丢帧部分的距离大于预置距离时,将该相邻两次丢帧部分确定为两次丢帧事件。
其中,预置次数和预置距离可以根据实际应用的需求进行设置,比如,预置次数可以设置为6,预置距离可以设置为10,等等。
207、非语音参数提取模块根据该丢帧事件提取非语音参数。
其中,该非语音参数可以包括位置参数和离散分布参数,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度等,如下:
非关键语音帧和关键语音帧之间的距离Lj:根据人耳的听觉感知特性,丢失的非关键语音帧距离相邻关键语音帧越远,引起的失真越小。
语音帧丢帧次数N1:指的是丢帧事件中丢失语音帧的次数。
语音帧的一次丢失长度N0k:指的是每次发生丢帧时,连续丢失的语音帧数。
损伤长度Lk:指的是相邻两次丢帧事件未丢失语音帧的数目。
语音帧的平均丢失长度N0
Figure PCTCN2014089401-appb-000011
损伤长度L:
Figure PCTCN2014089401-appb-000012
208、语音质量评估模块根据得到的非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数,例如,具体可以如下:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数,如下:
将非关键语音帧映射成丢失的关键语音帧数,用公式表示可以为:
FLNi,j=f(Lj);
比如,具体可以为:FLNi,j=exp(-0.033*Lj);
其中,FLNi,j为第i个丢帧事件中第j个非关键语音帧映射的关键语音帧数,Lj为第j个非关键语音帧与关键语音帧之间的距离。
丢失的语音帧总数可以为:
Figure PCTCN2014089401-appb-000013
其中,FLNi为第i个丢帧事件映射得到的丢失的语音帧总数(即关键语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
B、离散丢帧的情况;
确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
FLNi,k=f(N1,N0k,Lk);
比如,具体可以为:
Figure PCTCN2014089401-appb-000014
其中,N0k为语音帧一次丢失长度,A0k为语音帧丢帧次数和一次丢失长度对单个未丢失语音帧的损伤影响,Lk为丢帧事件第k次出现的损伤长度,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中单帧映射的语音帧数。其中,参数a1,b1,c1,a2,b2,c2,a3,b3和c3可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000015
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
209、语音质量评估模块根据该丢失的语音帧总数计算语句的语音质量,如下:
一个语句的丢失语音帧数为:
FLN=f(FLN1,FLN2,...,FLNM);
其中,M为每个语句中丢帧事件个数,FLNi为每个丢帧事件的映射得到的丢失的语音帧总数。比如,该函数具体可以如下:
Figure PCTCN2014089401-appb-000016
由于不考虑遭受数据包丢失的语句质量MOS0(即语句的压缩失真)为:
MOS0=f(R);
因此,考虑遭受数据包丢失的语句质量Qn可以为:
Qn=f(MOS0,FLN);
比如,该函数具体可以如下:
Figure PCTCN2014089401-appb-000017
其中,D为语句失真,MOS0不考虑遭受数据包丢失的语句质量(即语句的压缩失真),Qn为考虑遭受数据包丢失的语句质量,a和b为模型固定参数,a和b可以通过训练得到。
其中,如果通过主观实验建立码率R(即编码速率)与MOS0的匹配数据表格,则质量评估时可以直接查表得到MOS0。例如,具体可参见表二。
表二:
码率(kb/s) MOS0
4.75 3.465
5.15 3.502
5.9 3.563
6.7 3.631
7.4 3.725
7.95 3.836
10.2 3.964
12.2 4.086
…… ……
比如,通过查找表二可知,码率为4.75kb/s所对应的MOS0为3.465,码率为5.15kb/s所对应的MOS0为3.502,以此类推,等等。
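A sketch of the MOS0 lookup using the values of Table 2; the nearest-rate fallback for coding rates that are not tabulated is an assumption.

```python
# Compression-only statement quality MOS0 per coding rate, taken from Table 2.
MOS0_TABLE = {4.75: 3.465, 5.15: 3.502, 5.9: 3.563, 6.7: 3.631,
              7.4: 3.725, 7.95: 3.836, 10.2: 3.964, 12.2: 4.086}

def lookup_mos0(bitrate_kbps):
    """Return MOS0 for the detected coding rate by direct table lookup;
    rates not listed fall back to the nearest tabulated rate (assumption)."""
    if bitrate_kbps in MOS0_TABLE:
        return MOS0_TABLE[bitrate_kbps]
    nearest = min(MOS0_TABLE, key=lambda r: abs(r - bitrate_kbps))
    return MOS0_TABLE[nearest]

print(lookup_mos0(12.2))   # 4.086
```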
210、语音质量评估模块根据每条语句的语音质量评估该语音序列的语音质量,即将该语音序列中的每条语句的语音质量进行综合,得到该语音序列的语音质量Q,如下:
Q=f(Q1,Q2,...,QN);
其中,Qn为考虑遭受数据包丢失的语句质量,N为语音序列中语句的个数。比如,该函数具体可以如下:
Figure PCTCN2014089401-appb-000018
其中,Q为语音序列的语音质量,Qn为考虑遭受数据包丢失的语句质量,N为语音序列中语句的个数。
由上可知,本实施例采用基于包层模型的方式对获取到的网络语音的数据包进行解析,并根据解析结果确定该数据包的帧内容特性,比如确定是静音帧和语音帧,然后根据确定的帧内容特性对语音序列进行语句划分,并将语句划分为多个丢帧事件,在根据丢帧事件提取非语音参数(包括位置参数和离散分布参数)后,根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,最后,根据每条语句的语音质量评估整个语音序列的语音质量;由于在该方案中,可以通过对语音序列进行语句划分和丢帧事件划分,使得单个丢帧事件中的丢帧模式相对比较简单,从而易于研究每个丢帧事件所带来的失真影响;而且,由于该方案在评估网络语音质量的过程中,将帧内容特性(比如确定静音帧还是语音帧)和丢帧位置也作为考虑的因素,因此,相对于现有技术只根据其平均失真情况来衡量网络语音质量的方案而言,可以有效地提高网络语音质量评估的精度;也就是说,采用该方案,可以大大地提高预测精度,从而提高评估结果的准确性。
实施例三、
在本实施例中,将以比特流层模型为例进行说明。
本实施例所采用的网络语音质量评估装置与实施例二相同,详见图2a以及实施例二中的描述。本实施例与实施例二不同的之处,主要在于数据包的解析、以及帧内容特性的检测上,以下将进行详细说明。
一种网络语音质量评估方法,如图3所示,具体流程可以如下:
301、解析模块获取网络语音的数据包。
其中,该网络语音的数据包可以包括包头和语音载荷,其中,包头可以包括RTP头、UDP头和IP头等,语音载荷可以包括语音序列等。
302、解析模块对数据包的包头进行解析,得到解析结果。其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等。
例如,以第i个数据包为例,则上述参数的获取方法具体可以如下:
第i个数据包包含的语音序列的时长Durationi
Durationi=Timestampi+1-Timestampi
其中,Timestampi为第i个数据包的时间戳,Timestampi+1为第i+1个数据包的时间戳,可以从数据包的RTP头中读取。
第i个数据包包含的语音序列的比特数Bi
Bi=LIPi-HIPi-HUDPi-HRTPi
其中,LIPi为第i个数据包的比特数,可以直接由IP头得到;HIPi为第i个数据包的IP协议头长度,HUDPi为第i个数据包的UDP协议头长度,HRTPi为第i个IP数据包的RTP协议头长度。
记录语音负载和该第i个数据包的语音时长Durationmax,其中,语音负载指的是数据包负载最大时的RTP负载比特数,记为Bmax。一般认为该第i个数据包是非静音,则该第i个数据包的非静音的码率为:
Figure PCTCN2014089401-appb-000019
此外,RTP头中的序列号域标示了数据包的顺序,根据每个数据包的RTP序列号就可以确定丢失帧的位置(就丢帧位置)和数量。
303、解析模块根据语音负载进行自适应多速率(AMR,Adaptive Multi-Rate)解码,得到AMR解码后语音信号。
304、解析模块根据语音序列的时长和语音序列的比特数计算该AMR解码后语音信号中每一帧的帧能量和平均帧能量。
其中,每一帧的帧能量可以根据人耳的听觉特性和主观体验进行量化而得,若该帧能量大于0,则为语音帧,并依此计算语音帧的平均能力,得到平均帧能量。
305、检测模块根据该丢帧位置在所述数据包中确定当前需要检测的丢失帧部分。
其中,一个丢失帧部分可以包括连续的多个丢失帧。
306、检测模块根据计算出的帧能量和平均帧能量确定该丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性。
例如,若当前丢失帧为第n帧到第n+m-1帧(即该丢失帧部分为第n帧到第n+m-1帧),前一个相邻的未丢失帧为第n-1帧,后一个相邻的未丢失帧为第n+m帧,则此时,检测模块可以分别根据该语音序列的时长、语音序列的比特数和语音负载确定第n-1帧的帧内容特性和第n+m帧的帧内容特性。
其中,帧内容特性可以包括静音帧和语音帧,则根据计算出的帧能量和平均帧能量确定未丢失帧的帧内容特性,包括:
若该未丢失帧的帧能量小于等于0,则确定该未丢失帧为静音帧;
若该未丢失帧的帧能量大于0小于平均帧能量,则确定该未丢失帧为非关键语音帧;
若该未丢失帧的帧能量大于平均帧能量,则确定该未丢失帧为关键语音帧。
307、检测模块根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性,具体可以如下:
例如,若当前丢失帧为第n帧到第n+m-1帧(即该丢失帧部分为第n帧到第n+m-1帧),前一个相邻的未丢失帧为第n-1帧,后一个相邻的未丢失帧为第n+m帧,则此时具体可以如下:
a、若确定第n-1帧和第n+m帧均为静音帧,则确定该丢失帧部分为静音帧;
b、若确定第n-1帧和第n+m帧均为关键语音帧,则确定该丢失帧部分为关 键语音帧;
c、若确定第n-1帧和第n+m帧均为非关键语音帧,则确定该丢失帧部分为非关键语音帧;
d、若确定第n-1帧为关键语音帧,以及第n+m帧为静音帧,则确定该丢失帧部分的前一半部分为关键语音帧,该丢失帧部分的后一半部分为非关键语音帧;
e、若确定第n-1帧为静音帧,以及第n+m帧为关键语音帧,则确定该丢失帧部分的前一半部分为非关键语音帧,该丢失帧部分的后一半部分为关键语音帧;
f、若确定第n-1帧为关键语音帧,以及第n+m帧为非关键语音帧,则确定该丢失帧部分为关键语音帧;
g、若确定第n-1帧为非关键语音帧,以及第n+m帧为关键语音帧,则确定该丢失帧部分为关键语音帧;
h、若确定第n-1帧为非关键语音帧,以及第n+m帧为静音帧,则确定该丢失帧部分为非关键语音帧;
i、若确定第n-1帧为静音帧,以及第n+m帧为非关键语音帧,则确定该丢失帧部分为非关键语音帧。
308、划分单元根据确定的帧内容特性对该语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件。
例如,具体可以采用如下的方法进行语句和丢帧事件的划分:
(1)确定静音帧连续出现的帧数超过预置次数时,将该静音帧之前的语音序列划分为语句;
即出现连续Ns帧以上的静音帧时,将该静音帧之前的语音序列划分为语句,其中,该Ns可以根据实际应用的需求进行设置,比如,可以设置Ns为6等。
(2)确定该语句中相邻两次丢帧部分的距离小于等于预置距离时,将该相邻两次丢帧部分确定为一次丢帧事件。
(3)确定该语句中相邻两次丢帧部分的距离大于预置距离时,将该相邻两次丢帧部分确定为两次丢帧事件。
其中,预置次数和预置距离可以根据实际应用的需求进行设置,比如,预 置次数可以设置为6,预置距离可以设置为10,等等。
309、非语音参数提取模块根据该丢帧事件提取非语音参数。
其中,该非语音参数可以包括位置参数和离散分布参数,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度等,如下:
非关键语音帧和关键语音帧之间的距离Lj:根据人耳的听觉感知特性,丢失的非关键语音帧距离相邻关键语音帧越远,引起的失真越小。
语音帧丢帧次数N1:指的是丢帧事件中丢失语音帧的次数。
语音帧的一次丢失长度N0k:指的是每次发生丢帧时,连续丢失的语音帧数。
损伤长度Lk:指的是相邻两次丢帧事件未丢失语音帧的数目。
语音帧的平均丢失长度N0
Figure PCTCN2014089401-appb-000020
损伤长度L:
Figure PCTCN2014089401-appb-000021
310、语音质量评估模块根据得到的非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数,例如,具体可以如下:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数,如下:
将非关键语音帧映射成丢失的关键语音帧数,用公式表示可以为:
FLNi,j=f(Lj);
比如,具体可以为:FLNi,j=exp(-0.033*Lj)。
其中,FLNi,j为第i个丢帧事件中第j个非关键语音帧映射的关键语音帧数,Lj为第j个非关键语音帧与关键语音帧之间的距离。
丢失的语音帧总数可以为:
Figure PCTCN2014089401-appb-000022
其中,FLNi为第i个丢帧事件映射得到的丢失的语音帧总数(即关键语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
B、离散丢帧的情况;
确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
FLNi,k=f(N1,N0k,Lk);
比如,具体可以为:
Figure PCTCN2014089401-appb-000023
其中,N0k为语音帧一次丢失长度,A0k为语音帧丢帧次数和一次丢失长度对单个未丢失语音帧的损伤影响,Lk为丢帧事件第k次出现的损伤长度,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中单帧映射的语音帧数。其中,参数a1,b1,c1,a2,b2,c2,a3,b3和c3可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000024
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目), ni表示实际丢失的关键语音帧数。
311、语音质量评估模块根据该丢失的语音帧总数计算语句的语音质量,如下:
一个语句的丢失语音帧数为:
FLN=f(FLN1,FLN2,...,FLNM);
其中,M为每个语句中丢帧事件个数,FLNi为每个丢帧事件的映射得到的丢失的语音帧总数。比如,该函数具体可以如下:
Figure PCTCN2014089401-appb-000025
由于不考虑遭受数据包丢失的语句质量MOS0(即语句的压缩失真)为:
MOS0=f(R);
因此,考虑遭受数据包丢失的语句质量Qn可以为:
Qn=f(MOS0,FLN);
比如,该函数具体可以如下:
Figure PCTCN2014089401-appb-000026
其中,D为语句失真,MOS0不考虑遭受数据包丢失的语句质量(即语句的压缩失真),Qn为考虑遭受数据包丢失的语句质量,a和b为模型固定参数,a和b可以通过训练得到。
其中,如果通过主观实验建立R与MOS0的匹配数据表格,则质量评估时可以直接查表得到MOS0,具体可参见表二,在此不再赘述。
312、语音质量评估模块根据每条语句的语音质量评估该语音序列的语音质量,即将该语音序列中的每条语句的语音质量进行综合,得到该语音序列的语音质量Q,如下:
Q=f(Q1,Q2,...,QN);
其中,Qn为考虑遭受数据包丢失的语句质量,N为语音序列中语句的个数。比如,该函数具体可以如下:
Figure PCTCN2014089401-appb-000027
其中,Q为语音序列的语音质量,Qn为考虑遭受数据包丢失的语句质量,N为语音序列中语句的个数。
由上可知,本实施例采用基于比特流层模型的方式对获取到的网络语音的数据包进行解析,并根据解析结果确定该数据包的帧内容特性,比如确定是静音帧和语音帧,然后根据确定的帧内容特性对语音序列进行语句划分,并将语句划分为多个丢帧事件,在根据丢帧事件提取非语音参数(包括位置参数和离散分布参数)后,根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,最后,根据每条语句的语音质量评估整个语音序列的语音质量;由于在该方案中,可以通过对语音序列进行语句划分和丢帧事件划分,使得单个丢帧事件中的丢帧模式相对比较简单,从而易于研究每个丢帧事件所带来的失真影响;而且,由于该方案在评估网络语音质量的过程中,将帧内容特性(比如确定静音帧还是语音帧)和丢帧位置也作为考虑的因素,因此,相对于现有技术只根据其平均失真情况来衡量网络语音质量的方案而言,可以有效地提高网络语音质量评估的精度;也就是说,采用该方案,可以大大地提高预测精度,从而提高评估结果的准确性。
实施例四、
除了实施例二和三所提供的对丢帧事件的失真映射方案方法之外,还可以采用其他的方式对丢帧事件进行映射。即步骤“语音质量评估模块根据得到的非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数”具体也可以如下:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢 失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数,如下:
将非关键语音帧映射成丢失的关键语音帧数,用公式表示可以为:
FLNi,j=f(Lj);
比如,具体可以为:
FLNi,j=exp(-0.033*Lj);
其中,FLNi,j为第i个丢帧事件中第j个非关键语音帧映射的关键语音帧数,Lj为第j个非关键语音帧与关键语音帧之间的距离。
丢失的语音帧总数可以为:
Figure PCTCN2014089401-appb-000028
其中,FLNi为第i个丢帧事件映射得到的丢失的语音帧总数(即关键语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
B、离散丢帧的情况;
确定离散丢帧时,根据该语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的语音帧数将该丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
FLNi,k=f(N0k,Lk);
比如,具体可以为:
Figure PCTCN2014089401-appb-000029
其中,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中映射的语音帧数,A0k为丢帧事件中一次丢失长度对单个未丢失语音帧的损伤影响,参数a1,b1,c1,a2,b2和c2可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000030
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
可见,对于连续丢帧的情况,本实施例和实施例二和三关于失真映射的处理方式是一致的,但是对于离散丢帧的情况,本实施例所采用的方案只需考虑非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度等因素,而无需考虑语音帧丢帧次数,而实施例二和三所采用的方案只除了需要考虑非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度等因素之外,还需考虑考虑语音帧丢帧次数,这两种方案各有优点,在实际应用时,可以根据需求自行进行选择。
需说明的是,在本实施例中,除了上述失真映射方法与实施例二和三略有不同之外,其他步骤的实施与实施例二和三相同,故在此不再赘述,具体可参见实施例二和三。
本实施例同样可以实现实施例二和三一样的有益效果,详见前面实施例,在此不再赘述。
实施例五、
在实施例二、三和四中,所提取的非语音参数主要包括关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度等,与实施例二、三和四不同的是,本实施例所提取的非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度等,如下:
非关键语音帧和关键语音帧之间的距离Lj:根据人耳的听觉感知特性,丢失的非关键语音帧距离相邻关键语音帧越远,引起的失真越小。
语音帧丢帧次数N1:指的是丢帧事件中丢失语音帧的次数。
损伤长度Lk:指的是相邻两次丢帧事件未丢失语音帧的数目。
语音帧的平均丢失长度N0
Figure PCTCN2014089401-appb-000031
损伤长度L:
Figure PCTCN2014089401-appb-000032
由于所提取的非语音参数与实施例二、三和四不同,因此,后续对失真事件的失真映射也不同,在实施例二、三和四中,均需要计算单帧的失真,而在本实施例中,可以直接计算整个丢帧事件的失真,如下:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将该丢帧事件映射成丢失的语音帧总数。
与实施例二、三和四中关于连续丢帧的情况的处理方式相同,即具体可以如下:
将非关键语音帧映射成丢失的关键语音帧数,用公式表示可以为:
FLNi,j=f(Lj);
比如,具体可以为:
FLNi,j=exp(-0.033*Lj);
其中,FLNi,j为第i个丢帧事件中第j个非关键语音帧映射的关键语音帧数,Lj为第j个非关键语音帧与关键语音帧之间的距离。
丢失的语音帧总数可以为:
Figure PCTCN2014089401-appb-000033
其中,FLNi为第i个丢帧事件映射得到的丢失的语音帧总数(即关键语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
B、离散丢帧的情况;
确定离散丢帧时,根据该平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数,用公式表示可以为:
Figure PCTCN2014089401-appb-000034
其中,FLNi为将该丢帧事件映射成丢失的语音帧总数,N0为语音帧的平均丢失长度,L为损伤长度;其中,参数a1,b1,c1,a2,b2,a3和b3可以通过训练得到。
可见,对于连续丢帧的情况,本实施例与实施例二、三和四关于失真映射的处理方式是一致的,但是对于离散丢帧的情况,本实施例所采用的方案与实施例二、三和四并不同,在实施例二、三和四中,需要计算单帧的失真,然后再综合单帧的失真来得到整个丢帧事件的失真,而在本实施例的方案中,可以直接根据语音帧的平均丢失长度和损伤长度等计算整个丢帧事件的失真。
需说明的是,本实施例所提供的方案和实施例二、三和四所提供的方案各有优点,在实际应用时,可以根据需求自行进行选择。
此外,还需说明的是,在本实施例中,除了上述提取非语音参数和失真映射方法与实施例二、三和四略有不同之外,其他步骤的实施与实施例二、三和四相同,故在此不再赘述,具体可参见实施例二、三和四。
本实施例同样可以实现实施例二、三和四一样的有益效果,详见前面实施例,在此不再赘述。
实施例六、
相应的,本发明实施例还提供一种网络语音质量评估装置,如图4所示,该网络语音质量评估装置包括获取单元401、解析单元402、确定单元403、划分单元404、提取单元405和评估单元406。
获取单元401,用于获取网络语音的数据包,其中,该网络语音的数据包包括语音序列;
解析单元402,用于对获取单元获取到的数据包进行解析,得到解析结果。
确定单元403,用于根据解析单元得到的解析结果确定该数据包的帧内容 特性,其中,帧内容特性可以包括静音帧和语音帧。
划分单元404,用于根据确定单元确定的帧内容特性对语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件。
提取单元405,用于根据划分单元划分的丢帧事件提取非语音参数,其中,非语音参数包括位置参数和离散分布参数。
评估单元406,用于根据提取单元提取的非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量,根据每条语句的语音质量评估所述语音序列的语音质量。
其中,根据不同的网络语音评估模型,其对数据包的解析方法也有所不同,比如,以包层模型和比特流层模型为例,具体可以如下:
(1)包层模型;
解析单元402,具体可以用于对所述数据包的包头进行解析,得到解析结果,其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等。
例如,以第i个数据包为例,则上述参数的获取方法具体可以如下:
第i个数据包包含的语音序列的时长Durationi
Durationi=Timestampi+1-Timestampi
其中,Timestampi为第i个数据包的时间戳,Timestampi+1为第i+1个数据包的时间戳,可以从数据包的RTP头中读取。
第i个数据包包含的语音序列的比特数Bi
Bi=LIPi-HIPi-HUDPi-HRTPi
其中，LIPi为第i个数据包的比特数，可以直接由IP头得到；HIPi为第i个数据包的IP协议头长度，HUDPi为第i个数据包的UDP协议头长度，HRTPi为第i个IP数据包的RTP协议头长度。
记录语音负载和该第i个数据包的语音时长Durationmax,其中,语音负载指的是数据包负载最大时的RTP负载比特数,记为Bmax。一般认为该第i个数据包是非静音,则该第i个数据包的非静音的码率为:
Figure PCTCN2014089401-appb-000035
此外,RTP头中的序列号域标示了数据包的顺序,可以根据每个数据包的RTP序列号来确定丢失帧的位置(就丢帧位置)和数量。
则此时,确定单元403,具体可以用于根据该丢帧位置在该数据包中确定当前需要检测的丢失帧部分,分别根据该语音序列的时长、语音序列的比特数和语音负载确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,以及确定后一个相邻的未丢失帧的标记,根据所述前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定所述丢失帧部分的帧内容特性。
其中,根据所述语音序列的时长、语音序列的比特数和语音负载确定未丢失帧的帧内容特性,具体可以包括:获取未丢失帧的实际有效载荷长度;根据该语音负载、语音序列的比特数和语音序列的时长确定码率(即编码速率);若该码率所对应的标准有效载荷长度与该实际有效载荷长度一致,则确定该未丢失帧为语音帧;若该码率所对应的标准有效载荷长度与实际有效载荷长度不一致,则确定该未丢失帧为静音帧,即:
确定单元403,具体可以用于获取未丢失帧的实际有效载荷长度;根据所述语音负载、语音序列的比特数和语音序列的时长确定码率;若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度一致,则确定所述未丢失帧为语音帧;若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度不一致,则确定所述未丢失帧为静音帧。
其中,“根据该前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定该丢失帧部分的帧内容特性”具体可以如下:若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示所述后一个相邻的未丢失帧为第一个语音帧(比如标记为1),则确定所述丢失帧部分为静音帧,否则,均确定该丢失帧部分为语音帧,即:
确定单元403,具体可以用于若确定前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示该后一个相邻的 未丢失帧为第一个语音帧,则确定所述丢失帧部分为静音帧,否则,确定该丢失帧部分为语音帧。
此外,为了进一步提高预测精度,还可以将语音帧划分为关键语音帧和非关键语音帧,以便后续可以针对这些关键语音帧和非关键语音帧作出不同的处理。其中,关键语音帧指的是对语音质量的影响较大的帧,而关键语音帧指的是对语音质量的影响较小的帧。
若将语音帧划分为关键语音帧和非关键语音帧,则“确定该丢失帧部分为语音帧”具体可以包括如下情况:
a、在确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为语音帧时,确定该丢失帧部分为关键语音帧;
b、在确定该前一个相邻的未丢失帧为语音帧,以及该后一个相邻的未丢失帧为静音帧时,确定该丢失帧部分的前一半部分为关键语音帧,以及确定该丢失帧部分的后一半部分为非关键语音帧;
c、在确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为语音帧时,确定该丢失帧部分的前一半部分为非关键语音帧,以及确定丢失帧部分的后一半部分为关键语音帧。
也就是说,确定单元403,具体可以用于执行上述a~c的操作。
(2)比特流层模型;
与包层模型不同的是,比特流程模型除了需要对数据包的包头进行解析之外,还需要对语音负载部分也进行解析,如下:
解析单元402,具体可以用于对所述数据包的包头进行解析,得到解析结果,其中,该解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等;根据所述语音负载进行AMR解码,得到AMR解码后语音信号;根据该语音序列的时长和语音序列的比特数计算所述AMR解码后语音信号中每一帧的帧能量和平均帧能量;
其中,解析结果中所包含的各种信息的具体获取与包层模型相同,在此不再赘述。
则此时,确定单元403,具体可以用于根据该丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,根据计算出的帧能量和平均帧能量确定该丢失 帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,根据该前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性。
其中,根据计算出的帧能量和平均帧能量确定未丢失帧的帧内容特性,包括:若该未丢失帧的帧能量小于等于0,则确定该未丢失帧为静音帧;若该未丢失帧的帧能量大于0小于平均帧能量,则确定该未丢失帧为非关键语音帧;若该未丢失帧的帧能量大于平均帧能量,则确定该未丢失帧为关键语音帧。即:
确定单元403,具体可以用于若所述未丢失帧的帧能量小于等于0,则确定所述未丢失帧为静音帧;若所述未丢失帧的帧能量大于0小于平均帧能量,则确定所述未丢失帧为非关键语音帧;若所述未丢失帧的帧能量大于平均帧能量,则确定所述未丢失帧为关键语音帧。
其中,“根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性”具体可以如下:
a、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,则确定该丢失帧部分为静音帧;
b、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为关键语音帧,则确定该丢失帧部分为关键语音帧;
c、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为非关键语音帧,则确定该丢失帧部分为非关键语音帧;
d、若确定该前一个相邻的未丢失帧为关键语音帧,以及该后一个相邻的未丢失帧为静音帧,则确定该丢失帧部分的前一半部分为关键语音帧,该丢失帧部分的后一半部分为非关键语音帧;
e、若确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为关键语音帧,则确定该丢失帧部分的前一半部分为非关键语音帧,该丢失帧部分的后一半部分为关键语音帧;
f、若确定该前一个相邻的未丢失帧为关键语音帧,以及该后一个相邻的未丢失帧为非关键语音帧,则确定该丢失帧部分为关键语音帧;
g、若确定该前一个相邻的未丢失帧为非关键语音帧,以及该后一个相邻的未丢失帧为关键语音帧,则确定该丢失帧部分为关键语音帧;
h、若确定该前一个相邻的未丢失帧为非关键语音帧,以及该后一个相邻的未丢失帧为静音帧,则确定该丢失帧部分为非关键语音帧;
i、若确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为非关键语音帧,则确定该丢失帧部分为非关键语音帧。
也就是说,确定单元403,具体可以用于执行上述a~i的操作。
其中,划分单元404,具体可以用于在确定静音帧连续出现的帧数超过预置次数时,将所述静音帧之前的语音序列划分为语句;以及,在确定语句中相邻两次丢帧部分的距离小于等于预置距离时,将所述相邻两次丢帧部分确定为一次丢帧事件;以及,在确定语句中相邻两次丢帧部分的距离大于预置距离时,将所述相邻两次丢帧部分确定为两次丢帧事件。
其中,预置次数和预置距离可以根据实际应用的需求进行设置,比如,预置次数可以设置为6,预置距离可以设置为10,等等。
其中,评估单元406,具体可以用于根据提取单元405提取到的非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数;根据该丢失的语音帧总数计算语句的语音质量。
其中,步骤“根据该非语音参数按照预置的语音质量评估模型对该丢帧事件进行失真映射,得到丢失的语音帧总数”具体可以采用如下任意一种方式来实现:
(1)第一种方式;
非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则此时,评估单元406具体可以用于执行如下操作:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数,如下:
将非关键语音帧映射成丢失的关键语音帧数,用公式表示可以为:
FLNi,j=f(Lj);
比如,具体可以为:FLNi,j=exp(-0.033*Lj);
其中,FLNi,j为第i个丢帧事件中第j个非关键语音帧映射的关键语音帧数,Lj为第j个非关键语音帧与关键语音帧之间的距离。
丢失的语音帧总数可以为:
Figure PCTCN2014089401-appb-000036
其中,FLNi为第i个丢帧事件映射得到的丢失的语音帧总数(即关键语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
B、离散丢帧的情况;
确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
FLNi,k=f(N1,N0k,Lk);
比如,具体可以为:
Figure PCTCN2014089401-appb-000037
其中,N0k为语音帧一次丢失长度,A0k为语音帧丢帧次数和一次丢失长度对单个未丢失语音帧的损伤影响,Lk为丢帧事件第k次出现的损伤长度,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中单帧映射的语音帧数。其中,参数a1,b1,c1,a2,b2,c2,a3,b3和c3可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000038
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
(2)第二种方式;
非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则此时,评估单元406具体可以用于执行如下操作:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将该丢帧事件映射成丢失的语音帧总数。
与第一种方式中的连续丢帧的情况的处理方式相同,详见前面的描述,在此不再赘述。
B、离散丢帧的情况;
确定离散丢帧时,根据该语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的语音帧数将该丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
FLNi,k=f(N0k,Lk);
比如,具体可以为:
Figure PCTCN2014089401-appb-000039
其中,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中映射的语音帧数,A0k为丢帧事件中一次丢失长度对单个未丢失语音帧的损伤影响,参数a1,b1,c1, a2,b2和c2可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000040
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
(3)第三种方式;
与第一和第二种方式不同,第三种方式不再计算单帧的失真,而是直接计算整个丢帧事件的失真。
其中,非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度,则此时,评估单元406具体可以用于执行如下操作:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将该丢帧事件映射成丢失的语音帧总数。
与第一种方式中的连续丢帧的情况的处理方式相同,详见前面的描述,在此不再赘述。
B、离散丢帧的情况;
确定离散丢帧时,根据该平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数,用公式表示可以为:
Figure PCTCN2014089401-appb-000041
其中,FLNi为将该丢帧事件映射成丢失的语音帧总数,N0为语音帧的平均丢失长度,L为损伤长度;其中,参数a1,b1,c1,a2,b2,a3和b3可以通过训 练得到。
在得到丢失的语音帧总数之后,就可以根据该丢失的语音帧总数计算语句的语音质量,如下:
一个语句的丢失语音帧数为:
FLN=f(FLN1,FLN2,...,FLNM);
其中,M为每个语句中丢帧事件个数,FLNi为每个丢帧事件的映射得到的丢失的语音帧总数。
由于不考虑遭受数据包丢失的语句质量MOS0为:
MOS0=f(R);
因此,考虑遭受数据包丢失的语句质量Qn可以为:
Qn=f(MOS0,FLN);
其中,如果通过主观实验建立R与MOS0的匹配数据表格,则质量评估时可以直接查表得到MOS0
具体实施时,以上各个单元可以作为独立的实体来实现,也可以进行任意组合,作为同一或若干个实体来实现,以上各个单元的实施具体可参见前面的实施例,在此不再赘述。
该网络语音质量评估装置具体可以集成在服务器等网络侧设备中。
由上可知,本实施例的网络语音质量评估装置的解析单元402可以将获取单元401获取到的网络语音的数据包进行解析,并由确定单元403根据解析结果确定该数据包的帧内容特性,比如确定是静音帧和语音帧,然后由划分单元404根据确定的帧内容特性对语音序列进行语句划分,并将语句划分为多个丢帧事件,在提取单元405根据丢帧事件提取非语音参数(包括位置参数和离散分布参数)后,由评估单元406根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,最后,根据每条语句的语音质量评估整个语音序列的语音质量;由于在该方案中,可以通过对语音序列进行语句划分和丢帧事件划分,使得单个丢帧事件中的丢帧模式相对比较简单,从而易于研究每个丢帧事件所带来的失真影响;而且,由于该方案在评估网络语音质量的过程中,将帧内容特性(比如确定静音帧还是语音帧)和丢帧位置也作为考虑的因素, 因此,相对于现有技术只根据其平均失真情况来衡量网络语音质量的方案而言,可以有效地提高网络语音质量评估的精度;也就是说,采用该方案,可以大大地提高预测精度,从而提高评估结果的准确性。
实施例七、
相应的,本发明实施例还提供一种通信系统,包括本发明实施例提供的任一种网络语音质量评估装置,该网络语音质量评估装置具体可参见实施例六,在此不再赘述。
由于该通信系统可以包括本发明实施例提供的任一种网络语音质量评估装置,因此,可以实现本发明实施例提供的任一种网络语音质量评估装置所能实现的有益效果,详见前面的实施例,在此不再赘述。
实施例八、
此外,本发明实施例还提供一种网络侧设备,包括用于存储数据的存储器501、用于收发数据的收发接口502、以及处理器503;其中:
处理器503,可以用于通过收发接口502获取网络语音的数据包,其中,该网络语音的数据包包括语音序列;对该数据包进行解析,得到解析结果;根据该解析结果确定该数据包的帧内容特性,根据确定的帧内容特性对该语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件;根据丢帧事件提取非语音参数;根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量;根据该每条语句的语音质量评估语音序列的语音质量。
其中,根据不同的网络语音评估模型,处理器503对数据包的解析方式也有所不同,比如,以包层模型和比特流层模型为例,处理器503对数据包的解析具体可以如下:
(1)包层模型;
具体可以对数据包的包头进行解析,得到解析结果。其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等。例如,以第i个数据包为例,则上述参数的获取方法具体可以如下:
第i个数据包包含的语音序列的时长Durationi
Durationi=Timestampi+1-Timestampi
其中,Timestampi为第i个数据包的时间戳,Timestampi+1为第i+1个数据包的时间戳,可以从数据包的实时传输协议(RTP,Real-time Transport Protocol)头中读取。
第i个数据包包含的语音序列的比特数Bi
Bi=LIPi-HIPi-HUDPi-HRTPi
其中,LIPi为第i个数据包的比特数,可以直接由IP头得到;HIPi为第i个数据包的IP协议头长度,HUDPi为第i个数据包的用户数据包协议(UDP,User Datagram Protocol)协议头长度,HRTPi为第i个IP数据包的RTP协议头长度。
记录语音负载和该第i个数据包的语音时长Durationmax,其中,语音负载指的是数据包负载最大时的RTP负载比特数,记为Bmax。一般认为该第i个数据包是非静音,则该第i个数据包的非静音的码率为:
Figure PCTCN2014089401-appb-000042
此外,RTP头中的序列号域标示了数据包的顺序,根据每个数据包的RTP序列号就可以确定丢失帧的位置(就丢帧位置)和数量。
(2)比特流层模型;
与包层模型不同的是,比特流程模型除了需要对数据包的包头进行解析之外,还需要对语音负载部分也进行解析,如下:
A、对数据包的包头进行解析,得到解析结果,其中,该解析结果可以包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载等信息,这些信息的具体获取与包层模型相同,在此不再赘述。
B、根据语音负载进行自适应多速率(AMR,Adaptive Multi-Rate)解码,得到AMR解码后语音信号。
C、根据语音序列的时长和语音序列的比特数计算该AMR解码后语音信号中每一帧的帧能量和平均帧能量。
其中,每一帧的帧能量可以根据人耳的听觉特性和主观体验进行量化而得,若该帧能量大于0,则为语音帧,并依此计算语音帧的平均能力,得到平 均帧能量。
对于不同的网络语音评估模型,由于其获取到的解析结果不同,因此,其确定数据包的帧内容特性的方式也有所不同,比如,还是以包层模型和比特流层模型为例,处理器503确定数据包的帧内容特性的方式具体可以如下:
(1)包层模型;
A、根据该丢帧位置在该数据包中确定当前需要检测的丢失帧部分。
B、分别根据该语音序列的时长、语音序列的比特数和语音负载确定该丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,以及确定该后一个相邻的未丢失帧的标记(mark),该标记为帧的序号。
其中,根据所述语音序列的时长、语音序列的比特数和语音负载确定未丢失帧的帧内容特性,具体可以包括:
获取未丢失帧的实际有效载荷长度;
根据该语音负载、语音序列的比特数和语音序列的时长确定码率(即编码速率);
若该码率所对应的标准有效载荷长度与该实际有效载荷长度一致,则确定该未丢失帧为语音帧;
若该码率所对应的标准有效载荷长度与实际有效载荷长度不一致,则确定该未丢失帧为静音帧。
C、根据该前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定该丢失帧部分的帧内容特性,例如,具体可以如下:
若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示所述后一个相邻的未丢失帧为第一个语音帧(比如标记为1),则确定所述丢失帧部分为静音帧,否则,均确定该丢失帧部分为语音帧。
此外,为了进一步提高预测精度,还可以将语音帧划分为关键语音帧和非关键语音帧,以便后续可以针对这些关键语音帧和非关键语音帧作出不同的处理。其中,关键语音帧指的是对语音质量的影响较大的帧,而关键语音帧指的 是对语音质量的影响较小的帧。
若将语音帧划分为关键语音帧和非关键语音帧,则在步骤“确定该丢失帧部分为语音帧”具体可以包括如下情况:
a、在确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为语音帧时,确定该丢失帧部分为关键语音帧;
b、在确定该前一个相邻的未丢失帧为语音帧,以及该后一个相邻的未丢失帧为静音帧时,确定该丢失帧部分的前一半部分为关键语音帧,以及确定该丢失帧部分的后一半部分为非关键语音帧;
c、在确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为语音帧时,确定该丢失帧部分的前一半部分为非关键语音帧,以及确定丢失帧部分的后一半部分为关键语音帧。
(2)比特流层模型;
比特流层模型的帧内容检测较包层模型更为精细,比如,其中语音帧可以包括关键语音帧和非关键语音帧,等等。
对于比特流层模型来说,“根据得到的解析结果确定该数据包的帧内容特性”的操作具体可以如下:
A、根据该丢帧位置在所述数据包中确定当前需要检测的丢失帧部分。
B、根据计算出的帧能量和平均帧能量确定该丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性。
其中,根据计算出的帧能量和平均帧能量确定未丢失帧的帧内容特性,包括:
若该未丢失帧的帧能量小于等于0,则确定该未丢失帧为静音帧;
若该未丢失帧的帧能量大于0小于平均帧能量,则确定该未丢失帧为非关键语音帧;
若该未丢失帧的帧能量大于平均帧能量,则确定该未丢失帧为关键语音帧。
C、根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性,具体可以如下:
a、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧, 则确定该丢失帧部分为静音帧;
b、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为关键语音帧,则确定该丢失帧部分为关键语音帧;
c、若确定该前一个相邻的未丢失帧和后一个相邻的未丢失帧均为非关键语音帧,则确定该丢失帧部分为非关键语音帧;
d、若确定该前一个相邻的未丢失帧为关键语音帧,以及该后一个相邻的未丢失帧为静音帧,则确定该丢失帧部分的前一半部分为关键语音帧,该丢失帧部分的后一半部分为非关键语音帧;
e、若确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为关键语音帧,则确定该丢失帧部分的前一半部分为非关键语音帧,该丢失帧部分的后一半部分为关键语音帧;
f、若确定该前一个相邻的未丢失帧为关键语音帧,以及该后一个相邻的未丢失帧为非关键语音帧,则确定该丢失帧部分为关键语音帧;
g、若确定该前一个相邻的未丢失帧为非关键语音帧,以及该后一个相邻的未丢失帧为关键语音帧,则确定该丢失帧部分为关键语音帧;
h、若确定该前一个相邻的未丢失帧为非关键语音帧,以及该后一个相邻的未丢失帧为静音帧,则确定该丢失帧部分为非关键语音帧;
i、若确定该前一个相邻的未丢失帧为静音帧,以及该后一个相邻的未丢失帧为非关键语音帧,则确定该丢失帧部分为非关键语音帧。
其中,处理器503在划分语句和丢帧事件时,具体可以执行如下操作:
确定静音帧连续出现的帧数超过预置次数时,将该静音帧之前的语音序列划分为语句;
确定该语句中相邻两次丢帧部分的距离小于等于预置距离时,将该相邻两次丢帧部分确定为一次丢帧事件。
确定该语句中相邻两次丢帧部分的距离大于预置距离时,将该相邻两次丢帧部分确定为两次丢帧事件。
其中,预置次数和预置距离可以根据实际应用的需求进行设置,比如,预置次数可以设置为6,预置距离可以设置为10,等等。
此外,处理器503在根据该非语音参数按照预置的语音质量评估模型对每 条语句的语音质量进行评估时,具体可以根据该非语音参数按照预置的语音质量评估模型对该丢帧事件进行失真映射,得到丢失的语音帧总数,然后再根据该丢失的语音帧总数计算语句的语音质量。
其中,“根据该非语音参数按照预置的语音质量评估模型对该丢帧事件进行失真映射,得到丢失的语音帧总数”具体可以采用如下任意一种方式来实现:
(1)第一种方式;
非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则此时,步骤“根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数”可以包括:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数,如下:
将非关键语音帧映射成丢失的关键语音帧数,用公式表示可以为:
FLNi,j=exp(-0.033*Lj);
其中,FLNi,j为第i个丢帧事件中第j个非关键语音帧映射的关键语音帧数,Lj为第j个非关键语音帧与关键语音帧之间的距离。
丢失的语音帧总数可以为:
Figure PCTCN2014089401-appb-000043
其中,FLNi为第i个丢帧事件映射得到的丢失的语音帧总数(即关键语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
B、离散丢帧的情况;
确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及 损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
Figure PCTCN2014089401-appb-000044
其中,N0k为语音帧一次丢失长度,A0k为语音帧丢帧次数和一次丢失长度对单个未丢失语音帧的损伤影响,Lk为丢帧事件第k次出现的损伤长度,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中单帧映射的语音帧数。其中,参数a1,b1,c1,a2,b2,c2,a3,b3和c3可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000045
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
(2)第二种方式;
非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则所述根据该非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数,包括:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将该丢帧事件映射成丢失的语音帧总数。
与第一种方式中的连续丢帧的情况的处理方式相同,详见前面的描述,在 此不再赘述。
B、离散丢帧的情况;
确定离散丢帧时,根据该语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的语音帧数将该丢帧事件映射成丢失的语音帧总数;例如,具体可以如下:
将损伤帧映射成丢失的语音帧数,用公式表示可以为:
Figure PCTCN2014089401-appb-000046
其中,FLNi,k为第i个丢帧事件中Lk个语音损伤帧中映射的语音帧数,A0k为丢帧事件中一次丢失长度对单个未丢失语音帧的损伤影响,参数a1,b1,c1,a2,b2和c2可以通过训练得到。
将该丢帧事件映射成丢失的语音帧总数,可以为:
Figure PCTCN2014089401-appb-000047
FLNi为丢帧事件映射得到的丢失的语音帧总数(即语音帧丢失总数目),ni表示实际丢失的关键语音帧数。
(3)第三种方式;
与第一和第二种方式不同,第三种方式不再计算单帧的失真,而是直接计算整个丢帧事件的失真。
其中,非语音参数可以包括位置参数和离散分布参数等,其中,该非语音参数可以包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度,则所述根据所述非语音参数将该丢帧事件中不同位置下的丢失帧和不同离散分布下的丢失帧映射成丢失的语音帧总数,具体可以包括:
A、连续丢帧的情况;
确定连续丢帧时,根据该非关键语音帧和关键语音帧之间的距离将丢帧事 件中的非关键语音帧映射成丢失的关键语音帧数,根据该语音帧丢帧次数确定实际丢失的关键语音帧数,根据该实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将该丢帧事件映射成丢失的语音帧总数。
与第一种方式中的连续丢帧的情况的处理方式相同,详见前面的描述,在此不再赘述。
B、离散丢帧的情况;
确定离散丢帧时,根据该平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数,用公式表示可以为:
Figure PCTCN2014089401-appb-000048
其中,FLNi为将该丢帧事件映射成丢失的语音帧总数,N0为语音帧的平均丢失长度,L为损伤长度;其中,参数a1,b1,c1,a2,b2,a3和b3可以通过训练得到。
在得到丢失的语音帧总数之后,就可以根据该丢失的语音帧总数计算语句的语音质量,如下:
一个语句的丢失语音帧数为:
FLN=f(FLN1,FLN2,...,FLNM);
其中,M为每个语句中丢帧事件个数,FLNi为每个丢帧事件的映射得到的丢失的语音帧总数。
由于不考虑遭受数据包丢失的语句质量MOS0为:
MOS0=f(R);
因此,考虑遭受数据包丢失的语句质量Qn可以为:
Qn=f(MOS0,FLN);
其中,如果通过主观实验建立R与MOS0的匹配数据表格,则质量评估时可以直接查表得到MOS0
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
由上可知,本实施例的网络侧设备采用对获取到的网络语音的数据包进行解析,并根据解析结果确定该数据包的帧内容特性,比如确定是静音帧和语音帧,然后根据确定的帧内容特性对语音序列进行语句划分,并将语句划分为多个丢帧事件,在根据丢帧事件提取非语音参数(包括位置参数和离散分布参数)后,根据该非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,最后,根据每条语句的语音质量评估整个语音序列的语音质量;由于在该方案中,可以通过对语音序列进行语句划分和丢帧事件划分,使得单个丢帧事件中的丢帧模式相对比较简单,从而易于研究每个丢帧事件所带来的失真影响;而且,由于该方案在评估网络语音质量的过程中,将帧内容特性(比如确定静音帧还是语音帧)和丢帧位置也作为考虑的因素,因此,相对于现有技术只根据其平均失真情况来衡量网络语音质量的方案而言,可以有效地提高网络语音质量评估的精度;也就是说,采用该方案,可以大大地提高预测精度,从而提高评估结果的准确性。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。
以上对本发明实施例所提供的一种网络语音质量评估方法、装置和系统进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。

Claims (25)

  1. 一种网络语音质量评估方法,其特征在于,包括:
    获取网络语音的数据包,所述网络语音的数据包包括语音序列;
    对所述数据包进行解析,得到解析结果;
    根据所述解析结果确定所述数据包的帧内容特性,所述帧内容特性包括静音帧和语音帧;
    根据确定的帧内容特性对所述语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件;
    根据所述丢帧事件提取非语音参数,所述非语音参数包括位置参数和离散分布参数;
    根据所述非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量;
    根据所述每条语句的语音质量评估所述语音序列的语音质量。
  2. 根据权利要求1所述的方法,其特征在于,所述对所述数据包进行解析,得到解析结果,包括:
    对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载;
    所述根据所述解析结果确定所述数据包的帧内容特性,包括:根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,分别根据所述语音序列的时长、语音序列的比特数和语音负载确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,根据所述前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定所述丢失帧部分的帧内容特性。
  3. 根据权利要求2所述的方法,其特征在于,根据所述语音序列的时长、语音序列的比特数和语音负载确定未丢失帧的帧内容特性,包括:
    获取未丢失帧的实际有效载荷长度;
    根据所述语音负载、语音序列的比特数和语音序列的时长确定码率;
    若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度一致,则 确定所述未丢失帧为语音帧;
    若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度不一致,则确定所述未丢失帧为静音帧。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定所述丢失帧部分的帧内容特性,包括:
    若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示所述后一个相邻的未丢失帧为第一个语音帧,则确定所述丢失帧部分为静音帧,否则,确定所述丢失帧部分为语音帧。
  5. 根据权利要求4所述的方法,其特征在于,所述语音帧包括关键语音帧和非关键语音帧,则在所述确定所述丢失帧部分为语音帧包括:
    在确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为语音帧时,确定所述丢失帧部分为关键语音帧;
    在确定所述前一个相邻的未丢失帧为语音帧,以及所述后一个相邻的未丢失帧为静音帧时,确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;
    在确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为语音帧时,确定所述丢失帧部分的前一半部分为非关键语音帧,所述丢失帧部分的后一半部分为关键语音帧。
  6. 根据权利要求1所述的方法,其特征在于,所述对所述数据包进行解析,得到解析结果,包括:
    对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载;
    根据所述语音负载进行自适应多速率AMR解码,得到AMR解码后语音信号;
    根据所述语音序列的时长和语音序列的比特数计算所述AMR解码后语音信号中每一帧的帧能量和平均帧能量;
    所述根据所述解析结果确定所述数据包的帧内容特性,包括:根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,根据计算出的帧能量 和平均帧能量确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性。
  7. 根据权利要求6所述的方法,其特征在于,根据计算出的帧能量和平均帧能量确定未丢失帧的帧内容特性,包括:
    若所述未丢失帧的帧能量小于等于0,则确定所述未丢失帧为静音帧;
    若所述未丢失帧的帧能量大于0小于平均帧能量,则确定所述未丢失帧为非关键语音帧;
    若所述未丢失帧的帧能量大于平均帧能量,则确定所述未丢失帧为关键语音帧。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性,包括:
    若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,则确定所述丢失帧部分为静音帧;
    若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为关键语音帧,则确定所述丢失帧部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为非关键语音帧,则确定所述丢失帧部分为非关键语音帧;
    若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;
    若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为关键语音帧,则确定所述丢失帧部分的前一半部分为非关键语音帧,所述丢失帧部分的后一半部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻 的未丢失帧为关键语音帧,则确定所述丢失帧部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分为非关键语音帧;
    若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为非关键语音帧。
  9. 根据权利要求4、5或8所述的方法,其特征在于,所述根据确定的帧内容特性对所述语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件,包括:
    确定静音帧连续出现的帧数超过预置次数时,将所述静音帧之前的语音序列划分为语句;
    确定所述语句中相邻两次丢帧部分的距离小于等于预置距离时,将所述相邻两次丢帧部分确定为一次丢帧事件;
    确定所述语句中相邻两次丢帧部分的距离大于预置距离时,将所述相邻两次丢帧部分确定为两次丢帧事件。
  10. 根据权利要求4、5或8所述的方法,其特征在于,所述根据所述非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量,包括:
    根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数;
    根据所述丢失的语音帧总数计算语句的语音质量。
  11. 根据权利要求10所述的方法,其特征在于,所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度,则所述根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数,包括:
    确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
    确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及 损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;或者,
    确定离散丢帧时,根据所述语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数。
  12. 根据权利要求10所述的方法,其特征在于,所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度,则所述根据所述非语音参数将所述丢帧事件中不同位置下的丢失帧和不同离散分布下的丢失帧映射成丢失的语音帧总数,包括:
    确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
    确定离散丢帧时,根据所述平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数。
  13. 一种网络语音质量评估装置,其特征在于,包括:
    获取单元,用于获取网络语音的数据包,所述网络语音的数据包包括语音序列;
    解析单元,用于对获取单元获取到的数据包进行解析,得到解析结果;
    确定单元,用于根据解析单元得到的解析结果确定所述数据包的帧内容特性,所述帧内容特性包括静音帧和语音帧;
    划分单元,用于根据确定单元确定的帧内容特性对所述语音序列进行语句划分,并将划分得到的语句划分为多个丢帧事件;
    提取单元,用于根据划分单元划分的丢帧事件提取非语音参数,所述非语音参数包括位置参数和离散分布参数;
    评估单元,用于根据提取单元提取的非语音参数按照预置的语音质量评估模型对每条语句的语音质量进行评估,得到每条语句的语音质量,根据所述每 条语句的语音质量评估所述语音序列的语音质量。
  14. 根据权利要求13所述的网络语音质量评估装置,其特征在于,
    所述解析单元,具体用于对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载;
    所述确定单元,具体用于根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,分别根据所述语音序列的时长、语音序列的比特数和语音负载确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,以及确定后一个相邻的未丢失帧的标记,根据所述前一个相邻的未丢失帧的帧内容特性、后一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的标记确定所述丢失帧部分的帧内容特性。
  15. 根据权利要求14所述的网络语音质量评估装置,其特征在于,
    所述确定单元,具体用于获取未丢失帧的实际有效载荷长度;根据所述语音负载、语音序列的比特数和语音序列的时长确定码率;若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度一致,则确定所述未丢失帧为语音帧;若所述码率所对应的标准有效载荷长度与所述实际有效载荷长度不一致,则确定所述未丢失帧为静音帧。
  16. 根据权利要求15所述的网络语音质量评估装置,其特征在于,
    所述确定单元,具体用于若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,或后一个相邻的未丢失帧的标记指示所述后一个相邻的未丢失帧为第一个语音帧,则确定所述丢失帧部分为静音帧,否则,确定所述丢失帧部分为语音帧。
  17. 根据权利要求16所述的网络语音质量评估装置,其特征在于,所述语音帧包括关键语音帧和非关键语音帧,则:
    所述确定单元,具体用于在确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为语音帧时,确定所述丢失帧部分为关键语音帧;在确定所述前一个相邻的未丢失帧为语音帧,以及所述后一个相邻的未丢失帧为静音帧时,确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;在确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为语音帧时,确定所述丢失帧部分的前一半部分为非关键 语音帧,所述丢失帧部分的后一半部分为关键语音帧。
  18. 根据权利要求13所述的网络语音质量评估装置,其特征在于,
    所述解析单元,具体用于对所述数据包的包头进行解析,得到解析结果,所述解析结果包括语音序列的时长、语音序列的比特数、丢帧位置和语音负载;根据所述语音负载进行自适应多速率AMR解码,得到AMR解码后语音信号;根据所述语音序列的时长和语音序列的比特数计算所述AMR解码后语音信号中每一帧的帧能量和平均帧能量;
    所述确定单元,具体用于根据所述丢帧位置在所述数据包中确定当前需要检测的丢失帧部分,根据计算出的帧能量和平均帧能量确定所述丢失帧部分的前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性,根据所述前一个相邻的未丢失帧的帧内容特性和后一个相邻的未丢失帧的帧内容特性确定所述丢失帧部分的帧内容特性。
  19. 根据权利要求18所述的网络语音质量评估装置,其特征在于,
    所述确定单元,具体用于若所述未丢失帧的帧能量小于等于0,则确定所述未丢失帧为静音帧;若所述未丢失帧的帧能量大于0小于平均帧能量,则确定所述未丢失帧为非关键语音帧;若所述未丢失帧的帧能量大于平均帧能量,则确定所述未丢失帧为关键语音帧。
  20. 根据权利要求19所述的网络语音质量评估装置,其特征在于,所述确定单元,具体用于:
    若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为静音帧,则确定所述丢失帧部分为静音帧;
    若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为关键语音帧,则确定所述丢失帧部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧和后一个相邻的未丢失帧均为非关键语音帧,则确定所述丢失帧部分为非关键语音帧;
    若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分的前一半部分为关键语音帧,所述丢失帧部分的后一半部分为非关键语音帧;
    若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢 失帧为关键语音帧,则确定所述丢失帧部分的前一半部分为非关键语音帧,所述丢失帧部分的后一半部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧为关键语音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻的未丢失帧为关键语音帧,则确定所述丢失帧部分为关键语音帧;
    若确定所述前一个相邻的未丢失帧为非关键语音帧,以及所述后一个相邻的未丢失帧为静音帧,则确定所述丢失帧部分为非关键语音帧;
    若确定所述前一个相邻的未丢失帧为静音帧,以及所述后一个相邻的未丢失帧为非关键语音帧,则确定所述丢失帧部分为非关键语音帧。
  21. 根据权利要求16、17和20所述的网络语音质量评估装置,其特征在于,
    所述划分单元,具体用于确定静音帧连续出现的帧数超过预置次数时,将所述静音帧之前的语音序列划分为语句;确定所述语句中相邻两次丢帧部分的距离小于等于预置距离时,将所述相邻两次丢帧部分确定为一次丢帧事件;确定所述语句中相邻两次丢帧部分的距离大于预置距离时,将所述相邻两次丢帧部分确定为两次丢帧事件。
  22. 根据权利要求16、17和20所述的网络语音质量评估装置,其特征在于,
    所述评估单元,具体用于根据所述非语音参数按照预置的语音质量评估模型对所述丢帧事件进行失真映射,得到丢失的语音帧总数;根据所述丢失的语音帧总数计算语句的语音质量。
  23. 根据权利要求22所述的网络语音质量评估装置,其特征在于,所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度;则所述评估单元,具体用于:
    确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
    确定离散丢帧时,根据所述语音帧丢帧次数、语音帧一次丢失长度、以及损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧 次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数;或者,
    确定离散丢帧时,根据所述语音帧一次丢失长度和损伤长度将丢帧事件中的损伤帧映射成丢失的语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的语音帧数将所述丢帧事件映射成丢失的语音帧总数。
  24. 根据权利要求22所述的网络语音质量评估装置,其特征在于,所述非语音参数包括非关键语音帧和关键语音帧之间的距离、语音帧丢帧次数、平均丢失长度和平均损伤长度,则所述评估单元,具体用于:
    确定连续丢帧时,根据所述非关键语音帧和关键语音帧之间的距离将丢帧事件中的非关键语音帧映射成丢失的关键语音帧数,根据所述语音帧丢帧次数确定实际丢失的关键语音帧数,根据所述实际丢失的关键语音帧数和映射得到的丢失的关键语音帧数将所述丢帧事件映射成丢失的语音帧总数;
    确定离散丢帧时,根据所述平均丢失长度和平均损伤长度将所述丢帧事件映射成丢失的语音帧总数。
  25. 一种通信系统,其特征在于,包括权利要求13至24任一项所述的网络语音质量评估装置。
PCT/CN2014/089401 2014-05-05 2014-10-24 一种网络语音质量评估方法、装置和系统 WO2015169064A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP14891207.4A EP3091720A4 (en) 2014-05-05 2014-10-24 Network voice quality evaluation method, device and system
US15/248,079 US10284712B2 (en) 2014-05-05 2016-08-26 Voice quality evaluation method, apparatus, and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410186706.1 2014-05-05
CN201410186706.1A CN105100508B (zh) 2014-05-05 2014-05-05 一种网络语音质量评估方法、装置和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/248,079 Continuation US10284712B2 (en) 2014-05-05 2016-08-26 Voice quality evaluation method, apparatus, and system

Publications (1)

Publication Number Publication Date
WO2015169064A1 true WO2015169064A1 (zh) 2015-11-12

Family

ID=54392088

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/089401 WO2015169064A1 (zh) 2014-05-05 2014-10-24 一种网络语音质量评估方法、装置和系统

Country Status (4)

Country Link
US (1) US10284712B2 (zh)
EP (1) EP3091720A4 (zh)
CN (1) CN105100508B (zh)
WO (1) WO2015169064A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6586044B2 * 2016-05-13 2019-10-02 日本電信電話株式会社 Speech quality estimation device, speech quality estimation method, and program
CN108011686B * 2016-10-31 2020-07-14 腾讯科技(深圳)有限公司 Information coding frame loss recovery method and apparatus
US11276395B1 (en) * 2017-03-10 2022-03-15 Amazon Technologies, Inc. Voice-based parameter assignment for voice-capturing devices
EP3618061B1 (en) * 2018-08-30 2022-04-27 Tata Consultancy Services Limited Method and system for improving recognition of disordered speech
CN112291421B * 2019-07-24 2021-09-21 中国移动通信集团广东有限公司 One-way audio detection method and apparatus based on voice communication, storage medium, and electronic device
CN112767955B * 2020-07-22 2024-01-23 腾讯科技(深圳)有限公司 Audio coding method and apparatus, storage medium, and electronic device
WO2022155844A1 * 2021-01-21 2022-07-28 华为技术有限公司 Video transmission quality evaluation method and apparatus

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002162998A * 2000-11-28 2002-06-07 Fujitsu Ltd Speech coding method with packet repair processing
ATE442643T1 2003-01-21 2009-09-15 Psytechnics Ltd Method and apparatus for determining the quality of an audio signal
WO2006136900A1 (en) * 2005-06-15 2006-12-28 Nortel Networks Limited Method and apparatus for non-intrusive single-ended voice quality assessment in voip
US7856355B2 (en) 2005-07-05 2010-12-21 Alcatel-Lucent Usa Inc. Speech quality assessment method and system
KR101379417B1 2006-04-28 2014-03-28 에어마그네트, 인코포레이티드 Voice quality measurement of Internet telephony in a wireless local area network
CN101188525B * 2007-11-27 2011-10-26 成都市华为赛门铁克科技有限公司 Voice stream processing method and apparatus
CN102057634B 2008-06-11 2015-01-28 日本电信电话株式会社 Audio quality estimation method and audio quality estimation device
KR101585208B1 * 2008-07-02 2016-01-13 삼성전자주식회사 QoS control system and method for VoIP media packets received from a broadband port in an integrated routing and gateway VoIP system
WO2010070840A1 * 2008-12-17 2010-06-24 日本電気株式会社 Voice detection device, voice detection program, and parameter adjustment method
US20120281589A1 (en) 2010-01-25 2012-11-08 Nec Corporation Audio quality measurement apparatus, audio quality measurement method, and program
CN102340426A 2010-07-26 2012-02-01 中国移动通信集团重庆有限公司 Method and apparatus for evaluating VoIP voice quality
ES2435673T3 2011-05-16 2013-12-20 Deutsche Telekom Ag Parametric audio quality model for IPTV services
CN102496372A 2011-12-15 2012-06-13 中国传媒大学 Objective evaluation method for low-bit-rate audio quality based on nonlinear parameter fitting
US20130191120A1 (en) * 2012-01-24 2013-07-25 Broadcom Corporation Constrained soft decision packet loss concealment
CN103839554A * 2012-11-26 2014-06-04 华为技术有限公司 Voice quality evaluation method and apparatus
KR20140067512A * 2012-11-26 2014-06-05 삼성전자주식회사 Signal processing apparatus and signal processing method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070008899A1 (en) * 2005-07-06 2007-01-11 Shim Choon B System and method for monitoring VoIP call quality
CN103632679A * 2012-08-21 2014-03-12 华为技术有限公司 Audio stream quality evaluation method and apparatus
CN103632680A * 2012-08-24 2014-03-12 华为技术有限公司 Voice quality evaluation method, network element, and system
CN103716470A * 2012-09-29 2014-04-09 华为技术有限公司 Voice quality monitoring method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3091720A4 *

Also Published As

Publication number Publication date
EP3091720A1 (en) 2016-11-09
CN105100508B (zh) 2018-03-09
US10284712B2 (en) 2019-05-07
US20160366274A1 (en) 2016-12-15
CN105100508A (zh) 2015-11-25
EP3091720A4 (en) 2017-05-03

Legal Events

Date Code Title Description
121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14891207; Country of ref document: EP; Kind code of ref document: A1)
REEP Request for entry into the european phase (Ref document number: 2014891207; Country of ref document: EP)
WWE  Wipo information: entry into national phase (Ref document number: 2014891207; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)