WO2019228306A1 - Method and device for aligning speech - Google Patents

Method and device for aligning speech

Info

Publication number
WO2019228306A1
Authority
WO
WIPO (PCT)
Prior art keywords
test
sentence
voice
original
speech
Prior art date
Application number
PCT/CN2019/088591
Other languages
English (en)
French (fr)
Inventor
秦臻
叶强
田光见
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP19811458.9A (EP3764361B1)
Priority to FIEP19811458.9T (FI3764361T3)
Publication of WO2019228306A1
Priority to US17/068,131 (US11631397B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/26 Arrangements for supervision, monitoring or testing with means for applying test signals or for measuring
    • H04M3/28 Automatic routine testing; Fault testing; Installation testing; Test methods, test equipment or test arrangements therefor
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2236 Quality of speech transmission monitoring
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Definitions

  • the present application relates to the audio field, and in particular, to a method and device for aligning speech.
  • Recognizing abnormal speech in a communication network is one of the problems communication operators face when resolving abnormal speech issues. Because of user privacy protection policies, operation and maintenance engineers can only identify abnormal speech afterwards through repeated dial-up tests, and only then reproduce the abnormal speech scenario and verify the effect after the problem is fixed.
  • One method to improve the efficiency of recognizing problematic speech is to input the original speech and the test speech into an algorithm model and use the model to identify abnormal phenomena in the test speech.
  • before doing so, it is necessary to perform an alignment operation on the original speech and the test speech, that is, to align their starting and terminating time domain positions; otherwise the results may contain large errors, which would have to be overcome with multiple algorithms and repeated processing.
  • This application provides a method and device for aligning speech. Before the original speech and the test speech are aligned, it is first detected whether the test speech exhibits speech loss and/or speech discontinuity, and an appropriate algorithm is selected according to the detection result to align the test speech and the original speech, which can improve the efficiency of aligning speech.
  • a method for aligning speech includes: obtaining original speech and test speech, the test speech being generated after the original speech is transmitted over a communication network; performing missing detection and/or intermittent detection on the test speech, where the missing detection determines whether speech is missing from the test speech relative to the original speech, and the intermittent detection determines whether the test speech is interrupted relative to the original speech; and aligning the test speech and the original speech according to the results of the missing detection and/or intermittent detection to obtain aligned test speech and aligned original speech, where the results of the missing detection and/or intermittent detection indicate the manner of aligning the test speech and the original speech.
  • for example, if the detection result is that the test speech has neither missing speech nor long speech discontinuity, the sentences contained in the original speech and the sentences contained in the test speech can be aligned in order so as to further determine whether other anomalies exist in the test speech; here, the absence of long speech discontinuity means that the delay of each test speech sentence relative to the corresponding original speech sentence is less than a delay threshold. For example, if the detection result is that the first sentence of the test speech is missing, the sentences of the original speech other than the first sentence can be aligned in order with the sentences of the test speech to further determine whether other anomalies exist in the test speech. For example, if the detection result is that no speech is missing but a long speech discontinuity exists, that is, the delay of a test speech sentence relative to the original speech sentence is greater than the delay threshold, the delay threshold can be increased to facilitate further anomaly detection.
  • the method determines how to align the speech based on the results of the missing detection and/or intermittent detection, and can therefore use the most appropriate alignment method for the specific situation of the test speech, thereby improving the efficiency of speech alignment.
  • optionally, the original speech includes a first original sentence and the test speech includes a first test sentence, where the first original sentence corresponds to the first test sentence; in this case, aligning the test speech and the original speech according to the results of the missing detection and/or intermittent detection includes:
  • when the starting time domain position of the first test sentence is before the starting time domain position of the first original sentence, inserting a first silent sentence before the first test sentence; when the starting time domain position of the first test sentence is after the starting time domain position of the first original sentence, deleting a second silent sentence before the first test sentence; the duration of the first silent sentence or the second silent sentence is equal to the time difference between the starting time domain position of the first test sentence and the starting time domain position of the first original sentence.
  • a piece of speech can be divided into multiple sentences.
  • Each sentence is a set of frames whose amplitude value exceeds a preset amplitude threshold.
  • a silent period may be a segment in which no speech is detected; it may also be a set of at least one frame whose amplitude value is less than a preset amplitude threshold. For example, a silent period is the pause between two sentences.
  • when the starting time domain position of the first test sentence is earlier than that of the first original sentence, a silent segment, the first silent sentence, is inserted before the first test sentence; its duration equals the time difference between the starting time domain position of the first test sentence and the starting time domain position of the first original sentence. When the starting time domain position of the first test sentence is later than that of the first original sentence, a silent segment, the second silent sentence, is deleted before the first test sentence; its duration likewise equals that time difference. In this way each sentence of the original speech is aligned with the corresponding sentence of the test speech.
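  • as a rough illustration only, the following Python sketch pads or trims leading silence to align a test sentence's start with the original sentence's start; the 8 kHz rate, numpy representation, and all names are assumptions, not the patent's reference code:
```python
# Minimal sketch of start-position alignment: insert silence when the test
# sentence starts too early, delete leading silence when it starts too late.
import numpy as np

RATE = 8000  # assumed sampling rate (samples per second)

def align_sentence_start(test: np.ndarray, test_start_s: float,
                         orig_start_s: float) -> np.ndarray:
    """Shift the test speech so its sentence start matches the original's."""
    offset = int(round((orig_start_s - test_start_s) * RATE))
    if offset > 0:   # test sentence starts earlier: insert a "first silent sentence"
        return np.concatenate([np.zeros(offset, dtype=test.dtype), test])
    if offset < 0:   # test sentence starts later: delete a "second silent sentence"
        return test[-offset:]
    return test
```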
  • optionally, aligning the test speech and the original speech according to the results of the missing detection and/or intermittent detection further includes:
  • At least two test sentences are determined according to a silent period in the test speech, and the at least two test sentences include a first test sentence, wherein the silent period in the test speech is used to indicate a division position of the at least two test sentences.
  • the silent period may also be called a silent sentence or silent speech; it refers to an audio segment in which no speech activity is detected, or to a set of at least one frame whose amplitude value is less than a preset amplitude threshold, such as the segment corresponding to the pause between two sentences.
  • for example, if the 0–1 s portion of the test speech is a silent period, the device can perform abnormal speech recognition on the test speech from the 1 s position and no longer process the 0–1 s portion, thereby reducing the workload and improving the efficiency of identifying abnormal speech.
  • optionally, before the first silent sentence is inserted before the starting time domain position of the first test sentence, or before the second silent sentence is deleted before the starting time domain position of the first test sentence, aligning the test speech and the original speech according to the results of the missing detection and/or intermittent detection further includes:
  • determining the first sub-test sentence and the second sub-test sentence according to a trough of the first test sentence, where the trough is a speech segment in the first test sentence whose average frame amplitude is less than or equal to an amplitude threshold, and the trough indicates the division position between the first sub-test sentence and the second sub-test sentence;
  • determining the first sub-original sentence according to a cross-correlation coefficient and the first sub-test sentence, where the cross-correlation coefficient indicates the similarity between a speech segment of the first original sentence and the first sub-test sentence, and the first sub-original sentence is a speech segment of the first original sentence.
  • the trough may be a short pause in the middle of a sentence; the first test sentence can therefore be divided into at least two sub-test sentences according to the trough, and the sub-test sentences aligned separately, so that the alignment of the first test sentence with the first original sentence is more accurate, which helps improve the accuracy of subsequent abnormal speech recognition.
  • optionally, aligning the first sub-test sentence with the first sub-original sentence according to a time offset of the first time domain position relative to the second time domain position, using the time domain position of the first sub-original sentence as a reference position, includes: when the time offset is less than or equal to the delay threshold, aligning the first sub-test sentence with the first sub-original sentence according to that time offset, using the time domain position of the first sub-original sentence as the reference position.
  • when the time offset is greater than the delay threshold, the first sub-test sentence lags the first sub-original sentence considerably, and the lag is likely caused by missing speech or a long speech discontinuity; alignment of the first sub-test sentence can then be skipped and the abnormal result output directly. When the time offset is less than the delay threshold, the lag of the first sub-test sentence relative to the first sub-original sentence is small; it may be caused by a short speech discontinuity, or it may be a normal delay introduced by transmission over the communication network, so the first sub-test sentence can be aligned to further determine whether it has other anomalies. This method decides whether to perform intra-sentence alignment according to the actual situation, improving the flexibility of sentence alignment.
  • optionally, aligning the test voice and the original voice according to the results of the missing detection and/or intermittent detection further includes: when the termination time domain position of the test voice is earlier than that of the original voice, adding a third silent sentence after the termination time domain position of the test voice, the duration of the third silent sentence being equal to the time difference between the termination time domain position of the test voice and the termination time domain position of the original voice.
  • optionally, aligning the test voice and the original voice according to the results of the missing detection and/or intermittent detection further includes: when the termination time domain position of the test voice is later than that of the original voice, deleting a fourth silent sentence before the termination time domain position of the test voice, the duration of the fourth silent sentence being equal to the time difference between the termination time domain position of the test voice and the termination time domain position of the original voice.
  • because alignment may be performed inside a test sentence, the duration of the test sentence can change; therefore, after each sentence of the original voice has been aligned with each sentence of the test voice, a time deviation may remain between the termination time domain positions of the original voice and the test voice. The method above aligns those termination time domain positions.
  • optionally, before the test voice and the original voice are aligned according to the results of the missing detection and/or intermittent detection, the method further includes:
  • the original voice and the test voice are detected according to a preset abnormal voice detection model to determine whether the test voice belongs to the abnormal voice.
  • the preset abnormal voice detection model is a non-machine-learning model; the content detected by the non-machine-learning model is different from the content detected by the missing detection, and/or the content detected by the non-machine-learning model is different from the content detected by the intermittent detection.
  • the method further includes:
  • the preset abnormal speech detection models are usually common abnormal speech detection models. These models are highly targeted and can quickly detect one or more common types of abnormal speech, but they cannot detect uncommon abnormal speech and may miss some common abnormal speech. According to the solution provided in this embodiment, a preset abnormal speech detection model is first used to detect common abnormal speech, and a machine learning model is then used to detect uncommon abnormal speech and, at the same time, to detect common abnormal speech again, which can improve the success rate of abnormal speech detection.
  • a device for aligning speech can implement the functions corresponding to the steps in the method according to the first aspect; the functions can be implemented by hardware, or by hardware executing the corresponding software.
  • the hardware or software includes one or more units or modules corresponding to the functions described above.
  • the device includes a processor and a communication interface, and the processor is configured to support the device to perform a corresponding function in the method according to the first aspect.
  • the communication interface is used to support communication between the device and other network elements.
  • the device may also include a memory for coupling to the processor, which stores program instructions and data necessary for the device.
  • a computer-readable storage medium stores computer program code which, when executed by a processing unit or a processor, causes the apparatus for aligning speech to perform the method described in the first aspect.
  • a chip stores instructions which, when run on a device for aligning speech, cause the chip to execute the method of the first aspect.
  • a computer program product includes computer program code which, when run by the communication unit or communication interface and the processing unit or processor of a device for aligning speech, causes the device to perform the method of the first aspect.
  • FIG. 1 is a schematic structural diagram of an abnormal speech recognition system provided by the present application.
  • FIG. 2 is a schematic flowchart of a speech alignment method provided by the present application.
  • FIG. 3 is a schematic diagram of a method for aligning a test sentence and an original sentence provided by the present application.
  • FIG. 4 is a schematic diagram of another method for aligning a test sentence and an original sentence provided by the present application.
  • FIG. 5 is a schematic diagram of still another method for aligning a test sentence and an original sentence provided by the present application.
  • FIG. 6 is a schematic diagram of another method for aligning a test sentence and an original sentence provided by the present application.
  • FIG. 7 is a schematic diagram of a sentence division method provided by the present application.
  • FIG. 8 is a schematic diagram of a method for dividing sub-sentences provided by the present application.
  • FIG. 9 is a schematic flowchart of an abnormal speech recognition method provided by the present application.
  • FIG. 10 is a schematic flowchart of an abnormal voice detection module provided by the present application.
  • FIG. 11 is a schematic diagram of a content deletion exception provided by the present application.
  • FIG. 12 is a schematic diagram of another content deletion exception provided by the present application.
  • FIG. 13 is a schematic diagram of still another content deletion exception provided by the present application.
  • FIG. 15 is a schematic flowchart of a voice preprocessing module provided by the present application.
  • FIG. 16 is a schematic structural diagram of an abnormal speech recognition module provided by the present application.
  • FIG. 17 is a schematic diagram of a training process based on a machine learning model provided by the present application.
  • FIG. 18 is a schematic diagram of a detection process based on a machine learning model provided by the present application.
  • FIG. 20 is a schematic diagram of a voice to be detected provided by the present application.
  • FIG. 21 is a schematic diagram of a sentence division result provided by the present application.
  • FIG. 23 is a schematic diagram of another to-be-detected voice provided by the present application.
  • FIG. 24 is a schematic structural diagram of a device for aligning speech provided in the present application.
  • FIG. 25 is a schematic structural diagram of another apparatus for aligning speech provided by the present application.
  • Abnormal speech refers to the phenomenon in which users subjectively perceive poor speech quality during a call.
  • Common abnormal speech includes at least one of the following phenomena:
  • the normal voice is mixed with interfering sounds during the call, such as metallic sounds or running-water sounds, causing listening discomfort for the user.
  • FIG. 1 shows a schematic diagram of an abnormal speech recognition system suitable for the present application.
  • the system 100 includes:
  • the voice input module 110 is configured to convert a sampling rate of the input voice.
  • the voice input module 110 may convert the sampling rate of the original voice and the sampling rate of the test voice into the same sampling rate, the test voice being the voice obtained after the original voice is transmitted through the communication network.
  • for example, if the test voice uses a 16 kHz sampling rate and the original voice uses an 8 kHz sampling rate, the voice input module 110 can downsample the test voice to 8 kHz and then input the original voice and the test voice into the abnormal voice detection module 121.
  • the abnormal voice detection module 121 detects whether the test voice has an abnormality, and the specific type of the abnormality, based on non-machine-learning models such as an acoustic echo detection model, a discontinuity detection model, and a background noise detection model; example abnormality types are low quality, intermittence, and noise.
  • the voice pre-processing module 122 is configured to align the test voice and the original voice, so as to facilitate subsequent abnormal voice detection.
  • aligned speech refers to aligning the starting time domain position and the terminating time domain position of two pieces of speech. Because the speech segments of two aligned pieces of speech correspond one to one, it is easier to recognize abnormal speech when the two pieces are compared.
  • the abnormal speech recognition module 123 detects whether the test speech is in an abnormal situation and a specific type of the abnormal situation based on a machine learning model.
  • the machine learning model is, for example, a random forest model and a deep neural network model.
  • the abnormal speech detection module 121, the speech preprocessing module 122, and the abnormal speech recognition module 123 may be independent modules or integrated modules; for example, the abnormal speech detection module 121, the speech preprocessing module 122, and the abnormal speech recognition module 123 may be integrated in the core abnormal speech recognition device 120.
  • the merge output module 130 merges and outputs the results of the abnormal voice detection module 121 and the abnormal voice recognition module 123 to detect the abnormal voice.
  • the merge processing means that identical results from the abnormal voice detection module 121 and the abnormal voice recognition module 123 are combined into one result. For example, if both modules detect noise in the test voice, the merge output module 130 combines the two identical detected anomalies (noise) and outputs only one voice anomaly. For another example, if the abnormal voice detection module 121 detects that the test voice is intermittent and the abnormal voice recognition module 123 recognizes noise in the test voice, the merge output module 130 may output two voice anomalies: the test voice is intermittent and the test voice is noisy.
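  • a minimal sketch of this merge logic, assuming each module reports a list of anomaly labels (the function name and labels are illustrative, not from the patent):
```python
# Identical anomaly labels from the two modules collapse into one result;
# distinct labels are all kept.
def merge_results(detected: list[str], recognized: list[str]) -> list[str]:
    merged: list[str] = []
    for label in detected + recognized:
        if label not in merged:      # duplicate results merge into one
            merged.append(label)
    return merged

# merge_results(["noise"], ["noise"])        -> ["noise"]
# merge_results(["intermittent"], ["noise"]) -> ["intermittent", "noise"]
```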
  • the system 100 is only an example of an abnormal speech recognition system applicable to the present application.
  • the abnormal speech recognition system applicable to the present application may have more or fewer modules than the system 100; for example, it may further include a display module, or it may omit the merge output module 130.
  • the abnormal speech recognition method provided in the present application will now be described in detail, taking the abnormal speech recognition system 100 shown in FIG. 1 as an example.
  • FIG. 2 shows a schematic flowchart of a method for aligning speech provided in the present application.
  • the method 200 includes:
  • S210, obtaining the original voice and the test voice, may be performed by the abnormal voice detection module 121 or the abnormal voice recognition module 123.
  • obtaining the original voice and the test voice may mean receiving them from the voice input module 110, in which case the original voice and the test voice from the voice input module 110 may have the same sampling rate.
  • the above-mentioned obtaining of the original voice and the test voice may also obtain voices with different sampling rates through other modules.
  • missing speech is one of the above-mentioned low-quality abnormal speech phenomena.
  • the missing detection and/or intermittent detection on the test speech may be performed by the abnormal speech detection module 121 or the abnormal speech recognition module 123.
  • for the specific detection methods, refer to the prior art; details of the missing detection method and the intermittent detection method are omitted here.
  • S230: align the test voice and the original voice according to the results of the missing detection and/or intermittent detection to obtain the aligned test voice and the aligned original voice, where the results of the missing detection and/or intermittent detection indicate the manner of aligning the test voice and the original voice.
  • S230 may be performed by the voice preprocessing module 122. For example, if the detection result is that the test voice has neither missing speech nor long speech discontinuity, the sentences included in the original voice and the sentences included in the test voice can be aligned in order; here, the absence of long speech discontinuity means that the delay of each test voice sentence relative to the corresponding original voice sentence is less than the delay threshold. For example, if the detection result is that the first sentence of the test voice is missing, the sentences of the original voice other than the first sentence can be aligned in turn with the sentences of the test voice. For example, if the detection result is that no speech is missing but a long speech discontinuity exists, that is, the delay of a test voice sentence relative to the original voice sentence is greater than the delay threshold, then, to further determine whether other anomalies exist in the test voice, the delay threshold can be increased to facilitate further anomaly detection, or the test voice and the original voice can be left unaligned and the abnormal result output directly.
  • the method determines how to align the speech based on the results of the missing detection and/or intermittent detection, and can therefore use the most appropriate alignment method for the specific situation of the test speech, thereby improving the efficiency of speech alignment.
  • the method 200 may be implemented by program code running on a general-purpose processor, by a special-purpose hardware device, or by a combination of software and hardware (program code plus a special-purpose hardware device).
  • optionally, the original speech includes a first original sentence and the test speech includes a first test sentence, where the first original sentence corresponds to the first test sentence; in this case S230 includes:
  • when the starting time domain position of the first test sentence is before the starting time domain position of the first original sentence, inserting a first silent sentence before the first test sentence; when the starting time domain position of the first test sentence is after the starting time domain position of the first original sentence, deleting a second silent sentence before the first test sentence; the duration of the first silent sentence or the second silent sentence is equal to the time difference between the starting time domain position of the first test sentence and the starting time domain position of the first original sentence.
  • a piece of speech can be divided into multiple sentences.
  • Each sentence is a set of frames whose amplitude value exceeds a preset amplitude threshold.
  • a silent period may be a segment in which no speech is detected; it may also be a set of at least one frame whose amplitude value is less than a preset amplitude threshold. For example, a silent period is the pause between two sentences.
  • the first original sentence is any sentence in the original speech; the original speech may include only the first original sentence, or may also include sentences other than the first original sentence. Similarly, the test speech may include only the first test sentence, or may also include sentences other than the first test sentence.
  • the starting time domain positions of the original voice and the test voice are first aligned.
  • when the starting time domain position of the first test sentence is earlier than that of the first original sentence, a silent segment, the first silent sentence, is inserted before the first test sentence; its duration equals the time difference between the starting time domain position of the first test sentence and the starting time domain position of the first original sentence. When the starting time domain position of the first test sentence is later than that of the first original sentence, a silent segment, the second silent sentence, is deleted before the first test sentence; its duration likewise equals that time difference. In this way each sentence of the original speech is aligned with the corresponding sentence of the test speech.
  • the above "insertion" means adding a segment of silent speech at any time domain position before the starting time domain position of the first test sentence, so that the first test sentence as a whole moves a distance along the time axis. For example, suppose the starting time domain positions of the original speech and the test speech are both 0 seconds (s), that is, already aligned; the starting time domain position of the first original sentence is 10 s; and the starting time domain position of the first test sentence is 5 s, that is, the first test sentence starts before the first original sentence. In this case a 5 s silent segment can be inserted anywhere in the 0–5 s portion of the test speech, moving the start of the first test sentence to 10 s and aligning it with the first original sentence.
  • FIG. 3 to FIG. 6 respectively show several methods of aligning test sentences and original sentences provided by the present application.
  • when the position (first position) of the first test sentence in the test speech lags the position (second position) of the first original sentence in the original speech, a silent segment can be deleted before the first test sentence in order to align the first original sentence with the first test sentence. The deleted silent segment is called the second silent sentence, and its length is equal to the delay of the first position relative to the second position, thereby aligning the first original sentence with the first test sentence.
  • in the following, "time domain position" is sometimes simply referred to as "position".
  • when the position (first position) of the first test sentence in the test speech is ahead of the position (second position) of the first original sentence in the original speech by some period of time, a silent segment can be added before the first test sentence. The added silent segment is called the first silent sentence, and its length is equal to the time by which the first position is ahead of the second position, thereby aligning the first original sentence and the first test sentence.
  • suppose the first test sentence is the last sentence of the test voice, the first original sentence is the last sentence of the original voice, and the first test sentence has been aligned with the starting time domain position of the first original sentence. If intra-sentence alignment (for example, aligning the sub-sentences of the first test sentence with the sub-sentences of the first original sentence) lengthens the first test sentence, the termination position of the test voice falls after the termination position of the original voice. To align the test voice with the original voice, a silent segment can be deleted before the termination position of the test voice; the deleted silent segment is called the fourth silent sentence, and its length is equal to the delay of the termination position of the test voice relative to the termination position of the original voice, thereby aligning the original voice with the test voice.
  • similarly, suppose the first test sentence is the last sentence of the test voice, the first original sentence is the last sentence of the original voice, and the first test sentence has been aligned with the starting time domain position of the first original sentence. If intra-sentence alignment shortens the first test sentence, the termination position of the test voice falls before the termination position of the original voice. To align the test voice with the original voice, a silent segment can be added after the termination position of the test voice; the added silent segment is called the third silent sentence, and its length is equal to the time by which the termination position of the test voice is ahead of the termination position of the original voice.
  • optionally, before the first silent sentence is inserted before the starting time domain position of the first test sentence, or before the second silent sentence is deleted before the starting time domain position of the first test sentence, S230 further includes:
  • At least two original sentences are determined according to a silent period in the original speech, and the at least two original sentences include a first original sentence, wherein the silent period in the original speech is used to indicate a division position of the at least two original sentences.
  • At least two test sentences are determined according to a silent period in the test speech, and the at least two test sentences include a first test sentence, wherein the silent period in the test speech is used to indicate a division position of the at least two test sentences.
  • the silent period may also be called a silent sentence or silent speech; it refers to an audio segment in which no speech activity is detected, or to a set of at least one frame whose amplitude value is less than a preset amplitude threshold, such as the segment corresponding to the pause between two sentences.
  • for example, if the 0–1 s portion of the test speech is a silent period, the device can perform abnormal speech recognition on the test speech from the 1 s position and no longer process the 0–1 s portion, thereby reducing the workload and improving the efficiency of identifying abnormal speech.
  • FIG. 7 shows a schematic diagram of a sentence division method provided by the present application.
  • a voice activity detection (VAD) algorithm can be used to divide speech into sentences.
  • for example, the VAD algorithm can be configured so that any span containing at least 300 ms of continuous voice activity is divided into one segment. The voice shown in FIG. 7 includes three spans whose continuous voice activity exceeds 300 ms, so the voice is divided into three segments;
  • adjacent segments S_i and S_(i+1) may then be combined into one sentence; the segment-division rule is sketched below.
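  • a Python sketch of this energy-based division; the frame length, amplitude threshold, and sampling rate are illustrative assumptions (a real VAD would be more elaborate), while the 300 ms rule comes from the description above:
```python
# Frames above an amplitude threshold count as voice activity; runs of at
# least 300 ms of continuous activity become segments.
import numpy as np

RATE = 8000                      # assumed sampling rate
FRAME = int(0.02 * RATE)         # 20 ms frames (assumed)
AMP_THRESHOLD = 200              # assumed amplitude threshold
MIN_ACTIVE_MS = 300              # minimum continuous activity per segment

def divide_sentences(speech: np.ndarray) -> list[tuple[int, int]]:
    """Return (start, end) frame indices of each detected segment."""
    n_frames = len(speech) // FRAME
    active = [np.abs(speech[i * FRAME:(i + 1) * FRAME]).mean() > AMP_THRESHOLD
              for i in range(n_frames)]
    segments, start = [], None
    for i, is_active in enumerate(active + [False]):  # sentinel flushes last run
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            if (i - start) * FRAME * 1000 / RATE >= MIN_ACTIVE_MS:
                segments.append((start, i))
            start = None
    return segments
```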
  • optionally, before the first silent sentence is inserted before the starting time domain position of the first test sentence, or before the second silent sentence is deleted before the starting time domain position of the first test sentence, S230 further includes:
  • determining the first sub-test sentence and the second sub-test sentence according to a trough of the first test sentence, where the trough is a speech segment in the first test sentence whose average frame amplitude is less than or equal to an amplitude threshold, and the trough indicates the division position between the first sub-test sentence and the second sub-test sentence;
  • determining the first sub-original sentence according to a cross-correlation coefficient and the first sub-test sentence, where the cross-correlation coefficient indicates the similarity between a speech segment of the first original sentence and the first sub-test sentence, and the first sub-original sentence is a speech segment of the first original sentence.
  • the first time domain position is the time domain position of the first sub-test sentence in the first test sentence, and the second time domain position is the time domain position of the first sub-original sentence in the first original sentence;
  • the trough may be a short pause in the middle of a sentence; the first test sentence can therefore be divided into at least two sub-test sentences according to the trough, and the sub-test sentences aligned separately, so that the alignment of the first test sentence with the first original sentence is more accurate, which helps improve the accuracy of subsequent abnormal speech recognition.
  • optionally, aligning the first sub-test sentence with the first sub-original sentence according to the time offset of the first time domain position relative to the second time domain position, using the time domain position of the first sub-original sentence as a reference position, includes: when the time offset is less than or equal to the delay threshold, aligning the first sub-test sentence with the first sub-original sentence according to that time offset, using the time domain position of the first sub-original sentence as the reference position.
  • when the time offset is greater than the delay threshold, the first sub-test sentence lags the first sub-original sentence considerably, and the lag is likely caused by missing speech or a long speech discontinuity; alignment of the first sub-test sentence can then be skipped and the abnormal result output directly. When the time offset is less than the delay threshold, the lag of the first sub-test sentence relative to the first sub-original sentence is small; it may be caused by a short speech discontinuity, or it may be a normal delay introduced by transmission over the communication network, so the first sub-test sentence can be aligned to further determine whether it has other anomalies. This method decides whether to perform intra-sentence alignment according to the actual situation, improving the flexibility of sentence alignment.
  • FIG. 8 shows a schematic diagram of a method for dividing sub-sentences provided by the present application.
  • the first test sentence is divided into frames with a frame length of 20 ms and a frame shift of 10 ms, and the average amplitude of the speech waveform in each frame is calculated. If the average amplitude of a frame is less than 200, the frame is regarded as a trough. Taking the troughs as division points, the first test sentence is divided into several sub-sentences (i.e., sub-test sentences).
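  • a Python sketch of this trough-based division; the 20 ms/10 ms framing and the threshold of 200 come from the description above, while the sampling rate and function name are illustrative assumptions:
```python
# Split one test sentence into sub-test sentences at its troughs: a frame
# whose mean absolute amplitude is below 200 counts as a trough and acts
# as a division point.
import numpy as np

RATE = 8000                       # assumed sampling rate
FRAME = int(0.02 * RATE)          # 20 ms frame length
SHIFT = int(0.01 * RATE)          # 10 ms frame shift
TROUGH_AMP = 200                  # trough threshold from the description

def split_at_troughs(sentence: np.ndarray) -> list[np.ndarray]:
    is_trough = [np.abs(sentence[s:s + FRAME]).mean() < TROUGH_AMP
                 for s in range(0, len(sentence) - FRAME + 1, SHIFT)]
    subs, seg_start = [], 0
    for i, trough in enumerate(is_trough):
        if trough:                        # a trough ends the current sub-sentence
            end = i * SHIFT
            if end > seg_start:
                subs.append(sentence[seg_start:end])
            seg_start = end + FRAME       # continue after the trough frame
    if seg_start < len(sentence):
        subs.append(sentence[seg_start:])
    return subs
```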
  • a speech segment corresponding to the sub-test sentence x_i (x_i is any sub-test sentence in the first test sentence) is located in the first original sentence y, and the delay δ_i of x_i relative to that speech segment is calculated as follows:
  • δ_i = corr(x_i, y) − t_i, where corr(x_i, y) = argmax_{0 ≤ n ≤ M−N} Σ_{m=1}^{N} x_i(m)·y(m+n). Here corr(x_i, y) is the most similar position of the sub-test sentence x_i in the first original sentence y, computed using the cross-correlation coefficient, that is, the position in the first original sentence y of the speech segment corresponding to x_i; t_i denotes the offset of the sub-test sentence x_i within the first test sentence; n is the shift used when computing the cross-correlation; N is the time length of the sub-test sentence x_i; and M is the time length of the first original sentence y.
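  • the delay computation can be sketched directly in numpy; variable names follow the formula above, and this is an illustration under those assumptions, not reference code:
```python
# delta_i = corr(x_i, y) - t_i: slide x_i over y, pick the shift n with the
# largest correlation, then subtract x_i's offset t_i in the test sentence.
import numpy as np

def sub_sentence_delay(x_i: np.ndarray, y: np.ndarray, t_i: int) -> int:
    """Delay (in samples) of sub-test sentence x_i relative to its match in y."""
    N, M = len(x_i), len(y)
    corrs = [float(np.dot(x_i, y[n:n + N])) for n in range(M - N + 1)]
    best_n = int(np.argmax(corrs))    # corr(x_i, y): most similar position
    return best_n - t_i               # delta_i
```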
  • if δ_i is larger than a preset abnormal delay threshold (also referred to as the "delay threshold"), the delay of x_i relative to its sub-original sentence in the first original sentence is large; such a delay is likely caused by missing speech or a long speech discontinuity, so alignment of x_i may be skipped, or, to determine whether x_i has other anomalies, the abnormal delay threshold may be increased to facilitate further anomaly detection on x_i. If δ_i is less than the preset abnormal delay threshold, the delay of x_i relative to its sub-original sentence in the first original sentence is small; it may be caused by a short speech discontinuity or may be a normal delay introduced by communication network transmission, so alignment can be performed on x_i to further determine whether other anomalies exist in x_i.
  • S230 further includes:
  • when the termination time domain position of the test voice is earlier than that of the original voice, a third silent sentence is added after the termination time domain position of the test voice, its duration equal to the time difference between the termination time domain position of the test voice and the termination time domain position of the original voice;
  • when the termination time domain position of the test voice is later than that of the original voice, a fourth silent sentence is deleted before the termination time domain position of the test voice, its duration equal to the time difference between the termination time domain position of the test voice and the termination time domain position of the original voice.
  • because alignment may be performed inside a test sentence, the duration of the test sentence can change; therefore, after each sentence of the original voice has been aligned with each sentence of the test voice, a time deviation may remain between the termination time domain positions of the original voice and the test voice. The method above aligns those termination time domain positions.
  • the method 200 further includes:
  • the original voice and the test voice are detected according to a preset abnormal voice detection model to determine whether the test voice belongs to the abnormal voice.
  • the preset abnormal voice detection model is a non-machine-learning model; the content detected by the non-machine-learning model is different from the content detected by the missing detection, and/or the content detected by the non-machine-learning model is different from the content detected by the intermittent detection.
  • the preset abnormal speech detection models are usually common abnormal speech detection models (non-machine-learning models); these models are highly targeted and can quickly detect one or more common types of abnormal speech. The above steps can therefore quickly determine whether the test voice has common anomalies.
  • the method 200 further includes:
  • a preset abnormal voice detection model is first used to detect common abnormal voices, and a machine learning model is then used to detect the test voice, determining whether the test voice has an unknown abnormal phenomenon and/or an abnormal phenomenon not detected by the non-machine-learning model, which can improve the detection probability of abnormal phenomena in the test voice.
  • FIG. 9 is a schematic flowchart of an abnormal speech recognition method provided by the present application.
  • the pair of voices (original voice and test voice) input by the user is first converted by the voice input module 110.
  • the converted voices (original voice and test voice) are then transmitted to the abnormal voice detection module 121, which determines whether the test voice has abnormal problems such as muteness or low energy. If an abnormality is detected, the abnormality detection result is transmitted to the merge output module 130 as the final abnormality recognition result; if no abnormality is detected, the voices are transmitted on to the speech preprocessing module 122.
  • after sentence division, the speech processed by the speech preprocessing module 122 is transmitted back to the abnormal speech detection module 121.
  • for the second incoming voices, the abnormal voice detection module 121 judges whether the test voice has abnormal problems such as missing sentences or intermittence. If an abnormality is detected, the result is transmitted to the merge output module 130 as the final abnormality recognition result; if no abnormality is detected, the voices are transmitted to the speech preprocessing module 122, which performs time alignment on them and transmits the two aligned speech segments to the abnormal speech recognition module 123 for further recognition, whose result is output to the merge output module 130. Finally, the merge output module 130 merges the results of the abnormal voice detection module 121 and the abnormal voice recognition module 123 as the final detection result for the pair of voices.
  • the flow of abnormal detection performed by the abnormal voice detection module 121 is shown in FIG. 10.
  • the steps of the method shown in FIG. 10 include:
  • Mute judgment 1001: the VAD algorithm performs sliding-window detection on the two input voices and records the endpoints of each voice segment. If the algorithm detects no voice activity in the test voice but does detect voice activity in the original voice, the test voice is considered to have a mute anomaly, which is passed as the abnormality detection result to the merge output module 130; otherwise, low energy judgment 1002 is executed.
  • Low energy judgment 1002: if no mute abnormality was detected in the previous step, the loudness values of the original voice and the test voice are calculated in this step.
  • the loudness loss of the test voice relative to the original voice (test voice loudness minus original voice loudness) is input into classifier A for low-energy abnormality judgment. If the classification result of classifier A is abnormal, the test voice is considered to have a low-energy abnormality,
  • and the low-energy anomaly is passed to the merge output module 130 as the abnormality detection result; otherwise, the pair of voices is passed to the voice preprocessing module 122.
  • Sentence missing judgment 1003: after the speech preprocessing module 122 finishes the sentence-division signal preprocessing, it passes the result to the abnormal speech detection module 121, which performs the missing-sentence abnormality judgment. After speech preprocessing, the two speech segments have been divided into sentences according to speech activity, giving a sentence-division result based on silent periods. The number of sentences in the original speech (Utt_ref) is compared with the number of sentences in the test speech (Utt_de).
  • if Utt_ref > Utt_de, the abnormal voice detection module 121 determines that the test speech has a missing-content exception and transmits the missing-content exception as the abnormality detection result to the merge output module 130; otherwise, the abnormal voice detection module 121 performs intermittent judgment 1004.
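  • the comparison itself is simple; a sketch, assuming the sentence counts come from the silent-period division (for example, a divider like the divide_sentences sketch earlier):
```python
# Missing-content judgment 1003: the original speech splitting into more
# sentences than the test speech indicates missing content in the test.
def has_missing_content(utt_ref: int, utt_de: int) -> bool:
    """utt_ref / utt_de: sentence counts of the original and test speech."""
    return utt_ref > utt_de
```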
  • Intermittent judgment 1004: if no abnormality was detected in the preceding sentence missing judgment, this step judges whether the test voice has an intermittence problem. The endpoint information of the speech segments recorded during sentence division is used to calculate the silent-period duration within each sentence of the original speech and the test speech, and the difference in silent-period duration between each test sentence and its original sentence is input into the classifier for intermittent-anomaly judgment.
  • if the classification result is abnormal, the abnormal voice detection module 121 determines that the test voice has an intermittent abnormality and transmits the intermittent abnormality as the abnormality detection result to the merge output module 130; otherwise, the pair of voices is passed into the speech preprocessing module 122 again.
  • Figure 14 is an example of judging the intermittent abnormality based on the silent period.
  • in the first sentence of the original speech, the length of the silent period between the speech segments S11 and S12 is len_1,
  • and the length of the silent period between the segments S21 and S22 in the second sentence is len_2;
  • in the test speech, the corresponding silent-period lengths are len'_1 and len'_2 respectively.
  • len_1 − len'_1 and len_2 − len'_2 are input into classifier B respectively; because the difference between len_2 and len'_2 is classified as abnormal by classifier B, the test voice has an intermittent abnormality problem.
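  • a sketch of this judgment, assuming per-sentence silent-period durations are already measured and classifier B follows a scikit-learn-style predict interface (all names and the label convention are illustrative):
```python
# Intermittent judgment 1004: feed each original-minus-test silent-period
# duration difference (e.g. len_2 - len'_2) to classifier B; any difference
# classified abnormal marks the test voice as intermittent.
def is_intermittent(orig_silences_s: list[float],
                    test_silences_s: list[float],
                    classifier_b) -> bool:
    for len_o, len_t in zip(orig_silences_s, test_silences_s):
        diff = len_o - len_t                         # e.g. len_2 - len'_2
        if classifier_b.predict([[diff]])[0] == 1:   # 1 = abnormal (assumed)
            return True
    return False
```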
  • the classifiers A and B used in the above abnormality test can be obtained by a machine learning method based on a training data set.
  • the training scheme of classifier A is as follows:
  • a standard training method is used to train classifier A on the loudness differences and their corresponding sample labels to obtain the classifier parameters.
  • likewise, a standard training method is used to train classifier B on the time differences of each silent period and their corresponding labels to obtain the classifier parameters.
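  • a minimal training sketch for both classifiers, using scikit-learn as an assumed stand-in for the unspecified "standard training method":
```python
# Train a classifier on a single scalar feature (loudness difference for A,
# silent-period duration difference for B) with 0/1 anomaly labels.
from sklearn.ensemble import RandomForestClassifier

def train_scalar_classifier(diffs: list[float], labels: list[int]):
    X = [[d] for d in diffs]         # shape (n_samples, 1)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf

# classifier_a = train_scalar_classifier(loudness_diffs, loudness_labels)
# classifier_b = train_scalar_classifier(silence_diffs, silence_labels)
```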
  • FIG. 15 shows a schematic working flowchart of a speech preprocessing module 122 provided in the present application. The steps involved in this workflow are as follows.
  • Signal preprocessing 1501: to reduce differences in system gain between different speech systems and to emphasize the frequency components important to auditory perception, the speech preprocessing module 122 adjusts the level of the two speech segments to the standard listening level and filters them with a bandpass filter.
  • Sentence segmentation 1502 Based on the endpoints of the speech segments recorded during the silence judgment in the abnormal speech detection module, sentence segmentation is performed on the original speech and the test speech, and the result of the sentence segmentation is transmitted to the speech anomaly detection module 121.
  • for an example of sentence division, refer to the method shown in FIG. 7.
  • Time alignment 1503: when the test voice and the original voice enter the voice preprocessing module again, the test voice has already passed the sentence-missing and sentence-discontinuity abnormality checks and neither problem was detected in it. It can therefore be determined that the sentences in the test voice correspond to the sentences in the original voice, and intra-sentence alignment can be performed on the test sentences.
  • each sentence of the test voice is divided into sub-test sentences using the method shown in FIG. 8. If δ_i is greater than 0, the time domain position of the sub-test sentence x_i in the first test sentence is later than the time domain position of the corresponding speech segment in the first original sentence; in this case a trough segment whose time length equals δ_i can be removed, working forward from the starting point (starting time domain position) of x_i. If δ_i is less than 0, the time domain position of the sub-test sentence x_i in the first test sentence is earlier than the time domain position of the corresponding speech segment in the first original sentence; in this case a silent segment whose length equals the absolute value of δ_i is inserted backward from the starting point (starting time domain position) of x_i. If δ_i is equal to 0, the time domain position of the sub-test sentence x_i in the first test sentence is the same as that of the corresponding speech segment in the first original sentence, and no alignment processing is required.
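  • a sketch of this rule, assuming δ_i is expressed in samples, that δ_i does not exceed x_i's start index, and that x_start is the start index of x_i within the test sentence (names are illustrative):
```python
# Positive delta_i: x_i is late, so remove delta_i samples of trough before
# it; negative delta_i: x_i is early, so insert |delta_i| silent samples.
import numpy as np

def align_sub_sentence(sentence: np.ndarray, x_start: int,
                       delta_i: int) -> np.ndarray:
    if delta_i > 0:      # remove a trough of length delta_i before x_i
        return np.concatenate([sentence[:x_start - delta_i],
                               sentence[x_start:]])
    if delta_i < 0:      # insert |delta_i| samples of silence before x_i
        pad = np.zeros(-delta_i, dtype=sentence.dtype)
        return np.concatenate([sentence[:x_start], pad, sentence[x_start:]])
    return sentence      # delta_i == 0: already aligned
```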
  • optionally, the first test sentence that has not undergone intra-sentence alignment may be replaced in the test voice with the first test sentence after intra-sentence alignment, and the first test sentence is then aligned with the first original sentence using the methods shown in FIGS. 3 to 5.
  • FIG. 16 shows a schematic structural diagram of an abnormal speech recognition module 123 provided in the present application.
  • the abnormal speech recognition module 123 performs abnormal detection on the original speech and the test speech based on the machine learning model.
  • the abnormal voice recognition module 123 performs abnormal voice detection processes including a training process and a detection process.
  • the training process is an optional process.
  • the abnormal voice recognition module 123 may use the trained model to perform the detection process.
  • the training process is shown below.
  • Feature extraction: To describe the difference between the test speech and the original speech, the abnormal voice recognition module 123 first extracts speech feature parameters from the two speeches frame by frame; the speech features include but are not limited to those shown in Table 1. The module then computes the difference between each group of feature parameters of the two speeches, for example, the difference between the Mel-frequency cepstral coefficients (MFCC) of the original speech and the MFCC of the test speech. Finally, based on the feature differences over the entire speech (original speech and test speech), statistical methods including but not limited to those in Table 2 are used to extract statistics of each group of feature parameters, yielding a fixed-dimension difference feature for the pair of speeches.
  • Anomaly recognition: A machine learning model (random forest, deep neural network, etc.) is used to learn under what kinds of differences the test speech is an abnormal voice, and which specific abnormal voice type it belongs to. The abnormal voice types are not limited to the five broad categories of mute, low energy, intermittent, noise, and low quality; they can also be subdivided into more specific types such as mute, low energy, intermittent, metallic sound, running-water sound, missing content, echo, and distortion.
  • The training procedure based on the machine learning model is shown in FIG. 17. After feature extraction is completed for T training samples, all obtained difference-description features, together with their respective anomaly labels (no anomaly or a specific anomaly type), are input into a multi-class machine learning model, and the trained anomaly recognition model is obtained. The anomaly recognition model mainly contains the mapping between the difference-description feature x and the label y.
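  • The training step might look like the following sketch, assuming scikit-learn's RandomForestClassifier and synthetic placeholder data in place of real difference features and labels; it is a hypothetical illustration, not the patent's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 39))   # difference-description features of 200 pairs
y = rng.choice(["normal", "mute", "intermittent", "noise"], size=200)

# The fitted model holds the mapping between feature x and label y.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
```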
  • The detection procedure based on the machine learning model is shown in FIG. 18. First, a pair of speeches is input and its difference features are extracted. The probability (or score) of the pair belonging to each anomaly type is then computed with the above machine learning model, which contains the correspondence between the anomaly types and the difference features. The anomaly type with the highest probability (or score) is taken as the anomaly classification result; if none of the probabilities (or scores) of the anomaly types satisfies a preset condition, the test speech in the pair may be considered a normal voice.
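  • Continuing the previous sketch, detection scores a pair's difference feature against every anomaly class and falls back to "normal" when no score clears a preset threshold; the 0.5 cut-off is an assumed value, not from the patent, and `model` is the classifier trained above.

```python
import numpy as np

def classify_pair(model, feature, threshold=0.5):
    """Return the anomaly type with the highest probability, or 'normal'."""
    probs = model.predict_proba(feature.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    # If no anomaly probability satisfies the preset condition,
    # the test speech of the pair is considered normal.
    if probs[best] < threshold:
        return "normal"
    return model.classes_[best]
```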
  • FIG. 19 is a schematic flowchart of another abnormal voice recognition method provided by this application.
  • First, a pair of to-be-tested speeches as shown in FIG. 20 is input (both at an 8K sampling rate), and the first part of detection by the abnormal voice detection module 121 is performed, that is, mute judgment 1001 and low-energy judgment 1002 are executed to rule out possible mute and low-energy anomalies in the test speech; the speech preprocessing module 122 then performs signal preprocessing 1501 and sentence division 1502.
  • The test speech and the original speech enter the abnormal voice detection module 121 again for the second part of the anomaly check, that is, the abnormal voice detection module 121 performs missing judgment 1003 and discontinuity judgment 1004. Based on the sentence division results, Utt_ref and Utt_de are both 2, l_de/l_ref > 0.9, and no silence period is detected inside either sentence of the test speech, which rules out possible missing content and discontinuity in the test speech; the abnormal voice detection module 121 then passes the test speech and the original speech into the speech preprocessing module 122 for further processing.
  • After detection by the abnormal voice detection module 121, the above test speech has no missing-content or discontinuity problem. The speech preprocessing module 122 performs sub-sentence division, sub-sentence delay computation, and in-sentence alignment on each test sentence in turn, and uses the aligned test sentences to complete the alignment between sentences; the alignment results are shown in FIG. 22.
  • The feature extractor 1231 extracts the difference features between the test speech and the original speech, and the anomaly recognizer 1232 classifies the difference features. The test speech in the above example is recognized as an abnormal voice containing a running-water sound, and the anomaly recognizer 1232 passes the result to the merge output module 130.
  • The merge output module 130 displays the output of the abnormal voice recognition module 123 to the user as the final output:
  • "The test speech is an abnormal voice and has a noise (running-water sound) problem."
  • In this embodiment, the test speech has an obvious delay relative to the original speech, and the doped noise has no obvious effect on the waveform; the detection steps based on non-machine-learning models therefore fail to identify the anomaly. Through time alignment 1503, each sentence and sub-sentence in the test speech can be quickly aligned with the corresponding segments in the original speech, and the anomaly of the test speech is then detected by the machine-learning-based anomaly detection model, which improves anomaly detection efficiency.
  • Based on the flow shown in FIG. 19, another example is as follows. First, a pair of to-be-tested speeches as shown in FIG. 23 is input (both at an 8K sampling rate), and the first part of detection by the abnormal voice detection module 121 is performed, that is, mute judgment 1001 and low-energy judgment 1002 are executed to rule out possible mute and low-energy anomalies in the test speech.
  • The speech preprocessing module 122 then performs signal preprocessing 1501 and sentence division 1502.
  • The test speech and the original speech enter the abnormal voice detection module 121 again for the second part of the anomaly check, that is, the abnormal voice detection module 121 performs missing judgment 1003 and discontinuity judgment 1004.
  • Based on the sentence division results, Utt_ref and Utt_de are equal and l_de/l_ref > 0.9, ruling out possible missing content in the test speech. Because the silence-period duration difference between a test sentence and the corresponding original sentence is greater than the preset discontinuity threshold T_d, the test speech is judged to have a discontinuity anomaly, and the anomaly result is passed directly into the merge output module 130 for further processing.
  • The merge output module 130 displays the output of the abnormal voice detection module 121 to the user as the final output:
  • "The test speech is an abnormal voice and has a discontinuity problem."
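  • A minimal sketch of this silence-period check, assuming the per-sentence silence durations (in seconds) have already been measured from the VAD endpoints; `t_d` is the preset discontinuity threshold and the values are made up for the example.

```python
def has_discontinuity(sil_test, sil_orig, t_d):
    """sil_test / sil_orig: silence-period duration per sentence (seconds)."""
    return any(st - so > t_d for st, so in zip(sil_test, sil_orig))

# Example: the second test sentence pauses 0.8 s longer than the original.
print(has_discontinuity([0.1, 0.9], [0.1, 0.1], t_d=0.5))  # True
```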
  • The above embodiments are merely examples. The merge output module 130 may also buffer the detection result of the abnormal voice detection module 121, wait for the detection result of the abnormal voice recognition module 123, and output the two results combined, so that anomalies in the test speech can be detected more comprehensively.
  • Examples of the voice alignment method provided in this application have been described in detail above. It can be understood that, to implement the above functions, the voice alignment apparatus includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should easily realize that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented by hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
  • In this application, the voice alignment apparatus may be divided into functional units according to the above method example; for example, functional units may be divided corresponding to the functions in the manner shown in FIG. 2, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in this application is schematic and is merely a logical function division; there may be other division manners in actual implementation.
  • When integrated units are used, FIG. 24 shows a possible schematic structural diagram of the voice alignment apparatus involved in the foregoing embodiments. The voice alignment apparatus 2400 includes an acquiring unit 2401, a detecting unit 2402, and an aligning unit 2403. The detecting unit 2402 and the aligning unit 2403 are used to support the apparatus 2400 in performing the detection and alignment steps shown in FIG. 2, and the acquiring unit 2401 is used to acquire the original speech and the test speech. The acquiring unit 2401, the detecting unit 2402, and the aligning unit 2403 may also be used to perform other processes of the techniques described herein. The voice alignment apparatus 2400 may further include a storage unit for storing program code and data of the apparatus 2400.
  • The acquiring unit 2401 is configured to acquire an original speech and a test speech, where the test speech is speech generated after the original speech is transmitted over a communication network.
  • The detecting unit 2402 is configured to perform missing detection and/or discontinuity detection on the test speech acquired by the acquiring unit 2401, where the missing detection is used to determine whether speech is missing from the test speech relative to the original speech, and the discontinuity detection is used to determine whether the test speech is discontinuous relative to the original speech.
  • The aligning unit 2403 is configured to align the test speech and the original speech according to the result of the missing detection and/or discontinuity detection by the detecting unit 2402 to obtain an aligned test speech and an aligned original speech, where the result of the missing detection and/or discontinuity detection is used to indicate the manner of aligning the test speech and the original speech.
  • The detecting unit 2402 and the aligning unit 2403 may be components of a processing unit. The processing unit may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
  • The acquiring unit 2401 may be a transceiver or a communication interface, and the storage unit may be a memory. When the processing unit is a processor, the acquiring unit 2401 is a communication interface, and the storage unit is a memory, the voice alignment apparatus involved in this application may be the apparatus shown in FIG. 25.
  • Referring to FIG. 25, the apparatus 2500 includes a processor 2502, a communication interface 2501, and a memory 2503. The communication interface 2501, the processor 2502, and the memory 2503 can communicate with each other through an internal connection path to transfer control and/or data signals.
  • The voice alignment apparatus 2400 and the voice alignment apparatus 2500 provided in this application determine the voice alignment method according to the results of the missing detection and/or discontinuity detection, and can use the most suitable method according to the specific situation of the test speech, thereby improving the efficiency of aligning speech.
  • The apparatus embodiments correspond fully to the method embodiments, and the corresponding modules perform the corresponding steps; for example, the acquiring unit performs the acquiring step in the method embodiment, and steps other than the acquiring step may be performed by a processing unit or a processor. For the functions of specific units, refer to the corresponding method embodiment; details are not repeated here.
  • The size of the sequence number of each process does not mean an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of this application.
  • The steps of the methods or algorithms described in connection with the disclosure of this application may be implemented by hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may be located in a voice alignment apparatus. Certainly, the processor and the storage medium may also exist as discrete components in a voice alignment apparatus.
  • In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A voice alignment method, including: acquiring an original speech and a test speech, the test speech being speech generated after the original speech is transmitted over a communication network (S210); performing missing detection and/or discontinuity detection on the test speech, where the missing detection is used to determine whether speech is missing from the test speech relative to the original speech, and the discontinuity detection is used to determine whether the test speech is discontinuous relative to the original speech (S220); and aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection to obtain an aligned original speech and an aligned test speech, where the result of the missing detection and/or discontinuity detection is used to indicate the manner of aligning the test speech and the original speech (S230). This voice alignment method determines how to align the speeches according to the results of the missing detection and/or discontinuity detection, and can use the most suitable method according to the specific situation of the test speech, thereby improving the efficiency of aligning speech.

Description

Voice alignment method and apparatus
This application claims priority to Chinese Patent Application No. 201810519857.2, filed with the Chinese Patent Office on May 28, 2018 and entitled "Voice alignment method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the audio field, and in particular, to a voice alignment method and apparatus.
Background
Abnormal voice recognition in a communication network is one of the problems communication operators face when resolving abnormal voice issues. Restricted by user privacy protection policies, operation and maintenance engineers can identify abnormal voices only in later repeated dial tests, and then reproduce the abnormal voice scenario and verify the effect after the problem is fixed.
One way to improve the efficiency of identifying problem voices is to input the original speech and the test speech into an algorithm model and use the model to identify anomalies in the test speech. To improve the accuracy of abnormal voice recognition, an alignment operation needs to be performed on the original speech and the test speech, that is, their starting and ending time-domain positions are aligned. However, results obtained with existing voice alignment methods may contain large errors, which can be overcome only with multiple algorithms and multiple rounds of processing.
Summary
This application provides a voice alignment method and apparatus. Before the original speech and the test speech are aligned, whether the test speech has missing speech and/or speech discontinuity is first detected, and a suitable algorithm is selected according to the detection result to align the test speech and the original speech, which can improve the efficiency of aligning speech.
According to a first aspect, a voice alignment method is provided, including: acquiring an original speech and a test speech, the test speech being speech generated after the original speech is transmitted over a communication network; performing missing detection and/or discontinuity detection on the test speech, where the missing detection is used to determine whether speech is missing from the test speech relative to the original speech, and the discontinuity detection is used to determine whether the test speech is discontinuous relative to the original speech; and aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection to obtain an aligned test speech and an aligned original speech, where the result of the missing detection and/or discontinuity detection is used to indicate the manner of aligning the test speech and the original speech.
For example, if the detection result is that the test speech has no missing speech and no long discontinuity, the sentences contained in the original speech and the sentences contained in the test speech can be aligned one by one, so as to further determine whether the test speech has other anomalies, where "no long discontinuity" means that the delay of a sentence of the test speech relative to the corresponding sentence of the original speech is smaller than a delay threshold. For another example, if the detection result is that the first sentence is missing from the test speech, the sentences of the original speech other than the first sentence can be aligned one by one with the sentences of the test speech, so as to further determine whether the test speech has other anomalies. For still another example, if the detection result is that the test speech has no missing speech but has a long discontinuity, that is, the delay of a sentence of the test speech relative to the corresponding sentence of the original speech is greater than the delay threshold, the delay threshold may be increased to facilitate further anomaly detection, in order to further determine whether the test speech has other anomalies.
Therefore, according to the voice alignment method provided in this application, the alignment method is determined according to the results of the missing detection and/or discontinuity detection, and the most suitable method can be used according to the specific situation of the test speech, thereby improving the efficiency of aligning speech.
Optionally, the original speech includes a first original sentence, the test speech includes a first test sentence, and the first original sentence corresponds to the first test sentence. The aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection includes:
when the test speech has no missing speech and/or no speech discontinuity, and when the starting time-domain position of the first test sentence is before the starting time-domain position of the first original sentence, inserting a first silent sentence before the starting time-domain position of the first test sentence so that the starting time-domain position of the first test sentence is aligned with that of the first original sentence, where the duration of the first silent sentence is equal to the time difference between the starting time-domain position of the first test sentence and that of the first original sentence.
Optionally, the aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection includes:
when the test speech has no missing speech and/or no speech discontinuity, and when the starting time-domain position of the first test sentence is after the starting time-domain position of the first original sentence, deleting a second silent sentence before the starting time-domain position of the first test sentence, where the duration of the second silent sentence is equal to the time difference between the starting time-domain position of the first test sentence and that of the first original sentence.
A speech can be divided into multiple sentences, each sentence being a set of frames whose amplitude values exceed a preset amplitude threshold; a silence period exists between any two adjacent sentences. A silence period may be an audio segment in which no voice activity is detected, or a set of at least one frame whose amplitude value is smaller than the preset amplitude threshold; for example, a silence period is the pause between two utterances. When the test speech has no missing speech and/or no speech discontinuity, according to the solution provided in this embodiment, the starting time-domain positions of the original speech and the test speech are aligned first: when the starting time-domain position of the first test sentence is earlier than that of the first original sentence, a silent speech segment, namely the first silent sentence, is inserted before the first test sentence, its duration equal to the time difference between the two starting time-domain positions; when the starting time-domain position of the first test sentence is later than that of the first original sentence, a silent speech segment, namely the second silent sentence, is deleted before the first test sentence, its duration equal to the time difference between the two starting time-domain positions. In this way, the sentences of the original speech and the sentences of the test speech are aligned.
Optionally, before the first silent sentence is inserted before the starting time-domain position of the first test sentence, or before the second silent sentence is deleted before the starting time-domain position of the first test sentence, the aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection further includes:
determining at least two original sentences according to silence periods in the original speech, the at least two original sentences including the first original sentence, where the silence periods in the original speech are used to indicate the division positions of the at least two original sentences; and
determining at least two test sentences according to silence periods in the test speech, the at least two test sentences including the first test sentence, where the silence periods in the test speech are used to indicate the division positions of the at least two test sentences.
A silence period, which may also be called a silent sentence or silent speech, is an audio segment in which no voice activity is detected, or a set of at least one frame whose amplitude value is smaller than a preset amplitude threshold, for example, the audio segment corresponding to the pause between two utterances. According to the technical solution provided in this embodiment, abnormal voice recognition processing can be performed only on the test sentences (audio segments with voice activity) in the test speech, and no longer on the silent sentences (silence periods) in the test speech. For example, suppose the test speech and the original speech are both 10 seconds long and each begins with a 1-second silence period; the voice alignment apparatus can then start abnormal voice recognition processing from the 1-second position of the test speech and skip the portion from second 0 to second 1, which reduces the workload of recognizing abnormal voices and improves the efficiency of recognizing abnormal voices.
Optionally, before the first silent sentence is inserted before the starting time-domain position of the first test sentence, or before the second silent sentence is deleted before the starting time-domain position of the first test sentence, the aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection further includes:
determining a first sub-test sentence and a second sub-test sentence according to a trough of the first test sentence, where the trough is a speech segment of the first test sentence in which the mean frame amplitude is smaller than or equal to an amplitude threshold, and the trough is used to indicate the division position of the first sub-test sentence and the second sub-test sentence;
determining a first sub-original sentence according to a cross-correlation coefficient and the first sub-test sentence, where the cross-correlation coefficient is used to indicate the similarity between a speech segment of the first original sentence and the first sub-test sentence, and the first sub-original sentence is the speech segment, among the speech segments of the first original sentence, with the highest similarity to the first sub-test sentence; and
aligning the first sub-test sentence with the first sub-original sentence according to the time offset of a first time-domain position relative to a second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position, where the first time-domain position is the time-domain position of the first sub-test sentence in the first test sentence, and the second time-domain position is the time-domain position of the first sub-original sentence in the first original sentence.
A trough may be a brief pause in the middle of an utterance; therefore, the first test sentence can be divided into at least two sub-test sentences based on troughs and the sub-test sentences aligned, making the alignment of the first test sentence with the first original sentence more precise, which helps improve the accuracy of subsequent abnormal voice recognition.
Optionally, the aligning the first sub-test sentence with the first sub-original sentence according to the time offset of the first time-domain position relative to the second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position includes: when the time offset is smaller than or equal to a delay threshold, aligning the first sub-test sentence with the first sub-original sentence according to the time offset of the first time-domain position relative to the second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position.
When the time offset is greater than the delay threshold, the delay of the first sub-test sentence relative to the first sub-original sentence is large and is probably caused by missing speech or a long discontinuity; alignment of the first sub-test sentence may be skipped and the anomaly result output directly. When the time offset is smaller than the delay threshold, the delay of the first sub-test sentence relative to the first sub-original sentence is small and may be caused by a short discontinuity, or may be a normal delay caused by transmission over the communication network; alignment can then be performed on the first sub-test sentence so as to further determine whether the first sub-test sentence has other anomalies. The above method can decide, according to the actual situation, whether to perform in-sentence alignment, which improves the flexibility of sentence alignment.
Optionally, the aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection further includes:
when the ending time-domain position of the test speech is before the ending time-domain position of the original speech, adding a third silent sentence after the ending time-domain position of the test speech, where the duration of the third silent sentence is equal to the time difference between the ending time-domain position of the test speech and that of the original speech.
Optionally, the aligning the test speech and the original speech according to the result of the missing detection and/or discontinuity detection further includes:
when the ending time-domain position of the test speech is after the ending time-domain position of the original speech, deleting a fourth silent sentence after the ending time-domain position of the test speech, where the duration of the fourth silent sentence is equal to the time difference between the ending time-domain position of the test speech and that of the original speech.
After the sentences of the original speech and the sentences of the test speech are aligned, in-sentence alignment may have been performed inside a test sentence, changing the duration of that test sentence; therefore, after all sentences of the original speech and the test speech are aligned, a time deviation may appear between the ending time-domain positions of the original speech and the test speech. The above method can align the ending time-domain positions of the original speech and the test speech.
Optionally, before the test speech and the original speech are aligned according to the result of the missing detection and/or discontinuity detection, the method further includes:
detecting the original speech and the test speech according to a preset abnormal voice detection model to determine whether the test speech is an abnormal voice, where the preset abnormal voice detection model is a non-machine-learning model, and the content detected by the non-machine-learning model is different from the content detected by the missing detection and/or different from the content detected by the discontinuity detection.
Optionally, the method further includes:
detecting the aligned test speech according to a machine learning model and the aligned original speech, to determine whether the aligned test speech is an abnormal voice, or to determine the anomaly type of the aligned test speech.
The preset abnormal voice detection models are usually detection models for some common abnormal voices; they are strongly targeted and can quickly detect one or more common abnormal voices. However, the preset abnormal voice detection models cannot detect uncommon abnormal voices, and they may also miss common abnormal voices. According to the solution provided in this embodiment, common abnormal voices are first detected with the preset abnormal voice detection models, uncommon abnormal voices are then detected with the machine learning model, and common abnormal voices are detected again with the machine learning model at the same time, which can improve the success rate of abnormal voice detection.
According to a second aspect, an apparatus for aligning speech is provided. The apparatus can implement the functions corresponding to the steps of the method in the first aspect; the functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units or modules corresponding to the above functions.
In a possible design, the apparatus includes a processor and a communication interface. The processor is configured to support the apparatus in performing the corresponding functions of the method in the first aspect. The communication interface is used to support communication between the apparatus and other network elements. The apparatus may further include a memory, which is coupled to the processor and stores the program instructions and data necessary for the apparatus.
According to a third aspect, a computer-readable storage medium is provided, storing computer program code that, when executed by a processing unit or a processor, causes a voice alignment apparatus to perform the method of the first aspect.
According to a fourth aspect, a chip is provided, storing instructions that, when run on a voice alignment apparatus, cause the chip to perform the method of the first aspect.
According to a fifth aspect, a computer program product is provided, including computer program code that, when run by a communication unit or communication interface and a processing unit or processor of a voice alignment apparatus, causes the voice alignment apparatus to perform the method of the first aspect.
Brief Description of Drawings
FIG. 1 is a schematic structural diagram of an abnormal voice recognition system provided in this application;
FIG. 2 is a schematic flowchart of a voice alignment method provided in this application;
FIG. 3 is a schematic diagram of a method for aligning a test sentence and an original sentence provided in this application;
FIG. 4 is a schematic diagram of another method for aligning a test sentence and an original sentence provided in this application;
FIG. 5 is a schematic diagram of still another method for aligning a test sentence and an original sentence provided in this application;
FIG. 6 is a schematic diagram of yet another method for aligning a test sentence and an original sentence provided in this application;
FIG. 7 is a schematic diagram of a sentence division method provided in this application;
FIG. 8 is a schematic diagram of a sub-sentence division method provided in this application;
FIG. 9 is a schematic flowchart of an abnormal voice recognition method provided in this application;
FIG. 10 is a schematic workflow diagram of an abnormal voice detection module provided in this application;
FIG. 11 is a schematic diagram of a missing-content anomaly provided in this application;
FIG. 12 is a schematic diagram of another missing-content anomaly provided in this application;
FIG. 13 is a schematic diagram of still another missing-content anomaly provided in this application;
FIG. 14 is a schematic diagram of a discontinuity anomaly provided in this application;
FIG. 15 is a schematic workflow diagram of a speech preprocessing module provided in this application;
FIG. 16 is a schematic structural diagram of an abnormal voice recognition module provided in this application;
FIG. 17 is a schematic diagram of a training procedure based on a machine learning model provided in this application;
FIG. 18 is a schematic diagram of a detection procedure based on a machine learning model provided in this application;
FIG. 19 is a schematic flowchart of another abnormal voice recognition method provided in this application;
FIG. 20 is a schematic diagram of a to-be-tested speech provided in this application;
FIG. 21 is a schematic diagram of a sentence division result provided in this application;
FIG. 22 is a schematic diagram of a sentence alignment result provided in this application;
FIG. 23 is a schematic diagram of another to-be-tested speech provided in this application;
FIG. 24 is a schematic structural diagram of a voice alignment apparatus provided in this application;
FIG. 25 is a schematic structural diagram of another voice alignment apparatus provided in this application.
Detailed Description
An abnormal voice refers to the phenomenon that a user subjectively perceives poor voice quality during a call. Common abnormal voices include at least one of the following phenomena:
Mute: during a call, at least one party cannot hear the other party.
Intermittent: during a call, the called party can hear the other party, but the voice keeps breaking up.
Low energy: excessive loss of speech energy during transmission, so that during a call the called party can hear the other party, but the voice is very low.
Noise: during a call, normal speech is doped with interfering sounds, such as metallic sounds or running-water sounds, causing listening discomfort.
Low quality: during a call, speech content is lost, speech is distorted, or echo appears, causing listening discomfort.
The technical solutions in this application are described below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of an abnormal voice recognition system applicable to this application. The system 100 includes:
A voice input module 110, configured to convert the sampling rate of input speeches. When the sampling rate of the input original speech differs from that of the test speech, the voice input module 110 can convert the sampling rates of the original speech and the test speech to the same sampling rate, the test speech being speech obtained after the original speech is transmitted over a communication network. For example, if the test speech uses a 16K sampling rate and the original speech uses an 8K sampling rate, the voice input module 110 can downsample the test speech to 8K before feeding the original speech and the test speech into the abnormal voice detection module 121.
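As an illustration of this sampling-rate conversion (a sketch assuming SciPy's polyphase resampler; the patent does not prescribe a particular resampling method), a 16K test speech can be brought down to the original speech's 8K rate like this:

```python
from math import gcd
from scipy.signal import resample_poly

def match_rate(x, x_rate, target_rate):
    """Resample waveform x from x_rate to target_rate, e.g. 16000 -> 8000."""
    if x_rate == target_rate:
        return x
    g = gcd(x_rate, target_rate)
    # Polyphase filtering: upsample by target_rate/g, downsample by x_rate/g.
    return resample_poly(x, up=target_rate // g, down=x_rate // g)
```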
An abnormal voice detection module 121, which detects, based on non-machine-learning models, whether the test speech has anomalies and the specific anomaly types. The non-machine-learning models are, for example, an acoustic echo detection model, a discontinuity detection model, and a background noise detection model; the anomalies are, for example, low quality, discontinuity, and noise.
A speech preprocessing module 122, configured to align the test speech and the original speech for subsequent abnormal voice detection. In this application, aligning speeches refers to aligning the starting and ending time-domain positions of two speeches; because the speech segments of two aligned speeches correspond one to one, abnormal voices are easier to identify on aligned speeches during abnormal voice detection.
An abnormal voice recognition module 123, which detects, based on machine learning models, whether the test speech has anomalies and the specific anomaly types. The machine learning models are, for example, a random forest model and a deep neural network model.
The abnormal voice detection module 121, the speech preprocessing module 122, and the abnormal voice recognition module 123 may be independent modules or integrated modules; for example, they may be integrated in a core abnormal voice recognition device 120.
A merge output module 130, which merges and outputs the abnormal-voice results detected by the abnormal voice detection module 121 and the abnormal voice recognition module 123. Merging means combining two identical results detected by the two modules into one result. For example, if both modules detect noise in the test speech, the merge output module 130 merges the two voice anomalies (noise) and outputs only one voice anomaly; for another example, if the abnormal voice detection module 121 detects discontinuity in the test speech and the abnormal voice recognition module 123 detects noise, the merge output module 130 can output two voice anomalies, that is, the test speech has discontinuity and noise.
The system 100 is merely an example of an abnormal voice recognition system applicable to this application. An applicable abnormal voice recognition system may have more or fewer modules than the system 100; for example, it may further include a display module, or it may not include the merge output module 130.
Below, the abnormal voice recognition method provided in this application is described in detail, taking the abnormal voice recognition system 100 shown in FIG. 1 as an example.
FIG. 2 shows a schematic flowchart of a voice alignment method provided in this application. The method 200 includes:
S210: Acquire an original speech and a test speech, the test speech being speech generated after the original speech is transmitted over a communication network.
S210 may be performed by the abnormal voice detection module 121 or the abnormal voice recognition module 123. The acquiring may be receiving the original speech and the test speech from the voice input module 110, in which case the two speeches may have the same sampling rate. As an optional example, the acquiring may also be obtaining, through other modules, speeches with different sampling rates.
S220: Perform missing detection and/or discontinuity detection on the test speech, where the missing detection is used to determine whether speech is missing from the test speech relative to the original speech, and the discontinuity detection is used to determine whether the test speech is discontinuous relative to the original speech.
Missing speech is one kind of the low-quality abnormal voices mentioned above. The missing detection and/or discontinuity detection on the test speech may be performed by the abnormal voice detection module 121 or the abnormal voice recognition module 123; for specific detection methods, refer to missing detection and discontinuity detection methods in the prior art, which are not described here for brevity.
S230: Align the test speech and the original speech according to the result of the missing detection and/or discontinuity detection to obtain an aligned test speech and an aligned original speech, where the result of the missing detection and/or discontinuity detection is used to indicate the manner of aligning the test speech and the original speech.
S230 may be performed by the speech preprocessing module 122. For example, if the detection result is that the test speech has no missing speech and no long discontinuity, the sentences contained in the original speech and the sentences contained in the test speech can be aligned one by one, where "no long discontinuity" means that the delay of a sentence of the test speech relative to the corresponding sentence of the original speech is smaller than a delay threshold. For another example, if the detection result is that the first sentence is missing from the test speech, the sentences of the original speech other than the first sentence can be aligned one by one with the sentences of the test speech. For still another example, if the detection result is that the test speech has no missing speech but has a long discontinuity, that is, the delay of a sentence of the test speech relative to the corresponding sentence of the original speech is greater than the delay threshold, then, to further determine whether the test speech has other anomalies, the delay threshold may be increased to facilitate further anomaly detection, or the alignment of the test speech and the original speech may be skipped and the anomaly result output directly.
Therefore, according to the voice alignment method provided in this application, the alignment method is determined according to the results of the missing detection and/or discontinuity detection, and the most suitable method can be used according to the specific situation of the test speech, thereby improving the efficiency of aligning speech.
It should be noted that even if the original speech and the test speech used in S220 have different sampling rates, to ensure the precision of the alignment result (also called "the accuracy of the alignment result"), the sampling rates of the test speech and the original speech still need to be converted to the same sampling rate when the alignment is performed.
It should be understood that the method 200 may be implemented by program code running on a general-purpose processor, by dedicated hardware, or by a combination of software and hardware (program code combined with dedicated hardware).
Optionally, the original speech includes a first original sentence, the test speech includes a first test sentence, and the first original sentence corresponds to the first test sentence. S230 includes:
when the test speech has no missing speech and/or no speech discontinuity, and when the starting time-domain position of the first test sentence is before the starting time-domain position of the first original sentence, inserting a first silent sentence before the starting time-domain position of the first test sentence so that the starting time-domain position of the first test sentence is aligned with that of the first original sentence, where the duration of the first silent sentence is equal to the time difference between the starting time-domain position of the first test sentence and that of the first original sentence; or,
when the test speech has no missing speech and/or no speech discontinuity, and when the starting time-domain position of the first test sentence is after the starting time-domain position of the first original sentence, deleting a second silent sentence before the starting time-domain position of the first test sentence, where the duration of the second silent sentence is equal to the time difference between the starting time-domain position of the first test sentence and that of the first original sentence.
A speech can be divided into multiple sentences, each sentence being a set of frames whose amplitude values exceed a preset amplitude threshold; a silence period exists between any two adjacent sentences. A silence period may be an audio segment in which no voice activity is detected, or a set of at least one frame whose amplitude value is smaller than the preset amplitude threshold; for example, a silence period is the pause between two utterances.
The first original sentence is any sentence in the original speech. The original speech may contain only the first original sentence, or may contain sentences other than the first original sentence; correspondingly, the test speech may contain only the first test sentence, or may contain sentences other than the first test sentence.
When the test speech has no missing speech and/or no speech discontinuity, according to the solution provided in this embodiment, the starting time-domain positions of the original speech and the test speech are aligned first: when the starting time-domain position of the first test sentence is earlier than that of the first original sentence, a silent speech segment, namely the first silent sentence, is inserted before the first test sentence, its duration equal to the time difference between the two starting time-domain positions; when the starting time-domain position of the first test sentence is later than that of the first original sentence, a silent speech segment, namely the second silent sentence, is deleted before the first test sentence, its duration equal to the time difference between the two starting time-domain positions. In this way, the sentences of the original speech and the sentences of the test speech are aligned.
It should be noted that "inserting" refers to adding a silent speech segment at any time-domain position before the starting time-domain position of the first test sentence, so that the first test sentence is shifted as a whole along the time axis. For example, suppose the starting time-domain positions of the original speech and the test speech are both 0 seconds (s), that is, they are in an aligned state; the starting time-domain position of the first original sentence is 10 s, and that of the first test sentence is 5 s, that is, the first test sentence starts before the first original sentence. In this case, a silent speech segment (the first silent sentence) can be inserted at any point within the 0-5 s time-domain range of the test speech, so that the first test sentence moves backward by 5 s as a whole along the time axis, thereby aligning the starting time-domain position of the first test sentence with that of the first original sentence.
FIG. 3 to FIG. 6 respectively show several methods provided in this application for aligning a test sentence and an original sentence.
As shown in FIG. 3, the position of the first test sentence in the test speech (the first position) lags the position of the first original sentence in the original speech (the second position) by a certain duration. To align the first original sentence and the first test sentence, a silent speech segment can be deleted before the first test sentence; this deleted segment is called the second silent sentence, and its duration equals the lag of the first position relative to the second position, thereby aligning the first original sentence and the first test sentence.
It should be noted that in this application, for brevity of description, "time-domain position" is sometimes abbreviated as "position".
As shown in FIG. 4, the position of the first test sentence in the test speech (the first position) leads the position of the first original sentence in the original speech (the second position) by a certain duration. To align the first original sentence and the first test sentence, a silent speech segment can be added before the first test sentence; this added segment is called the first silent sentence, and its duration equals the lead of the first position relative to the second position, thereby aligning the first original sentence and the first test sentence.
As shown in FIG. 5, the first test sentence is the last sentence of the test speech, the first original sentence is the last sentence of the original speech, and the starting time-domain positions of the first test sentence and the first original sentence are already aligned. Because in-sentence alignment was performed inside the first test sentence (for example, aligning sub-sentences of the first test sentence with sub-sentences of the first original sentence), the duration of the first test sentence became longer, so the ending position of the test speech is after the ending position of the original speech. To align the test speech and the original speech, a silent speech segment can be deleted before the ending position of the test speech; this deleted segment is called the fourth silent sentence, and its duration equals the lag of the test speech's ending position relative to the original speech's ending position, thereby aligning the original speech and the test speech.
As shown in FIG. 6, the first test sentence is the last sentence of the test speech, the first original sentence is the last sentence of the original speech, and the starting time-domain positions of the first test sentence and the first original sentence are already aligned. Because in-sentence alignment was performed inside the first test sentence (for example, aligning sub-sentences of the first test sentence with sub-sentences of the first original sentence), the duration of the first test sentence became shorter, so the ending position of the test speech is before the ending position of the original speech. To align the test speech and the original speech, a silent speech segment can be added after the ending position of the test speech; this added segment is called the third silent sentence, and its duration equals the lead of the test speech's ending position relative to the original speech's ending position, thereby aligning the original speech and the test speech.
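The four cases of FIG. 3 to FIG. 6 amount to padding or trimming silence so that the sentence starts, and the overall endpoints, coincide. The following is a minimal NumPy sketch under the assumption that positions are sample offsets into the waveforms; the function names are illustrative only.

```python
import numpy as np

def align_start(test, test_start, orig_start):
    """Insert or delete silence before a test sentence so its starting
    position matches the original sentence's (offsets in samples)."""
    d = orig_start - test_start
    if d > 0:   # test sentence starts early: insert a silent sentence (FIG. 4)
        return np.concatenate([np.zeros(d, dtype=test.dtype), test])
    if d < 0:   # test sentence starts late: delete silence before it (FIG. 3)
        return test[-d:]
    return test

def align_end(test, orig_len):
    """Pad trailing silence (FIG. 6) or trim it (FIG. 5) so both
    speeches end at the same time-domain position."""
    if len(test) < orig_len:
        return np.concatenate(
            [test, np.zeros(orig_len - len(test), dtype=test.dtype)])
    return test[:orig_len]
```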
Optionally, before the first silent sentence is added before the starting time-domain position of the first test sentence, or before the second silent sentence is deleted before the starting time-domain position of the first test sentence, S230 further includes:
determining at least two original sentences according to silence periods in the original speech, the at least two original sentences including the first original sentence, where the silence periods in the original speech are used to indicate the division positions of the at least two original sentences; and
determining at least two test sentences according to silence periods in the test speech, the at least two test sentences including the first test sentence, where the silence periods in the test speech are used to indicate the division positions of the at least two test sentences.
A silence period, which may also be called a silent sentence or silent speech, is an audio segment in which no voice activity is detected, or a set of at least one frame whose amplitude value is smaller than a preset amplitude threshold, for example, the audio segment corresponding to the pause between two utterances. According to the technical solution provided in this embodiment, abnormal voice recognition processing can be performed only on the test sentences (audio segments with voice activity) in the test speech, and no longer on the silent sentences (silence periods) in the test speech. For example, suppose the test speech and the original speech are both 10 seconds long and each begins with a 1-second silence period; the voice alignment apparatus can then start abnormal voice recognition processing from the 1-second position of the test speech and skip the portion from second 0 to second 1, which reduces the workload of recognizing abnormal voices and improves the efficiency of recognizing abnormal voices.
FIG. 7 shows a schematic diagram of a sentence division method provided in this application.
In FIG. 7, the horizontal axis represents time and the vertical axis represents amplitude. Sentences can be divided for a speech according to a voice activity detection (VAD) algorithm, which can be configured as follows: when a speech segment contains at least 300 ms of continuous voice activity, the continuous voice activity can be taken as one segment. The speech shown in FIG. 7 contains three segments whose continuous voice activity exceeds 300 ms, so the speech can be divided into three segments.
After the segment division is completed, the following processing can be done:
If the time interval between the end point of segment S_i and the start point of segment S_{i+1} is smaller than a silence-period threshold (for example, 200 ms), segments S_i and S_{i+1} can be merged into one sentence;
If the interval between the end point of S_i and the start point of S_{i+1} is not smaller than the silence-period threshold, S_i and S_{i+1} are divided into two sentences;
If there is no further segment after S_i, the process ends after the last sentence containing S_i is obtained.
The sentence division is thus completed.
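A sketch of these merging rules, assuming a VAD has already produced (start, end) segment endpoints in seconds; the 0.3 s minimum activity and 0.2 s silence-period threshold are the example values given above.

```python
def divide_sentences(segments, min_activity=0.3, sil_threshold=0.2):
    """Merge VAD segments into sentences following the FIG. 7 rules."""
    # Keep only segments with at least min_activity of continuous activity.
    segs = [(s, e) for s, e in segments if e - s >= min_activity]
    sentences = []
    for s, e in segs:
        # Merge with the previous sentence if the gap is below the threshold.
        if sentences and s - sentences[-1][1] < sil_threshold:
            sentences[-1] = (sentences[-1][0], e)
        else:
            sentences.append((s, e))
    return sentences

print(divide_sentences([(0.0, 0.5), (0.6, 1.2), (2.0, 2.8)]))
# [(0.0, 1.2), (2.0, 2.8)] -- the first two segments merge across a 0.1 s gap
```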
Optionally, before the first silent sentence is inserted before the starting time-domain position of the first test sentence, or before the second silent sentence is deleted before the starting time-domain position of the first test sentence, S230 further includes:
determining a first sub-test sentence and a second sub-test sentence according to a trough of the first test sentence, where the trough is a speech segment of the first test sentence in which the mean frame amplitude is smaller than or equal to an amplitude threshold, and the trough is used to indicate the division position of the first sub-test sentence and the second sub-test sentence;
determining a first sub-original sentence according to a cross-correlation coefficient and the first sub-test sentence, where the cross-correlation coefficient is used to indicate the similarity between a speech segment of the first original sentence and the first sub-test sentence, and the first sub-original sentence is the speech segment, among the speech segments of the first original sentence, with the highest similarity to the first sub-test sentence; and
aligning the first sub-test sentence with the first sub-original sentence according to the time offset of a first time-domain position relative to a second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position, where the first time-domain position is the time-domain position of the first sub-test sentence in the first test sentence, and the second time-domain position is the time-domain position of the first sub-original sentence in the first original sentence.
A trough may be a brief pause in the middle of an utterance; therefore, the first test sentence can be divided into at least two sub-test sentences based on troughs and the sub-test sentences aligned, making the alignment of the first test sentence with the first original sentence more precise, which helps improve the accuracy of subsequent abnormal voice recognition.
Optionally, the aligning the first sub-test sentence with the first sub-original sentence according to the time offset of the first time-domain position relative to the second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position includes:
when the time offset is smaller than or equal to a delay threshold, aligning the first sub-test sentence with the first sub-original sentence according to the time offset of the first time-domain position relative to the second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position.
When the time offset is greater than the delay threshold, the delay of the first sub-test sentence relative to the first sub-original sentence is large and is probably caused by missing speech or a long discontinuity; alignment of the first sub-test sentence may be skipped and the anomaly result output directly. When the time offset is smaller than the delay threshold, the delay is small and may be caused by a short discontinuity, or may be a normal delay caused by transmission over the communication network; alignment can then be performed on the first sub-test sentence so as to further determine whether it has other anomalies. The above method can decide, according to the actual situation, whether to perform in-sentence alignment, which improves the flexibility of sentence alignment.
FIG. 8 shows a schematic diagram of a sub-sentence division method provided in this application.
The first test sentence is divided into frames with a frame length of 20 ms and a frame shift of 10 ms, and the mean amplitude of the speech waveform within each frame is computed. If the mean amplitude of a frame is smaller than 200, the frame is regarded as a trough. Taking troughs as boundary points, the first test sentence is divided into several sub-sentences (that is, sub-test sentences).
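A sketch of this trough detection with 20 ms frames, 10 ms shift, and the amplitude-mean threshold of 200 (16-bit PCM samples in a NumPy array are assumed):

```python
import numpy as np

def find_troughs(sentence, rate, frame_ms=20, shift_ms=10, threshold=200):
    """Return frame start indices whose mean |amplitude| marks a trough."""
    flen = rate * frame_ms // 1000
    shift = rate * shift_ms // 1000
    troughs = []
    for start in range(0, len(sentence) - flen + 1, shift):
        if np.abs(sentence[start:start + flen]).mean() < threshold:
            troughs.append(start)
    return troughs  # used as boundary points between sub-test sentences
```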
Based on the sub-sentences of the first test sentence divided as in FIG. 8, the speech segment in the first original sentence y corresponding to sub-test sentence x_i (x_i being any sub-test sentence of the first test sentence) is computed, together with the delay τ_i of x_i relative to that speech segment, as follows (the three formulas are published as images and are reconstructed here from the surrounding definitions):

$$r_{x_i,y}(n)=\sum_{m=1}^{N}x_i(m)\,y(m+n),\quad 0\le n\le M-N$$

$$\mathrm{corr}(x_i,y)=\arg\max_{n}\,r_{x_i,y}(n)$$

$$\tau_i=\mathrm{corr}(x_i,y)-t_{x_i}$$

where corr(x_i, y) is the most similar position of sub-test sentence x_i within the first original sentence y, obtained with the cross-correlation coefficient, that is, the position in the first original sentence y of the speech segment corresponding to x_i; t_{x_i} is the offset of sub-test sentence x_i within the first test sentence; n is the lag used when computing the cross-correlation coefficient; N is the time length of sub-test sentence x_i; and M is the time length of the first original sentence y. If τ_i is greater than a preset abnormal delay threshold (also simply called the "delay threshold"), the delay of x_i relative to the sub-original sentence in the first original sentence is large and is probably caused by missing speech or a long discontinuity; alignment of x_i may be skipped, or, to determine whether x_i has other anomalies, the abnormal delay threshold may be increased to facilitate further anomaly detection on x_i. If τ_i is smaller than the preset abnormal delay threshold, the delay of x_i relative to the sub-original sentence is small and may be caused by a short discontinuity, or may be a normal delay caused by transmission over the communication network; alignment can then be performed on x_i so as to further determine whether x_i has other anomalies.
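The delay computation can be sketched as follows, using a unit-normalized cross-correlation over lags 0..M−N; `sub_offset` stands for t_{x_i}, the sub-test sentence's offset inside its test sentence, and the normalization is an assumption consistent with a "cross-correlation coefficient".

```python
import numpy as np

def sub_sentence_delay(x_i, y, sub_offset):
    """tau_i = (most similar position of x_i in y) - (offset of x_i)."""
    n_len, m_len = len(x_i), len(y)
    best_n, best_r = 0, -np.inf
    for n in range(m_len - n_len + 1):
        seg = y[n:n + n_len]
        # Normalized cross-correlation coefficient at lag n.
        r = np.dot(x_i, seg) / (np.linalg.norm(x_i) * np.linalg.norm(seg) + 1e-9)
        if r > best_r:
            best_n, best_r = n, r
    return best_n - sub_offset  # in samples; divide by the rate for seconds
```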
Optionally, S230 further includes:
when the ending time-domain position of the test speech is before the ending time-domain position of the original speech, adding a third silent sentence after the ending time-domain position of the test speech, where the duration of the third silent sentence is equal to the time difference between the ending time-domain position of the test speech and that of the original speech; or,
when the ending time-domain position of the test speech is after the ending time-domain position of the original speech, deleting a fourth silent sentence after the ending time-domain position of the test speech, where the duration of the fourth silent sentence is equal to the time difference between the ending time-domain position of the test speech and that of the original speech.
After the sentences of the original speech and the sentences of the test speech are aligned, in-sentence alignment may have been performed inside a test sentence, changing the duration of that test sentence; therefore, after all sentences of the original speech and the test speech are aligned, a time deviation may appear between the ending time-domain positions of the original speech and the test speech. The above method can align the ending time-domain positions of the original speech and the test speech.
Optionally, before the test speech and the original speech are aligned according to the result of the missing detection and/or discontinuity detection, the method 200 further includes:
detecting the original speech and the test speech according to preset abnormal voice detection models to determine whether the test speech is an abnormal voice, where the preset abnormal voice detection models are non-machine-learning models, and the content detected by the non-machine-learning models is different from the content detected by the missing detection and/or different from the content detected by the discontinuity detection.
If missing detection was performed in S220, the above step no longer performs missing detection; if discontinuity detection was performed in S220, the above step no longer performs discontinuity detection; if both missing detection and discontinuity detection were performed in S220, the above step performs neither. Repeated detection is thereby avoided and detection efficiency improved.
The above step may be performed at any time before S230.
Because the preset abnormal voice detection models are usually detection models (non-machine-learning models) for some common abnormal voices and are strongly targeted, able to quickly detect one or more common abnormal voices, the above step can quickly determine whether the test speech has common anomalies.
Optionally, the method 200 further includes:
detecting the aligned test speech according to a machine learning model and the aligned original speech, to determine whether the aligned test speech is an abnormal voice, or to determine the anomaly type of the aligned test speech.
The preset abnormal voice detection models cannot detect uncommon abnormal voices, and they may also miss common abnormal voices. According to the solution provided in this embodiment, common abnormal voices are first detected with the preset abnormal voice detection models, and the machine learning model is then used to detect the test speech, determining whether the test speech has unknown anomalies and/or anomalies missed by the non-machine-learning models, which can increase the detection probability of anomalies in the test speech.
Below, based on the common features of this application described above, the embodiments of this application are further described in detail.
FIG. 9 is a schematic flowchart of an abnormal voice recognition method provided in this application.
A pair of speeches input by the user (the original speech and the test speech) first passes through the conversion of the voice input module 110. The two converted speeches (the original speech and the test speech) are passed to the abnormal voice detection module 121, which judges whether the test speech has anomalies such as mute and low energy. If an anomaly is detected, the detected result is passed to the merge output module 130 as the final anomaly recognition result; if no anomaly is detected, the speeches in which no anomaly was detected are passed to the speech preprocessing module 122.
After the speeches passed into the speech preprocessing module 122 complete signal preprocessing and sentence division, they are passed into the abnormal voice detection module 121. For the two speeches passed in the second time, the abnormal voice detection module 121 judges whether the test speech has anomalies such as sentence missing and discontinuity. If an anomaly is detected, the detected result is passed to the merge output module 130 as the final anomaly recognition result; if no anomaly is detected, the speeches are passed to the speech preprocessing module 122, which performs time alignment on the speeches passed in the second time and passes the two aligned speeches to the abnormal voice recognition module 123 for further anomaly recognition, and the recognition result is then output to the merge output module 130. Finally, the merge output module 130 merges the results of the abnormal voice detection module 121 and the abnormal voice recognition module 123 as the final detection result for the pair of speeches.
The procedure by which the abnormal voice detection module 121 performs anomaly detection is shown in FIG. 10. The steps of the method shown in FIG. 10 include:
Mute judgment 1001: Sliding-window detection is performed on the two input speeches with a VAD algorithm, and the endpoints of each speech segment are recorded. If the algorithm detects no voice activity in the test speech but detects voice activity in the original speech, the test speech is considered to have a mute anomaly, and the mute anomaly is passed to the merge output module 130 as the anomaly detection result; otherwise, low-energy judgment 1002 is performed.
Low-energy judgment 1002: If no mute anomaly was detected in the previous mute judgment, the loudness values of the original speech and the test speech are computed separately in this step. The loudness loss of the test speech relative to the original speech (test speech loudness minus original speech loudness) is input into classifier A for low-energy anomaly judgment. If classifier A's classification result is abnormal, the test speech is considered to have a low-energy anomaly, and the low-energy anomaly is passed to the merge output module 130 as the anomaly detection result; otherwise, the pair of speeches is passed into the speech preprocessing module 122.
Sentence-missing judgment 1003: After the speech preprocessing module 122 completes the signal preprocessing of sentence division, it passes the processing result into the abnormal voice detection module 121, which performs the anomaly judgment of sentence missing. After speech preprocessing, both speeches have been divided into sentences according to voice activity, and sentence division results based on silence periods are obtained. The number of sentences of the original speech (Utt_ref) and the number of sentences of the test speech (Utt_de) are compared: if Utt_ref ≠ Utt_de, the abnormal voice detection module 121 judges that the test sentence has a missing-content anomaly; if Utt_ref = Utt_de but the ratio of the length of a test sentence to the length of the corresponding original sentence is smaller than 0.9, the abnormal voice detection module 121 also judges that the test sentence has a missing-content anomaly, and the missing-content anomaly is passed to the merge output module 130 as the anomaly detection result; otherwise, the abnormal voice detection module 121 performs discontinuity judgment 1004.
Examples of missing-content anomalies are shown in FIG. 11 to FIG. 13. In FIG. 11, a long period of content is missing inside the second sentence of the test speech, leading to Utt_ref < Utt_de, where the left side is the original speech and the right side is the test speech, Utt_ref = 2, Utt_de = 3. In FIG. 12, the test speech has lost the second sentence of the original speech, leading to Utt_ref > Utt_de, where the left side is the original speech and the right side is the test speech, Utt_ref = 2, Utt_de = 1. In FIG. 13, the test speech has lost the beginning of the second sentence, so the length l_de of that sentence in the test speech is much smaller than its length l_ref in the original speech, that is, l_de/l_ref < 0.9, where the left side is the original speech and the right side is the test speech, Utt_ref = 2, Utt_de = 2.
Discontinuity judgment 1004: If no anomaly was detected in the previous sentence-missing judgment, this step judges whether the test speech has a discontinuity problem. Using the speech-segment endpoint information recorded during sentence division, the silence-period durations within each sentence of the original speech and the test speech are computed separately, and the silence-period duration difference between a test sentence and the corresponding original sentence is input into classifier B for discontinuity anomaly judgment. If classifier B's classification result is abnormal, the abnormal voice detection module 121 judges that the test speech has a discontinuity anomaly, and the discontinuity anomaly is passed to the merge output module 130 as the anomaly detection result; otherwise, the pair of speeches is passed into the speech preprocessing module 122 again.
FIG. 14 is an example of judging discontinuity anomalies based on silence periods. Suppose that in the first sentence of the test speech the silence-period length between speech segments S11 and S12 is len_1, and in the second sentence the silence-period length between segments S21 and S22 is len_2, while in the corresponding sentences of the original speech the silence-period lengths are len_1' and len_2'. When len_1 − len_1' and len_2 − len_2' are respectively input into classifier B, classifier B detects the difference between len_2 and len_2' as abnormal, so the test speech has a discontinuity anomaly.
The classifiers A and B used in the above anomaly checks can be obtained by machine learning methods based on a training data set.
The training scheme for classifier A is as follows:
Select all normal sample pairs and low-energy abnormal sample pairs in the training data set to obtain a sub-training data set for training classifier A;
Compute the loudness difference of each sample pair in the sub-training data set (test speech loudness minus original speech loudness);
Using standard training methods, train classifier A with the loudness differences and their corresponding sample labels to obtain the classifier parameters.
The training scheme for classifier B:
Select all normal sample pairs and discontinuity-abnormal sample pairs in the training data set, and input all selected sample pairs into the speech preprocessing module in turn to complete the speech preprocessing of sentence division, obtaining the sentence pairs within all sample pairs;
Label all sentence pairs divided from normal sample pairs as normal, and relabel the sentence pairs divided from discontinuity-abnormal sample pairs: only sentence pairs in which discontinuity exists are labeled abnormal, and the remaining cases are labeled normal, obtaining a sub-training data set for training classifier B;
Compute the silence-period duration difference of each sentence pair in the sub-training data set (silence-period length of the test sentence minus silence-period length of the original sentence);
Using standard training methods, train classifier B with the silence-period duration differences and their corresponding labels to obtain the classifier parameters.
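Both classifiers reduce to binary classification on a one-dimensional difference feature, so a sketch with scikit-learn (an assumed library choice; the patent only says "standard training methods", and all values below are made up) might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Classifier A: loudness loss (test - original) per sample pair.
loudness_diff = np.array([[-0.5], [-9.8], [-0.2], [-12.1]])
labels_a = ["normal", "low_energy", "normal", "low_energy"]
clf_a = DecisionTreeClassifier().fit(loudness_diff, labels_a)

# Classifier B: silence-period duration difference per sentence pair.
sil_diff = np.array([[0.02], [0.85], [0.01], [0.60]])
labels_b = ["normal", "intermittent", "normal", "intermittent"]
clf_b = DecisionTreeClassifier().fit(sil_diff, labels_b)

print(clf_a.predict([[-10.0]]), clf_b.predict([[0.7]]))
```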
The workflow of the speech preprocessing module 122 is described in detail below.
FIG. 15 shows a schematic workflow of the speech preprocessing module 122 provided in this application. The steps included in the workflow are as follows.
Signal preprocessing 1501: To reduce the system gain differences between different speech systems and emphasize the frequency components important to auditory perception, the speech preprocessing module 122 adjusts the levels of the two speeches to a standard auditory level and filters them with a band-pass filter.
Sentence division 1502: Based on the speech-segment endpoints recorded during the mute judgment in the abnormal voice detection module, the original speech and the test speech are each divided into sentences, and the sentence division results are passed into the abnormal voice detection module 121. For an example of sentence division, refer to the method shown in FIG. 7.
Time alignment 1503: When the test speech and the original speech enter the speech preprocessing module again, the test speech has passed the anomaly checks for sentence missing and sentence discontinuity, and no sentence-missing or discontinuity problem was detected in the test speech. Therefore, it can be determined that the sentences in the test speech correspond one to one to the sentences in the original speech, and in-sentence alignment can be performed on the test sentences.
Each sentence of the test speech is divided into sub-test sentences by the method shown in FIG. 8. If τ_i is greater than 0, the time-domain position of sub-test sentence x_i in the first test sentence lags the time-domain position of the corresponding speech segment in the first original sentence, and a trough segment (whose duration equals the value of τ_i) can be removed forward starting from the start point (starting time-domain position) of x_i. If τ_i is smaller than 0, the time-domain position of x_i in the first test sentence leads that of the corresponding speech segment in the first original sentence, and a silence-period segment (whose duration equals the absolute value of τ_i) is inserted backward starting from the start point (starting time-domain position) of x_i. If τ_i equals 0, the time-domain position of x_i in the first test sentence is the same as that of the corresponding speech segment in the first original sentence, and no alignment is needed.
After in-sentence alignment is performed on the first test sentence, the first test sentence without in-sentence alignment in the test speech can be replaced with the first test sentence after in-sentence alignment, and the first test sentence is then aligned with the first original sentence by the methods shown in FIG. 3 to FIG. 5.
FIG. 16 shows a schematic structural diagram of the abnormal voice recognition module 123 provided in this application. The abnormal voice recognition module 123 performs anomaly detection on the original speech and the test speech based on a machine learning model. Its processing includes a training procedure and a detection procedure, where the training procedure is optional: the abnormal voice recognition module 123 may run the detection procedure with an already-trained model.
The training procedure is as follows.
Feature extraction: To describe the difference between the test speech and the original speech, the abnormal voice recognition module 123 first extracts speech feature parameters from the two speeches frame by frame, the speech features including but not limited to those shown in Table 1. The abnormal voice recognition module 123 then computes the difference between each group of feature parameters of the two speeches, for example, the difference between the Mel-frequency cepstral coefficients (MFCC) of the original speech and the MFCC of the test speech. Finally, based on the feature differences over the entire speech (original speech and test speech), statistical methods including but not limited to those in Table 2 are used to extract the statistics of each group of feature parameters over the entire speech, yielding a fixed-dimension difference feature for the pair of speeches.
Table 1: frame-level speech feature parameters (published as an image; its full contents are not recoverable from this text).
Table 2: statistical methods used to summarize the feature differences (published as an image; its full contents are not recoverable from this text).
Anomaly recognition: A machine learning model (random forest, deep neural network, etc.) is used to learn under what kinds of differences the test speech is an abnormal voice, and which specific abnormal voice type it belongs to. The abnormal voice types are not limited to the five broad categories of mute, low energy, intermittent, noise, and low quality; they can also be subdivided into more specific types such as mute, low energy, intermittent, metallic sound, running-water sound, missing content, echo, and distortion.
The training procedure based on the machine learning model is shown in FIG. 17. After feature extraction is completed for T training samples, all obtained difference-description features, together with their respective anomaly labels (no anomaly or a specific anomaly type), are input into a multi-class machine learning model, and the trained anomaly recognition model is obtained. The anomaly recognition model mainly contains the mapping between the difference-description feature x and the label y.
The detection procedure based on the machine learning model is shown in FIG. 18.
First, a pair of speeches is input and its difference features are extracted.
The probability (or score) of the pair belonging to each anomaly type is computed with the above machine learning model, which contains the correspondence between the anomaly types and the difference features.
The anomaly type with the highest probability (or score) is taken as the anomaly classification result. If none of the probabilities (or scores) of the anomaly types satisfies a preset condition, the test speech in the pair may be considered a normal voice.
FIG. 19 is a schematic flowchart of another abnormal voice recognition method provided in this application.
First, a pair of to-be-tested speeches as shown in FIG. 20 is input (both at an 8K sampling rate), and the first part of detection by the abnormal voice detection module 121 is performed, that is, mute judgment 1001 and low-energy judgment 1002 are executed to rule out possible mute and low-energy anomalies in the test speech. The speech preprocessing module 122 is then used to perform signal preprocessing 1501 and sentence division 1502, producing the results shown in FIG. 21; Table 3 shows the sentence division results.
Table 3: sentence division results (published as an image; its contents are not recoverable from this text).
The test speech and the original speech enter the abnormal voice detection module 121 again for the second part of the anomaly check, that is, the abnormal voice detection module 121 performs missing judgment 1003 and discontinuity judgment 1004. Based on the sentence division results, Utt_ref and Utt_de are both 2, l_de/l_ref > 0.9, and no silence period is detected inside either sentence of the test speech, which rules out possible missing content and discontinuity in the test speech; the abnormal voice detection module 121 then passes the test speech and the original speech into the speech preprocessing module 122 for further processing.
After detection by the abnormal voice detection module 121, the above test speech has no missing-content or discontinuity problem. The speech preprocessing module 122 performs sub-sentence division, sub-sentence delay computation, and in-sentence alignment on each test sentence in turn, and uses the aligned test sentences to complete the alignment between sentences; the aligned result is shown in FIG. 22.
The feature extractor 1231 extracts the difference features between the test speech and the original speech, and the anomaly recognizer 1232 classifies the difference features. The test speech in the above example is recognized as an abnormal voice containing a running-water sound, and the anomaly recognizer 1232 passes the result to the merge output module 130.
The merge output module 130 displays the output of the abnormal voice recognition module 123 to the user as the final output:
"The test speech is an abnormal voice and has a noise (running-water sound) problem."
In this embodiment, the test speech has an obvious delay relative to the original speech, and the doped noise has no obvious effect on the waveform; the detection steps based on non-machine-learning models therefore fail to identify the anomaly. Through time alignment 1503, each sentence and sub-sentence in the test speech can be quickly aligned with the corresponding segments in the original speech, and the anomaly of the test speech is then detected by the machine-learning-based anomaly detection model, which improves anomaly detection efficiency.
Based on the flow shown in FIG. 19, another example is as follows.
First, a pair of to-be-tested speeches as shown in FIG. 23 is input (both at an 8K sampling rate), and the first part of detection by the abnormal voice detection module 121 is performed, that is, mute judgment 1001 and low-energy judgment 1002 are executed to rule out possible mute and low-energy anomalies in the test speech. The speech preprocessing module 122 then performs signal preprocessing 1501 and sentence division 1502.
The test speech and the original speech enter the abnormal voice detection module 121 again for the second part of the anomaly check, that is, the abnormal voice detection module 121 performs missing judgment 1003 and discontinuity judgment 1004. Based on the sentence division results, Utt_ref and Utt_de are equal and l_de/l_ref > 0.9, ruling out possible missing content in the test speech. Because the silence-period duration difference between a test sentence and the corresponding original sentence is greater than the preset discontinuity threshold T_d, the test speech is judged to have a discontinuity anomaly, and the anomaly result is passed directly into the merge output module 130 for further processing.
The merge output module 130 displays the output of the abnormal voice detection module 121 to the user as the final output:
"The test speech is an abnormal voice and has a discontinuity problem."
In this embodiment, the test speech has a discontinuity problem, and the non-machine-learning detection method can detect the anomaly without training, which improves the efficiency of voice anomaly detection. The above embodiments are merely examples; the merge output module 130 may also buffer the detection result of the abnormal voice detection module 121, wait for the detection result of the abnormal voice recognition module 123, and output the two results combined, so that anomalies in the test speech can be detected more comprehensively.
Examples of the voice alignment method provided in this application have been described in detail above. It can be understood that, to implement the above functions, the voice alignment apparatus includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should easily realize that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented by hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
In this application, the voice alignment apparatus may be divided into functional units according to the above method example; for example, functional units may be divided corresponding to the functions in the manner shown in FIG. 2, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in this application is schematic and is merely a logical function division; there may be other division manners in actual implementation.
When integrated units are used, FIG. 24 shows a possible schematic structural diagram of the voice alignment apparatus involved in the foregoing embodiments. The voice alignment apparatus 2400 includes an acquiring unit 2401, a detecting unit 2402, and an aligning unit 2403. The detecting unit 2402 and the aligning unit 2403 are used to support the apparatus 2400 in performing the detection and alignment steps shown in FIG. 2. The acquiring unit 2401 is used to acquire the original speech and the test speech. The acquiring unit 2401, the detecting unit 2402, and the aligning unit 2403 may also be used to perform other processes of the techniques described herein. The voice alignment apparatus 2400 may further include a storage unit for storing program code and data of the apparatus 2400.
The acquiring unit 2401 is configured to acquire an original speech and a test speech, the test speech being speech generated after the original speech is transmitted over a communication network.
The detecting unit 2402 is configured to perform missing detection and/or discontinuity detection on the test speech acquired by the acquiring unit 2401, the missing detection being used to determine whether speech is missing from the test speech relative to the original speech, and the discontinuity detection being used to determine whether the test speech is discontinuous relative to the original speech.
The aligning unit 2403 is configured to align the test speech and the original speech according to the result of the missing detection and/or discontinuity detection by the detecting unit 2402 to obtain an aligned test speech and an aligned original speech, where the result of the missing detection and/or discontinuity detection is used to indicate the manner of aligning the test speech and the original speech.
The detecting unit 2402 and the aligning unit 2403 may be components of a processing unit. The processing unit may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The acquiring unit 2401 may be a transceiver or a communication interface. The storage unit may be a memory.
When the processing unit is a processor, the acquiring unit 2401 is a communication interface, and the storage unit is a memory, the voice alignment apparatus involved in this application may be the apparatus shown in FIG. 25.
Referring to FIG. 25, the apparatus 2500 includes a processor 2502, a communication interface 2501, and a memory 2503. The communication interface 2501, the processor 2502, and the memory 2503 can communicate with each other through an internal connection path to transfer control and/or data signals.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the apparatus and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
The voice alignment apparatus 2400 and the voice alignment apparatus 2500 provided in this application determine the voice alignment method according to the results of the missing detection and/or discontinuity detection, and can use the most suitable method according to the specific situation of the test speech, thereby improving the efficiency of aligning speech.
The apparatus embodiments correspond fully to the method embodiments, and the corresponding modules perform the corresponding steps; for example, the acquiring unit performs the acquiring step in the method embodiment, and steps other than the acquiring step may be performed by a processing unit or a processor. For the functions of specific units, refer to the corresponding method embodiment; details are not repeated.
In the embodiments of this application, the size of the sequence numbers of the processes does not mean an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of this application.
In addition, the term "and/or" in this document merely describes an association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects.
The steps of the methods or algorithms described in connection with the disclosure of this application may be implemented by hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable read-only memory (erasable programmable ROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Certainly, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may be located in a voice alignment apparatus. Certainly, the processor and the storage medium may also exist as discrete components in a voice alignment apparatus.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
The specific implementations described above further describe in detail the objectives, technical solutions, and beneficial effects of this application. It should be understood that the above descriptions are merely specific implementations of this application and are not intended to limit the protection scope of this application; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of this application shall fall within the protection scope of this application.

Claims (18)

  1. A voice alignment method, comprising:
    acquiring an original speech and a test speech, wherein the test speech is speech generated after the original speech is transmitted over a communication network;
    performing missing detection and/or discontinuity detection on the test speech, wherein the missing detection is used to determine whether speech is missing from the test speech relative to the original speech, and the discontinuity detection is used to determine whether the test speech is discontinuous relative to the original speech; and
    aligning the test speech and the original speech according to a result of the missing detection and/or the discontinuity detection to obtain an aligned test speech and an aligned original speech, wherein the result of the missing detection and/or the discontinuity detection is used to indicate a manner of aligning the test speech and the original speech.
  2. The method according to claim 1, wherein the original speech comprises a first original sentence, the test speech comprises a first test sentence, and the first original sentence corresponds to the first test sentence,
    the aligning the test speech and the original speech according to the result of the missing detection and/or the discontinuity detection comprises:
    when the test speech has no missing speech and/or no speech discontinuity, and when a starting time-domain position of the first test sentence is before a starting time-domain position of the first original sentence, inserting a first silent sentence before the starting time-domain position of the first test sentence, wherein a duration of the first silent sentence is equal to a time difference between the starting time-domain position of the first test sentence and the starting time-domain position of the first original sentence; or,
    when the test speech has no missing speech and/or no speech discontinuity, and when the starting time-domain position of the first test sentence is after the starting time-domain position of the first original sentence, deleting a second silent sentence before the starting time-domain position of the first test sentence, wherein a duration of the second silent sentence is equal to the time difference between the starting time-domain position of the first test sentence and the starting time-domain position of the first original sentence.
  3. The method according to claim 2, wherein before the inserting a first silent sentence before the starting time-domain position of the first test sentence, or before the deleting a second silent sentence before the starting time-domain position of the first test sentence, the aligning the test speech and the original speech according to the result of the missing detection and/or the discontinuity detection further comprises:
    determining at least two original sentences according to silence periods in the original speech, wherein the at least two original sentences comprise the first original sentence, and the silence periods in the original speech are used to indicate division positions of the at least two original sentences; and
    determining at least two test sentences according to silence periods in the test speech, wherein the at least two test sentences comprise the first test sentence, and the silence periods in the test speech are used to indicate division positions of the at least two test sentences.
  4. The method according to claim 2 or 3, wherein before the inserting a first silent sentence before the starting time-domain position of the first test sentence, or before the deleting a second silent sentence before the starting time-domain position of the first test sentence, the aligning the test speech and the original speech according to the result of the missing detection and/or the discontinuity detection further comprises:
    determining a first sub-test sentence and a second sub-test sentence according to a trough of the first test sentence, wherein the trough is a speech segment of the first test sentence in which a mean amplitude of frames is smaller than or equal to an amplitude threshold, and the trough is used to indicate a division position of the first sub-test sentence and the second sub-test sentence;
    determining a first sub-original sentence according to a cross-correlation coefficient and the first sub-test sentence, wherein the cross-correlation coefficient is used to indicate a similarity between a speech segment of the first original sentence and the first sub-test sentence, and the first sub-original sentence is the speech segment, among speech segments of the first original sentence, with the highest similarity to the first sub-test sentence; and
    aligning the first sub-test sentence with the first sub-original sentence according to a time offset of a first time-domain position relative to a second time-domain position and by using the time-domain position of the first sub-original sentence as a reference position, wherein the first time-domain position is a time-domain position of the first sub-test sentence in the first test sentence, and the second time-domain position is a time-domain position of the first sub-original sentence in the first original sentence.
  5. The method according to claim 4, wherein the aligning the first sub-test sentence with the first sub-original sentence according to the time offset of the first time-domain position relative to the second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position comprises:
    when the time offset is smaller than or equal to a delay threshold, aligning the first sub-test sentence with the first sub-original sentence according to the time offset of the first time-domain position relative to the second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position.
  6. The method according to any one of claims 1 to 5, wherein the aligning the test speech and the original speech according to the result of the missing detection and/or the discontinuity detection further comprises:
    when an ending time-domain position of the test speech is before an ending time-domain position of the original speech, adding a third silent sentence after the ending time-domain position of the test speech, wherein a duration of the third silent sentence is equal to a time difference between the ending time-domain position of the test speech and the ending time-domain position of the original speech; or,
    when the ending time-domain position of the test speech is after the ending time-domain position of the original speech, deleting a fourth silent sentence after the ending time-domain position of the test speech, wherein a duration of the fourth silent sentence is equal to the time difference between the ending time-domain position of the test speech and the ending time-domain position of the original speech.
  7. The method according to any one of claims 1 to 6, wherein
    before the aligning the test speech and the original speech according to the result of the missing detection and/or the discontinuity detection, the method further comprises:
    detecting the test speech according to a preset abnormal voice detection model to determine whether the test speech is an abnormal voice, wherein the preset abnormal voice detection model is a non-machine-learning model, the content detected by the non-machine-learning model is different from the content detected by the missing detection, and/or the content detected by the non-machine-learning model is different from the content detected by the discontinuity detection.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    detecting the aligned test speech according to a machine learning model and the aligned original speech, to determine whether the aligned test speech is an abnormal voice, or to determine an anomaly type of the aligned test speech.
  9. An apparatus for aligning speech, comprising an acquiring unit, a detecting unit, and an aligning unit, wherein
    the acquiring unit is configured to acquire an original speech and a test speech, wherein the test speech is speech generated after the original speech is transmitted over a communication network;
    the detecting unit is configured to perform missing detection and/or discontinuity detection on the test speech, wherein the missing detection is used to determine whether speech is missing from the test speech relative to the original speech, and the discontinuity detection is used to determine whether the test speech is discontinuous relative to the original speech; and
    the aligning unit is configured to align the test speech and the original speech according to a result of the missing detection and/or the discontinuity detection to obtain an aligned test speech and an aligned original speech, wherein the result of the missing detection and/or the discontinuity detection is used to indicate a manner of aligning the test speech and the original speech.
  10. The apparatus according to claim 9, wherein the original speech comprises a first original sentence, the test speech comprises a first test sentence, and the first original sentence corresponds to the first test sentence,
    the aligning unit is specifically configured to:
    when the test speech has no missing speech and/or no speech discontinuity, and when a starting time-domain position of the first test sentence is before a starting time-domain position of the first original sentence, insert a first silent sentence before the starting time-domain position of the first test sentence, wherein a duration of the first silent sentence is equal to a time difference between the starting time-domain position of the first test sentence and the starting time-domain position of the first original sentence; or,
    when the test speech has no missing speech and/or no speech discontinuity, and when the starting time-domain position of the first test sentence is after the starting time-domain position of the first original sentence, delete a second silent sentence before the starting time-domain position of the first test sentence, wherein a duration of the second silent sentence is equal to the time difference between the starting time-domain position of the first test sentence and the starting time-domain position of the first original sentence.
  11. The apparatus according to claim 10, wherein before the first silent sentence is inserted before the starting time-domain position of the first test sentence, or before the second silent sentence is deleted before the starting time-domain position of the first test sentence, the aligning unit is further specifically configured to:
    determine at least two original sentences according to silence periods in the original speech, wherein the at least two original sentences comprise the first original sentence, and the silence periods in the original speech are used to indicate division positions of the at least two original sentences; and
    determine at least two test sentences according to silence periods in the test speech, wherein the at least two test sentences comprise the first test sentence, and the silence periods in the test speech are used to indicate division positions of the at least two test sentences.
  12. The apparatus according to claim 10 or 11, wherein before the first silent sentence is inserted before the starting time-domain position of the first test sentence, or before the second silent sentence is deleted before the starting time-domain position of the first test sentence, the aligning unit is further specifically configured to:
    determine a first sub-test sentence and a second sub-test sentence according to a trough of the first test sentence, wherein the trough is a speech segment of the first test sentence in which a mean amplitude of frames is smaller than or equal to an amplitude threshold, and the trough is used to indicate a division position of the first sub-test sentence and the second sub-test sentence;
    determine a first sub-original sentence according to a cross-correlation coefficient and the first sub-test sentence, wherein the cross-correlation coefficient is used to indicate a similarity between a speech segment of the first original sentence and the first sub-test sentence, and the first sub-original sentence is the speech segment, among speech segments of the first original sentence, with the highest similarity to the first sub-test sentence; and
    align the first sub-test sentence with the first sub-original sentence according to a time offset of a first time-domain position relative to a second time-domain position and by using the time-domain position of the first sub-original sentence as a reference position, wherein the first time-domain position is a time-domain position of the first sub-test sentence in the first test sentence, and the second time-domain position is a time-domain position of the first sub-original sentence in the first original sentence.
  13. The apparatus according to claim 12, wherein the aligning unit is further specifically configured to:
    when the time offset is smaller than or equal to a delay threshold, align the first sub-test sentence with the first sub-original sentence according to the time offset of the first time-domain position relative to the second time-domain position and by using the time-domain position of the first sub-original sentence as the reference position.
  14. The apparatus according to any one of claims 9 to 13, wherein the aligning unit is further specifically configured to:
    when an ending time-domain position of the test speech is before an ending time-domain position of the original speech, add a third silent sentence after the ending time-domain position of the test speech, wherein a duration of the third silent sentence is equal to a time difference between the ending time-domain position of the test speech and the ending time-domain position of the original speech; or,
    when the ending time-domain position of the test speech is after the ending time-domain position of the original speech, delete a fourth silent sentence after the ending time-domain position of the test speech, wherein a duration of the fourth silent sentence is equal to the time difference between the ending time-domain position of the test speech and the ending time-domain position of the original speech.
  15. The apparatus according to any one of claims 9 to 14, wherein
    before the test speech and the original speech are aligned according to the result of the missing detection and/or the discontinuity detection, the detecting unit is specifically configured to:
    detect the original speech and the test speech according to a preset abnormal voice detection model to determine whether the test speech is an abnormal voice, wherein the preset abnormal voice detection model is a non-machine-learning model, the content detected by the non-machine-learning model is different from the content detected by the missing detection, and/or the content detected by the non-machine-learning model is different from the content detected by the discontinuity detection.
  16. The apparatus according to any one of claims 9 to 15, wherein the detecting unit is further configured to:
    detect the aligned test speech according to a machine learning model and the aligned original speech, to determine whether the aligned test speech is an abnormal voice, or to determine an anomaly type of the aligned test speech.
  17. A device for aligning speech, comprising:
    a memory, configured to store instructions; and
    a processor, coupled to the memory and configured to invoke the instructions stored in the memory to perform the steps of the method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores computer program code, and when the computer program code is executed by a processing unit or a processor, an apparatus or device for aligning speech performs the steps of the method according to any one of claims 1 to 8.
PCT/CN2019/088591 2018-05-28 2019-05-27 对齐语音的方法和装置 WO2019228306A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19811458.9A EP3764361B1 (en) 2018-05-28 2019-05-27 Method and apparatus for aligning voices
FIEP19811458.9T FI3764361T3 (fi) 2018-05-28 2019-05-27 Menetelmä ja laite äänten kohdistamista varten
US17/068,131 US11631397B2 (en) 2018-05-28 2020-10-12 Voice alignment method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810519857.2A CN109903752B (zh) 2018-05-28 2018-05-28 对齐语音的方法和装置
CN201810519857.2 2018-05-28

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/068,131 Continuation US11631397B2 (en) 2018-05-28 2020-10-12 Voice alignment method and apparatus

Publications (1)

Publication Number Publication Date
WO2019228306A1 true WO2019228306A1 (zh) 2019-12-05

Family

ID=66943231

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088591 WO2019228306A1 (zh) 2018-05-28 2019-05-27 对齐语音的方法和装置

Country Status (5)

Country Link
US (1) US11631397B2 (zh)
EP (1) EP3764361B1 (zh)
CN (1) CN109903752B (zh)
FI (1) FI3764361T3 (zh)
WO (1) WO2019228306A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111199751B (zh) * 2020-03-04 2021-04-13 北京声智科技有限公司 一种麦克风的屏蔽方法、装置和电子设备
CN111489759A (zh) * 2020-03-23 2020-08-04 天津大学 基于光纤语音时域信号波形对齐的噪声评估方法
JP2021177598A (ja) * 2020-05-08 2021-11-11 シャープ株式会社 音声処理システム、音声処理方法、及び音声処理プログラム
CN111797708A (zh) * 2020-06-12 2020-10-20 瑞声科技(新加坡)有限公司 气流杂音检测方法、装置、终端及存储介质
CN111798868B (zh) 2020-09-07 2020-12-08 北京世纪好未来教育科技有限公司 语音强制对齐模型评价方法、装置、电子设备及存储介质
CN116597829B (zh) * 2023-07-18 2023-09-08 西兴(青岛)技术服务有限公司 一种提高语音识别精度的降噪处理方法及系统

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07283773A (ja) * 1994-04-06 1995-10-27 Fujitsu Ltd 移動局装置および基地局装置
IL132172A (en) * 1997-05-16 2003-10-31 British Telecomm Method and device for measurement of telecom signal quality
WO2000022803A1 (en) * 1998-10-08 2000-04-20 British Telecommunications Public Limited Company Measurement of speech signal quality
JP3049235B2 (ja) * 1998-11-17 2000-06-05 松下電器産業株式会社 複合的な文法ネットワークを用いる音声認識システム
US6499009B1 (en) * 1999-10-29 2002-12-24 Telefonaktiebolaget Lm Ericsson Handling variable delay in objective speech quality assessment
US20030023435A1 (en) * 2000-07-13 2003-01-30 Josephson Daryl Craig Interfacing apparatus and methods
US7197010B1 (en) * 2001-06-20 2007-03-27 Zhone Technologies, Inc. System for real time voice quality measurement in voice over packet network
CN100542077C (zh) * 2003-05-26 2009-09-16 华为技术有限公司 一种音频同步对齐测试方法
US20050216260A1 (en) * 2004-03-26 2005-09-29 Intel Corporation Method and apparatus for evaluating speech quality
US20070033027A1 (en) * 2005-08-03 2007-02-08 Texas Instruments, Incorporated Systems and methods employing stochastic bias compensation and bayesian joint additive/convolutive compensation in automatic speech recognition
DE102006044929B4 (de) * 2006-09-22 2008-10-23 Opticom Dipl.-Ing. Michael Keyhl Gmbh Vorrichtung zum Bestimmen von Informationen zur zeitlichen Ausrichtung zweier Informationssignale
EP1975924A1 (en) * 2007-03-29 2008-10-01 Koninklijke KPN N.V. Method and system for speech quality prediction of the impact of time localized distortions of an audio transmission system
CN101466116A (zh) 2008-12-19 2009-06-24 华为技术有限公司 一种确定发生故障的方法、故障定位方法和定位系统
CN101771869B (zh) * 2008-12-30 2011-09-28 深圳市万兴软件有限公司 一种音视频编解码装置及方法
JP5319788B2 (ja) * 2009-01-26 2013-10-16 テレフオンアクチーボラゲット エル エム エリクソン(パブル) オーディオ信号のアライメント方法
DK2465113T3 (en) * 2009-08-14 2015-04-07 Koninkl Kpn Nv PROCEDURE, COMPUTER PROGRAM PRODUCT AND SYSTEM FOR DETERMINING AN CONCEPT QUALITY OF A SOUND SYSTEM
EP2474975B1 (en) * 2010-05-21 2013-05-01 SwissQual License AG Method for estimating speech quality
CN101996662B (zh) * 2010-10-22 2012-08-08 深圳市万兴软件有限公司 视频文件的连接输出方法和装置
US9524733B2 (en) * 2012-05-10 2016-12-20 Google Inc. Objective speech quality metric
US20140180457A1 (en) * 2012-12-26 2014-06-26 Anshuman Thakur Electronic device to align audio flow
CN103077727A (zh) 2013-01-04 2013-05-01 华为技术有限公司 一种用于语音质量监测和提示的方法和装置
EP2922058A1 (en) * 2014-03-20 2015-09-23 Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO Method of and apparatus for evaluating quality of a degraded speech signal
CN104376850B (zh) * 2014-11-28 2017-07-21 苏州大学 一种汉语耳语音的基频估计方法
CN104464755B (zh) 2014-12-02 2018-01-16 科大讯飞股份有限公司 语音评测方法和装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002087137A2 (en) * 2001-04-24 2002-10-31 Nokia Corporation Methods for changing the size of a jitter buffer and for time alignment, communications system, receiving end, and transcoder
CN101111041A (zh) * 2007-08-09 2008-01-23 张科任 移动通信网络远程检测系统及通话质量远程检测方法
CN102044248A (zh) * 2009-10-10 2011-05-04 北京理工大学 一种针对流媒体音频质量的客观评测方法
CN102044247A (zh) * 2009-10-10 2011-05-04 北京理工大学 一种针对VoIP语音的客观评测方法
CN103474083A (zh) * 2013-09-18 2013-12-25 中国人民解放军电子工程学院 基于正交正弦脉冲序列定位标签的语音时间规整方法
CN103685795A (zh) * 2013-12-13 2014-03-26 广州华多网络科技有限公司 网络语音通信中的数据对齐方法和系统
CN105989837A (zh) * 2015-02-06 2016-10-05 中国电信股份有限公司 音频匹配方法及装置

Also Published As

Publication number Publication date
CN109903752B (zh) 2021-04-20
FI3764361T3 (fi) 2023-09-20
EP3764361B1 (en) 2023-08-30
EP3764361A1 (en) 2021-01-13
EP3764361A4 (en) 2021-06-23
US20210027769A1 (en) 2021-01-28
CN109903752A (zh) 2019-06-18
US11631397B2 (en) 2023-04-18

Similar Documents

Publication Publication Date Title
WO2019228306A1 (zh) 对齐语音的方法和装置
US11276407B2 (en) Metadata-based diarization of teleconferences
US8543402B1 (en) Speaker segmentation in noisy conversational speech
CN108962282B (zh) 语音检测分析方法、装置、计算机设备及存储介质
JP6800946B2 (ja) 音声区間の認識方法、装置及び機器
WO2021128741A1 (zh) 语音情绪波动分析方法、装置、计算机设备及存储介质
CN105161093B (zh) 一种判断说话人数目的方法及系统
US9368116B2 (en) Speaker separation in diarization
US20190385636A1 (en) Voice activity detection method and apparatus
CN111128223B (zh) 一种基于文本信息的辅助说话人分离方法及相关装置
US8976941B2 (en) Apparatus and method for reporting speech recognition failures
CN106847305B (zh) 一种处理客服电话的录音数据的方法及装置
US20170039440A1 (en) Visual liveness detection
CN110060665A (zh) 语速检测方法及装置、可读存储介质
CN109785846B (zh) 单声道的语音数据的角色识别方法及装置
WO2021042537A1 (zh) 语音识别认证方法及系统
JP5385677B2 (ja) 対話状態分割装置とその方法、そのプログラムと記録媒体
WO2022033109A1 (zh) 语音检测方法、装置和电子设备
CN111833902A (zh) 唤醒模型训练方法、唤醒词识别方法、装置及电子设备
CN105575402A (zh) 网络教学实时语音分析方法
US10522160B2 (en) Methods and apparatus to identify a source of speech captured at a wearable electronic device
CN109065024B (zh) 异常语音数据检测方法及装置
CN110867193A (zh) 一种段落英语口语评分方法及系统
CN112802498B (zh) 语音检测方法、装置、计算机设备和存储介质
WO2023193573A1 (zh) 一种音频处理方法、装置、存储介质及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19811458

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019811458

Country of ref document: EP

Effective date: 20201008

NENP Non-entry into the national phase

Ref country code: DE