WO2021179470A1 - Method, device and system for recognizing the sampling rate of pure voice data - Google Patents


Info

Publication number
WO2021179470A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequency
sampling rate
data
voice data
hypothetical
Application number
PCT/CN2020/097008
Other languages
English (en)
Chinese (zh)
Inventor
刘兵兵
包飞
吴科苇
刘如意
车洋
Original Assignee
苏宁易购集团股份有限公司
苏宁云计算有限公司
Priority date
2020-03-10
Filing date
2020-06-19
Publication date
2021-09-16
Application filed by 苏宁易购集团股份有限公司 and 苏宁云计算有限公司
Priority to CA3175103A (CA3175103A1)
Publication of WO2021179470A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • The invention belongs to the technical field of speech processing, and relates in particular to a method, device, and system for recognizing the sampling rate of pure voice data.
  • The sampling rate is the number of samples taken per second from a continuous signal to form a discrete signal. It characterizes the sound quality and timbre of a sound file, and is a measure of the quality of sound cards and sound files.
  • Pure voice data refers to voice data that carries no sampling rate information. When the sampling rate is unknown and the data is processed at a wrong sampling rate, the processing result often deviates greatly, the output voice quality is poor, and the user experience of voice products suffers.
  • The present invention therefore proposes a method, device, and system for recognizing the sampling rate of pure voice data.
  • The method can reduce the occurrence of bad packets among voice data packets transmitted over a network, adds sampling rate checking and reminder functions to voice communication engines, common audio processing software, and the like, and improves the robustness of voice processing equipment.
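A small numeric illustration of the problem described above, assuming nothing beyond the definition of sampling rate: headerless samples carry no rate, so a player that assumes the wrong rate scales every frequency and the total duration by the ratio of assumed to true rate. The helper names are illustrative and do not come from the patent.

```python
def apparent_frequency_hz(true_freq_hz: float, true_rate_hz: float, assumed_rate_hz: float) -> float:
    """Frequency heard when samples recorded at true_rate_hz are played at assumed_rate_hz."""
    # A sinusoid advances f / fs cycles per sample; that per-sample rate is fixed
    # in the data, so its frequency in Hz scales with the rate used for playback.
    return true_freq_hz * assumed_rate_hz / true_rate_hz


def apparent_duration_s(num_samples: int, assumed_rate_hz: float) -> float:
    """Duration of num_samples samples when interpreted at assumed_rate_hz."""
    return num_samples / assumed_rate_hz


def nyquist_hz(rate_hz: float) -> float:
    """Highest frequency a given sampling rate can represent."""
    return rate_hz / 2


# Two seconds of 8 kHz speech containing a 1 kHz component, mistakenly played at 16 kHz:
print(apparent_frequency_hz(1000, 8000, 16000))  # 2000.0
print(apparent_duration_s(16000, 16000))  # 1.0 (instead of 2.0)
print(nyquist_hz(8000))  # 4000.0
```

The last line also shows that 4000 Hz, the upper edge of the a priori voice band, is the highest frequency that 8 kHz audio can represent.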
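  • The present invention provides a method for recognizing the sampling rate of pure voice data, the method including:
  • performing a Fourier transform on the pure voice data to obtain frequency domain data;
  • processing the frequency domain data according to received a priori threshold data to obtain frequency band information;
  • obtaining a high-frequency cutoff frequency point of the frequency band information, and calculating, for each of several different preset sampling rates, the hypothetical frequency corresponding to that cutoff point;
  • comparing the different hypothetical frequencies with the a priori frequency, and determining, as the actual sampling rate, the sampling rate whose hypothetical frequency gives the most similar comparison result.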
  • The a priori frequency ranges from 200 Hz to 4000 Hz, the frequency-domain bandwidth of the human voice.
  • Comparing the different hypothetical frequencies with the a priori frequency and determining the actual sampling rate specifically includes:
  • calculating the Euclidean distance between each hypothetical frequency and the a priori frequency, and determining the sampling rate whose hypothetical frequency has the smallest Euclidean distance as the actual sampling rate.
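The hypothetical-frequency and Euclidean-distance steps can be sketched as follows. This is a minimal sketch under assumptions the patent does not state: the high-frequency cutoff frequency point is given as a bin index k of an N-point FFT, the hypothetical frequency for a preset rate fs is the frequency of that bin, k·fs/N, and the a priori frequency is taken as the 4000 Hz upper edge of the voice band. The preset rate list and all names are hypothetical.

```python
import math

# Hypothetical preset list; the patent does not enumerate the preset sampling rates.
PRESET_RATES_HZ = (8000, 16000, 22050, 32000, 44100, 48000)
# Assumption: compare against the upper edge of the 200-4000 Hz a priori voice band.
PRIOR_HIGH_HZ = 4000.0


def hypothetical_frequencies(cutoff_bin, fft_size, rates=PRESET_RATES_HZ):
    """Frequency (Hz) that FFT bin `cutoff_bin` would represent at each candidate rate."""
    # Bin k of an N-point DFT sits at k * fs / N Hz.
    return {fs: cutoff_bin * fs / fft_size for fs in rates}


def pick_sampling_rate(cutoff_bin, fft_size, prior_hz=PRIOR_HIGH_HZ, rates=PRESET_RATES_HZ):
    """Return the candidate rate whose hypothetical frequency is closest to the prior."""
    hyp = hypothetical_frequencies(cutoff_bin, fft_size, rates)
    # Euclidean distance between 1-D points, i.e. an absolute difference.
    return min(hyp, key=lambda fs: math.dist((hyp[fs],), (prior_hz,)))


# A cutoff at bin 128 of a 512-point FFT maps to 4000 Hz only if fs = 16000 Hz.
print(pick_sampling_rate(128, 512))  # 16000
print(pick_sampling_rate(256, 512))  # 8000
```

Because the hypothetical frequencies here are scalars, the Euclidean distance reduces to an absolute difference; `math.dist` (Python 3.8+) is used only to keep the distance explicit.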
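  • After the Fourier transform is performed on the pure voice data to obtain the frequency domain data, the method further includes: normalizing the frequency domain data.
  • The pure voice data is pure voice data that includes voice segments.
  • The pure voice data including voice segments is obtained by:
  • receiving voice data;
  • when the voice data does not include sampling rate information, performing a Fourier transform on the voice data to obtain the energy of the voice data;
  • according to a preset energy threshold, obtaining the voice data whose energy is greater than the energy threshold, which gives the pure voice data including the voice segments.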
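A minimal sketch of the energy-threshold step, assuming non-overlapping fixed-length frames and an energy measure computed from each frame's FFT; the frame length, the energy scaling, and the threshold value are illustrative choices, not values given in the patent.

```python
import numpy as np


def voiced_samples(samples, frame_len=256, energy_threshold=1e-3):
    """Keep only frames whose spectral energy exceeds energy_threshold.

    frame_len and energy_threshold are illustrative values, not from the patent.
    """
    x = np.asarray(samples, dtype=np.float64)
    n_frames = len(x) // frame_len
    if n_frames == 0:
        return np.empty(0)
    # Split into non-overlapping frames, dropping any incomplete tail.
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Energy of each frame from its one-sided spectrum, scaled by the frame length.
    spectra = np.fft.rfft(frames, axis=1)
    energy = np.sum(np.abs(spectra) ** 2, axis=1) / frame_len
    # Concatenate the frames above the threshold into the "pure voice data".
    return frames[energy > energy_threshold].reshape(-1)


rate = 8000
tone = 0.5 * np.sin(2 * np.pi * 1000 * np.arange(1024) / rate)
signal = np.concatenate([np.zeros(1024), tone])  # silence followed by a tone
print(len(voiced_samples(signal)))  # 1024
```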
  • The method further includes:
  • decoding the pure voice data according to the actual sampling rate.
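One hedged way to picture the decoding step: once a rate has been recognized, the headerless samples can be handed to any decoder or player that needs the rate, for example by wrapping them in a WAV container with Python's standard `wave` module. This assumes 16-bit mono PCM, which the patent does not specify, and `pcm_to_wav_bytes` is a hypothetical helper.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int, channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap headerless PCM in a WAV header that records the recognized sampling rate."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)  # bytes per sample (2 = 16-bit)
        w.setframerate(sample_rate)  # the actual sampling rate found by the method
        w.writeframes(pcm)
    return buf.getvalue()


wav = pcm_to_wav_bytes(bytes(1600), sample_rate=8000)  # 800 silent 16-bit frames
with wave.open(io.BytesIO(wav), "rb") as r:
    print(r.getframerate(), r.getnframes())  # 8000 800
```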
  • The present invention also provides a device for recognizing the sampling rate of pure voice data, including:
  • a conversion module, configured to perform a Fourier transform on the pure voice data to obtain frequency domain data;
  • an obtaining module, configured to process the frequency domain data according to received a priori threshold data to obtain frequency band information, and to obtain the high-frequency cutoff frequency point of the frequency band information;
  • a calculation module, configured to calculate, for different preset sampling rates, the corresponding hypothetical frequencies at the high-frequency cutoff frequency point;
  • a processing module, configured to compare the different hypothetical frequencies with the a priori frequency, and to determine, as the actual sampling rate, the sampling rate whose hypothetical frequency gives the most similar comparison result.
  • The a priori frequency ranges from 200 Hz to 4000 Hz.
  • The processing module is specifically configured to:
  • calculate the Euclidean distance between each hypothetical frequency and the a priori frequency, and determine the sampling rate whose hypothetical frequency has the smallest Euclidean distance as the actual sampling rate.
  • The conversion module is further configured to normalize the frequency domain data after performing the Fourier transform on the pure voice data.
  • The pure voice data is pure voice data that includes voice segments.
  • The device further includes:
  • a receiving module, configured to receive voice data;
  • the conversion module is further configured to perform a Fourier transform on the voice data to obtain the energy of the voice data when the voice data does not include sampling rate information;
  • the processing module is further configured to obtain, according to a preset energy threshold, the voice data whose energy is greater than the energy threshold, which gives the pure voice data including the voice segments.
  • The device further includes:
  • a decoding module, configured to decode the pure voice data according to the actual sampling rate.
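The conversion and obtaining modules described above can be sketched as follows. This is a minimal sketch under stated assumptions: the "frequency domain data" is taken to be a frame-averaged FFT magnitude spectrum normalized to a peak of 1, and the "a priori threshold data" is taken to be a single relative magnitude threshold, so the frequency band information is the set of bins above the threshold and the high-frequency cutoff point is its highest bin. The Hann window, FFT size, threshold value, and function names are all hypothetical.

```python
import numpy as np


def conversion_module(samples, fft_size=1024):
    """Frame-averaged magnitude spectrum, normalized so its peak is 1."""
    x = np.asarray(samples, dtype=np.float64)
    n_frames = len(x) // fft_size
    if n_frames == 0:
        raise ValueError("need at least fft_size samples")
    frames = x[: n_frames * fft_size].reshape(n_frames, fft_size)
    # A Hann window limits spectral leakage from strong low-frequency components.
    mag = np.abs(np.fft.rfft(frames * np.hanning(fft_size), axis=1)).mean(axis=0)
    peak = mag.max()
    if peak == 0:
        raise ValueError("input is silent")
    return mag / peak


def obtaining_module(norm_spectrum, threshold=1e-3):
    """Band information = bins above the threshold; returns (band_bins, cutoff_bin)."""
    band = np.flatnonzero(norm_spectrum > threshold)
    # The high-frequency cutoff point is the highest bin still inside the band.
    return band, int(band[-1])


rate = 16000  # known here only so the result can be checked
t = np.arange(rate) / rate
voice_like = sum(np.sin(2 * np.pi * f * t) for f in (300, 1000, 3800))
band, cutoff = obtaining_module(conversion_module(voice_like))
print(cutoff * rate / 1024)  # a little above 3800 Hz because of window leakage
```

The cutoff bin returned here is what the calculation module would convert into one hypothetical frequency per preset sampling rate. A relative threshold on the normalized spectrum is only one possible reading of the "a priori threshold data".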
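  • The present invention also provides a computer system, including:
  • one or more processors; and
  • a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the following operations:
  • performing a Fourier transform on the pure voice data to obtain frequency domain data;
  • processing the frequency domain data according to received a priori threshold data to obtain frequency band information;
  • obtaining a high-frequency cutoff frequency point of the frequency band information, and calculating, for each of several different preset sampling rates, the hypothetical frequency corresponding to that cutoff point;
  • comparing the different hypothetical frequencies with the a priori frequency, and determining, as the actual sampling rate, the sampling rate whose hypothetical frequency gives the most similar comparison result.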
  • By comparing different hypothetical frequencies of the pure voice data with the a priori frequency, the present invention can determine the actual sampling rate from the similarity of the comparison results. The sampling rate of pure voice data can thus be predicted automatically, which avoids the large impact on voice processing that an unknown sampling rate can have, reduces bad packets when voice data packets are transmitted over a network, adds sampling rate checking and reminder functions to voice communication engines, common audio processing software, and the like, and improves the robustness of voice processing equipment.
  • FIG. 1 is a flowchart of a method for recognizing the sampling rate of pure voice data according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic structural diagram of a device for recognizing the sampling rate of pure voice data according to Embodiment 2 of the present application;
  • FIG. 3 is a schematic structural diagram of a computer system according to Embodiment 3 of the present application.
  • The present application provides a method for recognizing the sampling rate of pure voice data, which can be applied to an audio device; the audio device performs the following processing:
  • the pure voice data is pure voice data including voice segments, and the above steps specifically include:
  • obtaining and analyzing the relevant information of the voice data, the relevant information including data packets, the sampling rate, and so on;
  • obtaining the voice data whose energy is greater than the energy threshold, which gives the pure voice data including the voice segments.
  • Receiving the voice data and preprocessing it to obtain the pure voice data can also be implemented as follows:
  • when the voice data does not include sampling rate information, filtering the voice data to obtain pure voice data including voice segments.
  • The filtering removes information such as noise and silence from the effective voice data, thereby enhancing the voice data.
  • The a priori threshold data is obtained by processing the prior information.
  • This step may include calculating the Euclidean distance between each hypothetical frequency and the a priori frequency. The smallest Euclidean distance indicates that the corresponding hypothetical frequency is the most similar to the a priori frequency, so the corresponding sampling rate is the closest to the true value. In this way, the actual sampling rate of the pure voice data is obtained.
  • Normal decoding and playback can then be performed at the identified sampling rate, improving the quality and experience of voice calls.
  • The method can also be used to predict the sampling rate and promptly alert staff, reducing unnecessary risks and losses; in common audio processing software, it can promptly remind the user that the sampling rate is set incorrectly, reducing wasted time and redundant operations for the user in daily life and work.
  • The present application also provides a device for recognizing the sampling rate of pure voice data, including:
  • a conversion module 21, configured to perform a Fourier transform on the pure voice data to obtain frequency domain data;
  • an obtaining module 22, configured to process the frequency domain data according to received a priori threshold data to obtain frequency band information, and to obtain the high-frequency cutoff frequency point of the frequency band information;
  • a calculation module 23, configured to calculate, for different preset sampling rates, the corresponding hypothetical frequencies at the high-frequency cutoff frequency point;
  • a processing module 24, configured to compare the different hypothetical frequencies with the a priori frequency, and to determine, as the actual sampling rate, the sampling rate whose hypothetical frequency gives the most similar comparison result.
  • The a priori frequency ranges from 200 Hz to 4000 Hz.
  • The processing module 24 is specifically configured to calculate the Euclidean distance between each hypothetical frequency and the a priori frequency, and to determine the sampling rate whose hypothetical frequency has the smallest Euclidean distance as the actual sampling rate.
  • The conversion module 21 is further configured to normalize the frequency domain data after performing the Fourier transform on the pure voice data.
  • The pure voice data is pure voice data that includes voice segments.
  • The device further includes:
  • a receiving module 25, configured to receive voice data;
  • an analysis module 26, configured to analyze the voice data;
  • the conversion module 21 is further configured to perform a Fourier transform on the voice data to obtain the energy of the voice data when the voice data does not include sampling rate information;
  • the processing module 24 is further configured to obtain, according to a preset energy threshold, the voice data whose energy is greater than the energy threshold, which gives the pure voice data including the voice segments.
  • The device further includes:
  • a decoding module 27, configured to decode the pure voice data according to the actual sampling rate.
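  • This application also provides a computer system, including:
  • one or more processors; and
  • a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the following operations:
  • performing a Fourier transform on the pure voice data to obtain frequency domain data;
  • processing the frequency domain data according to received a priori threshold data to obtain frequency band information;
  • obtaining a high-frequency cutoff frequency point of the frequency band information, and calculating, for each of several different preset sampling rates, the hypothetical frequency corresponding to that cutoff point;
  • comparing the different hypothetical frequencies with the a priori frequency, and determining, as the actual sampling rate, the sampling rate whose hypothetical frequency gives the most similar comparison result.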
  • FIG. 3 shows an exemplary architecture of the computer system, which may include a processor 32, a video display adapter 34, a disk drive 36, an input/output interface 38, a network interface 310, and a memory 312.
  • The processor 32, the video display adapter 34, the disk drive 36, the input/output interface 38, the network interface 310, and the memory 312 may be communicatively connected through a communication bus 314.
  • The processor 32 may be implemented as a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided in this application.
  • The memory 312 may be implemented as ROM (read-only memory), RAM (random access memory), a static storage device, a dynamic storage device, or the like.
  • The memory 312 may store an operating system 316 for controlling the operation of the computer system 30 and a basic input/output system (BIOS) 318 for controlling low-level operations of the computer system.
  • The memory 312 may also store a web browser 320, a data storage management system 322, and the like.
  • When the technical solution provided by the present application is implemented in software or firmware, the related program code is stored in the memory 312 and called and executed by the processor 32.
  • The input/output interface 38 is used to connect input/output modules for information input and output.
  • The input/output modules may be built into the device as components (not shown in the figure) or connected to the device externally to provide the corresponding functions.
  • Input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like; output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
  • The network interface 310 is used to connect a communication module (not shown in the figure) to enable communication between the device and other devices.
  • The communication module may communicate by wired means (such as USB or a network cable) or by wireless means (such as a mobile network, Wi-Fi, or Bluetooth).
  • The communication bus 314 provides a path for transferring information between the components of the device (for example, the processor 32, the video display adapter 34, the disk drive 36, the input/output interface 38, the network interface 310, and the memory 312).
  • The computer system may also obtain information about specific receiving conditions from a virtual resource object receiving condition information database for condition judgment, and so on.
  • Although the device above shows only the processor 32, the video display adapter 34, the disk drive 36, the input/output interface 38, the network interface 310, the memory 312, the communication bus 314, and so on, in a specific implementation the device may also include other components necessary for normal operation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present invention relate to a method, device and system for recognizing the sampling rate of pure voice data. The method comprises: performing a Fourier transform on pure voice data to obtain frequency domain data; processing the frequency domain data according to received a priori threshold data to obtain frequency band information; acquiring a high-frequency cutoff frequency point of the frequency band information, and calculating, according to different preset sampling rates, hypothetical frequencies corresponding to the high-frequency cutoff frequency point; and comparing the different hypothetical frequencies with an a priori frequency, and determining the sampling rate corresponding to the hypothetical frequency whose comparison result is the most similar as the actual sampling rate. In the present invention, based on the a priori characteristic that the frequency-domain bandwidth of the human voice lies between 200 Hz and 4000 Hz, different hypothetical frequencies of the pure voice data are compared, and the actual sampling rate can be determined from the similarity of the comparison results. The sampling rate of pure voice data can thus be predicted automatically, which avoids problems such as a large impact on voice processing when the sampling rate is unknown.
PCT/CN2020/097008 2020-03-10 2020-06-19 Method, device and system for recognizing the sampling rate of pure voice data WO2021179470A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3175103A CA3175103A1 (fr) 2020-03-10 2020-06-19 Method, device and system for recognizing the sampling rate of pure voice data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010160577.4A CN111354365B (zh) 2020-03-10 Pure voice data sampling rate recognition method, device and system
CN202010160577.4 2020-03-10

Publications (1)

Publication Number Publication Date
WO2021179470A1 (fr) 2021-09-16

Family

ID=71196071

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097008 WO2021179470A1 (fr) 2020-03-10 2020-06-19 Method, device and system for recognizing the sampling rate of pure voice data

Country Status (3)

Country Link
CN (1) CN111354365B (fr)
CA (1) CA3175103A1 (fr)
WO (1) WO2021179470A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113447713B (zh) * 2021-06-25 2023-03-07 南京丰道电力科技有限公司 Fast, high-precision Fourier-based power system frequency measurement method and device

Citations (4)

Publication number Priority date Publication date Assignee Title
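CN101320560A (zh) * 2008-07-01 2008-12-10 上海大学 Method for improving the recognition rate of a speech recognition system by applying sampling rate conversion
CN103745726A (zh) * 2013-11-07 2014-04-23 中国电子科技集团公司第四十一研究所 Adaptive variable-sampling-rate audio sampling method
CN105513590A (zh) * 2015-11-23 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN109328383A (zh) * 2016-06-27 2019-02-12 高通股份有限公司 Audio decoding using an intermediate sampling rate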

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US7046857B2 (en) * 1997-07-31 2006-05-16 The Regents Of The University Of California Apparatus and methods for image and signal processing
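CN101582264A (zh) * 2009-06-12 2009-11-18 瑞声声学科技(深圳)有限公司 Speech enhancement method and sound collection system with speech enhancement
JP2012002858A (ja) * 2010-06-14 2012-01-05 Pioneer Electronic Corp Time scaling method, pitch shift method, audio data processing device, and program
CN102332266B (zh) * 2010-07-13 2013-04-24 炬力集成电路设计有限公司 Audio data encoding method and device
CN105247613B (zh) * 2013-04-05 2019-01-18 杜比国际公司 Audio processing system
CN107833581B (zh) * 2017-10-20 2021-04-13 广州酷狗计算机科技有限公司 Method, device and readable storage medium for extracting the pitch frequency of a sound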


Also Published As

Publication number Publication date
CN111354365B (zh) 2023-10-31
CA3175103A1 (fr) 2021-09-16
CN111354365A (zh) 2020-06-30

Similar Documents

Publication Publication Date Title
JP7210634B2 (ja) Detection and suppression of voice queries
US9412371B2 (en) Visualization interface of continuous waveform multi-speaker identification
US10339956B2 (en) Method and apparatus for detecting audio signal according to frequency domain energy
WO2020181824A1 (fr) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
US9258425B2 (en) Method and system for speaker verification
WO2019134247A1 (fr) Voiceprint registration method based on a voiceprint recognition model, terminal device, and storage medium
US20140358264A1 (en) Audio playback method, apparatus and system
WO2020198354A1 (fr) Detection of calls from voice assistants
WO2021042537A1 (fr) Voice recognition authentication method and system
WO2021000498A1 (fr) Composite speech recognition method, device and apparatus, and computer-readable storage medium
CN107580155B (zh) VoIP call quality determination method and apparatus, computer device, and storage medium
US20060100866A1 (en) Influencing automatic speech recognition signal-to-noise levels
WO2014194641A1 (fr) Audio playback method, apparatus and system
CN111343660B (zh) Application program testing method and device
WO2021051566A1 (fr) Machine-synthesized speech recognition method, apparatus, electronic device, and storage medium
CN111916109A (zh) Feature-based audio classification method, apparatus and computing device
WO2021179470A1 (fr) Method, device and system for recognizing the sampling rate of pure voice data
US20100172479A1 (en) Dynamically improving performance of an interactive voice response (ivr) system using a complex events processor (cep)
US11146607B1 (en) Smart noise cancellation
WO2018032760A1 (fr) Voice information processing method and apparatus
KR20170010978A (ko) Method and apparatus for preventing voice phishing through analysis of call content patterns
CN113271386B (zh) Howling detection method and apparatus, storage medium, and electronic device
CN113782036A (zh) Audio quality assessment method, apparatus, electronic device, and storage medium
WO2021143095A1 (fr) Dial testing method and apparatus, computer device, and recording medium
WO2020186695A1 (fr) Method and apparatus for batch processing of voice information, computer device, and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20924735

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3175103

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20924735

Country of ref document: EP

Kind code of ref document: A1