CN108475512A - Audio playing method, system and device - Google Patents

Audio playing method, system and device

Info

Publication number
CN108475512A
Authority
CN
China
Prior art keywords
audio
instruction
frame
noise
input module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201780001736.2A
Other languages
Chinese (zh)
Other versions
CN108475512B (en)
Inventor
朱华明
武巍
宁洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jin Rui De Lu Technology Co ltd
Original Assignee
Beijing Jin Rui De Lu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jin Rui De Lu Technology Co ltd filed Critical Beijing Jin Rui De Lu Technology Co ltd
Publication of CN108475512A publication Critical patent/CN108475512A/en
Application granted granted Critical
Publication of CN108475512B publication Critical patent/CN108475512B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C7/00 - Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/16 - Storage of analogue signals in digital stores using an arrangement comprising analogue/digital [A/D] converters, digital memories and digital/analogue [D/A] converters
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/10 - Earpieces; Attachments therefor; Earphones; Monophonic headphones
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention provides an audio playing method, system and device, wherein the method comprises starting an input module and further comprises the following steps: acquiring an operation instruction through the input module; parsing the operation instruction and generating a control instruction; executing the control instruction to obtain an audio file; and playing the audio file. The invention controls audio playback through voice-input instructions, can be applied to various smart devices and wearable devices, frees both hands, and improves the user's application experience.

Description

Audio playing method, system and device

Technical Field
The present invention relates to the technical field of audio processing and playing, and in particular, to an audio playing method, system and device.
Background
People have traditionally used earphones to listen to music. Most earphones are wired to a player that provides the audio, while newer Bluetooth earphones connect to the player wirelessly. Although player types have changed continuously, from early tape players and CD players to later smart terminals such as iPod players, smart phones, tablet computers and PCs that have been popular for many years, the player has always played the roles of storing and outputting audio signals and, in most cases, of receiving user operations and controlling playback. Viewed from the present day, such a combination is very inconvenient. With the development of smart wearable devices and the continuous improvement of living standards, smart wearable devices such as smart watches are used more and more widely and have become indispensable communication tools in daily life.
Most existing wearable devices still have to be operated manually before music can be played. How to obtain a simple, efficient and highly operable experience while tying up the hands as little as possible is a problem that wearable devices urgently need to solve.
Patent application CN105097001A discloses an audio playing method and apparatus, wherein the method includes: collecting external sound signals and recognizing the collected sound signals; when it is determined from the recognition result that a collected sound signal corresponds to an audio playing control command, executing, for the audio files stored in advance in the audio playing device, the audio playing operation of the corresponding audio file according to that command; and, when outputting an audio signal based on the audio playing operation, generating mechanical vibration of the corresponding frequency according to the audio signal. Although a bone conduction module can be used there, it is applied only to audio playing, and the collected sound instructions are not processed, so the obtained audio instructions are unclear and correct control instructions cannot be derived from them. Moreover, that application can only play audio files stored in local memory and cannot acquire further audio files over the network.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides an audio playing method, system and device, which make the sound signal clearer and more accurate by deeply processing the input sound signal, and which enable audio files to be played anytime and anywhere by downloading them from a cloud server over a network.
A first aspect of the present invention provides an audio playing method which comprises starting an input module and includes the following steps:
step 1: acquiring an operation instruction through the input module;
step 2: analyzing the operation instruction and generating a control instruction;
step 3: executing the control instruction to obtain an audio file;
step 4: playing the audio file.
Preferably, the input module comprises at least one of an audio input module, a text input module and a gesture input module; the audio input module receives an audio instruction signal and generates a valid audio instruction signal; the text input module generates a text instruction signal; and the gesture input module generates a gesture instruction signal.
In any of the above aspects, it is preferable that the audio input module includes at least one microphone and at least one bone conduction microphone.
In any of the above aspects, it is preferable that the audio instruction signal includes a first audio signal and a second audio signal.
In any of the above aspects, it is preferable that the first audio signal refers to the mechanical wave, generated by vibration of the user's body, that is collected by the bone conduction microphone.
In any of the above aspects, preferably, the second audio signal refers to the sound wave, generated in the same time range as the mechanical wave, that is collected by the microphone.
In any of the above solutions, preferably, the method for obtaining the operation instruction through the audio input module includes the following sub-steps:
step 11: carrying out audio characteristic detection on the collected audio instruction signal;
step 12: judging a main sound source;
step 13: eliminating noise;
step 14: outputting the valid audio instruction signal.
In any of the above aspects, preferably, the audio characteristic detection includes at least one of voice detection, noise detection, and correlation feature extraction.
In any of the above schemes, preferably, the audio characteristic detection extracts audio data x_i(n) with a frame length of T ms each time and calculates the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k) of the frame.
In any of the above schemes, preferably, the audio characteristic detection method further comprises calculating, from the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k), the non-silence probability and the speech probability of the current frame, wherein the normalizing constants are empirical reference values for channel i: one for max(E_i * ZCR_i) and one for max{max[R_i(k)] * max[C_ij(k)]}.
In any of the above schemes, preferably, the audio characteristic detection method further comprises judging, from the non-silence probability and the speech probability of the current frame on channel i, the type of the current frame, i.e. whether it is a noise frame, a speech frame or a noise-free ambient sound frame, wherein the decision thresholds are empirical values; Ambient denotes a noise-free ambient sound frame, Noise denotes a noise frame and Speech denotes a speech frame.
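The published text gives this three-way decision only in terms of empirical judgment values, so the sketch below fixes arbitrary thresholds purely for illustration:

```python
def classify_frame(p_ns, p_sp, t_ns=0.3, t_sp=0.5):
    # t_ns and t_sp stand in for the patent's empirical judgment values,
    # which are not stated numerically in the text.
    if p_ns < t_ns:
        return "Noise"    # too little non-silence evidence: noise frame
    if p_sp >= t_sp:
        return "Speech"   # strong periodicity/cross-correlation: speech frame
    return "Ambient"      # non-silent but not speech: noise-free ambient sound frame
```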
In any of the above schemes, preferably, step 12 determines the main data path according to the principle of judging the main sound source.
In any of the above schemes, preferably, the principle of judging the main sound source includes the following rules (sketched in code after this list):
1) when one path is Speech and the other path is Ambient or Noise, that path is determined as the main data path of the current position frame;
2) when one path is Ambient and the other path is Noise, that path is determined as the main data path of the current position frame;
3) when both paths carry the same kind of frame, the path with the larger probability value is determined as the main data path of the current position frame.
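These three rules reduce to a comparison over a frame-type ranking with a probability tie-break; in the sketch below the tie-break value (score) stands in for the unnamed quantity compared in rule 3 and is an assumption:

```python
RANK = {"Speech": 2, "Ambient": 1, "Noise": 0}  # rule order: Speech > Ambient > Noise

def main_data_path(type_1, score_1, type_2, score_2):
    # Rules 1 and 2: the higher-ranked frame type wins outright.
    if RANK[type_1] != RANK[type_2]:
        return 1 if RANK[type_1] > RANK[type_2] else 2
    # Rule 3: same frame type on both paths, so the larger value wins.
    return 1 if score_1 >= score_2 else 2
```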
In any of the above schemes, preferably, in step 13, the noise spectrum characteristics are obtained from the Noise frames adjacent to a Speech frame on the main data path, and the noise spectrum components of that Speech frame are suppressed in the frequency domain.
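The text specifies only that a noise spectrum estimated from adjacent Noise frames is used to suppress the Speech frame's noise components in the frequency domain; spectral subtraction, shown below, is one conventional realization and is assumed here rather than taken from the patent:

```python
import numpy as np

def suppress_noise(speech_frame, noise_frames, floor=0.05):
    # Estimate the noise magnitude spectrum from Noise frames adjacent
    # to this Speech frame on the main data path.
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)
    S = np.fft.rfft(speech_frame)
    # Subtract the noise spectrum; the spectral floor limits musical noise.
    mag = np.maximum(np.abs(S) - noise_mag, floor * np.abs(S))
    return np.fft.irfft(mag * np.exp(1j * np.angle(S)), n=len(speech_frame))
```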
In any of the above aspects, preferably, the operation instruction includes at least one of the valid audio instruction signal, the text instruction signal, or the gesture instruction signal.
In any of the above schemes, preferably, the control instruction includes at least one of a search instruction, a filtering instruction, a caching instruction, a downloading instruction, a storing instruction, and a playing instruction.
In any of the above schemes, preferably, the search instruction searches preferentially in the local storage and, if the file is not found there, searches the cloud through the communication component.
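A sketch of this local-first search rule; local_store and cloud_client are hypothetical interfaces standing in for the local storage and the communication component:

```python
def search_audio(query, local_store, cloud_client):
    hit = local_store.find(query)      # search preferentially in local storage
    if hit is not None:
        return hit
    # Not found locally: search the cloud through the communication
    # component (e.g. wifi or 2G/3G/4G/5G).
    return cloud_client.search(query)
```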
In any of the above aspects, preferably, the communication component includes at least one of wifi, wireless, 2G/3G/4G/5G, and GPRS.
In any of the above schemes, preferably, the obtaining of the audio file refers to executing the caching instruction or the downloading instruction, and obtaining the audio file from a cloud through the communication component.
In any of the above schemes, preferably, the play instruction refers to playing the cached audio file or the audio file in the local storage through a playback device.
A second aspect of the invention discloses a sound collection system, which comprises an input module and the following modules:
an operation instruction acquisition module: used for acquiring an operation instruction through the input module;
an operation instruction parsing module: used for parsing the operation instruction and generating a control instruction;
an audio file acquisition module: used for executing the control instruction to acquire an audio file;
an audio file playing module: used for pushing the valid audio data of the audio file to the terminal device.
Preferably, the input module comprises at least one of an audio input module, a text input module and a gesture input module; the audio input module receives an audio instruction signal and generates a valid audio instruction signal; the text input module generates a text instruction signal; and the gesture input module generates a gesture instruction signal.
In any of the above aspects, it is preferable that the audio input module includes at least one bone conduction microphone and at least one microphone.
In any of the above aspects, it is preferable that the audio instruction signal includes a first audio signal and a second audio signal.
In any of the above aspects, it is preferable that the first audio signal refers to the mechanical wave, generated by vibration of the user's body, that is collected by the bone conduction microphone.
In any of the above aspects, preferably, the second audio signal refers to the sound wave, generated in the same time range as the mechanical wave, that is collected by the microphone.
In any of the above schemes, preferably, the operation instruction obtaining module further includes the following sub-modules:
an audio characteristic detection submodule: used for performing audio characteristic detection on the collected audio instruction signal;
a main sound source judgment submodule: used for judging the main sound source;
a noise reduction submodule: used for eliminating noise;
an audio instruction output submodule: used for outputting the valid audio instruction signal.
In any of the above aspects, preferably, the audio characteristic detection includes at least one of voice detection, noise detection, and correlation feature extraction.
In any of the above schemes, preferably, the audio characteristic detection extracts audio data x_i(n) with a frame length of T ms each time and calculates the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k) of the frame.
In any of the above schemes, preferably, the audio characteristic detection method further comprises calculating, from the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k), the non-silence probability and the speech probability of the current frame, wherein the normalizing constants are empirical reference values for channel i: one for max(E_i * ZCR_i) and one for max{max[R_i(k)] * max[C_ij(k)]}.
In any of the above schemes, preferably, the audio characteristic detection method further comprises judging, from the non-silence probability and the speech probability of the current frame on channel i, the type of the current frame, i.e. whether it is a noise frame, a speech frame or a noise-free ambient sound frame, wherein the decision thresholds are empirical values; Ambient denotes a noise-free ambient sound frame, Noise denotes a noise frame and Speech denotes a speech frame.
In any of the above aspects, preferably, the main sound source judgment submodule determines the main data path according to the principle of judging the main sound source.
In any of the above schemes, preferably, the principle of judging the main sound source includes the following rules:
1) when one path is Speech and the other path is Ambient or Noise, that path is determined as the main data path of the current position frame;
2) when one path is Ambient and the other path is Noise, that path is determined as the main data path of the current position frame;
3) when both paths carry the same kind of frame, the path with the larger probability value is determined as the main data path of the current position frame.
In any of the above schemes, preferably, the noise reduction submodule obtains the noise spectrum characteristics from the Noise frames adjacent to a Speech frame on the main data path and effectively suppresses the noise spectrum components of that Speech frame in the frequency domain, yielding relatively pure speech data.
In any of the above aspects, preferably, the operation instruction includes at least one of the valid audio instruction signal, the text instruction signal, or the gesture instruction signal.
In any of the above schemes, preferably, the control instruction includes at least one of a search instruction, a filtering instruction, a caching instruction, a downloading instruction, a storing instruction, and a playing instruction.
In any of the above schemes, preferably, the search instruction searches preferentially in the local storage and, if the file is not found there, searches the cloud through the communication component.
In any of the above aspects, preferably, the communication component includes at least one of wifi, wireless, 2G/3G/4G/5G, and GPRS.
In any of the above schemes, preferably, the obtaining of the audio file refers to executing the caching instruction or the downloading instruction and obtaining the audio file from the cloud through the communication component.
In any of the above schemes, preferably, the play instruction refers to playing the cached audio file or the audio file in the local storage through a playback device.
A third aspect of the invention discloses a sound collection device comprising a housing and further comprising the system of any of the above.
Preferably, the sound collection device is fixedly installed on the intelligent device.
In any one of the above aspects, preferably, the smart device includes: at least one of a smart phone, a smart camera, a smart headset, and other smart devices.
By processing the audio signal, the invention realizes high-definition voice instruction input, frees both hands, and makes wearable devices more convenient to use and closer to people's usage habits.
Drawings
Fig. 1 is a flowchart of a preferred embodiment of an audio playing method according to the present invention.
Fig. 2 is a block diagram of an audio playback system according to a preferred embodiment of the present invention.
Fig. 3 is a schematic cross-sectional view of an embodiment of a bone conduction microphone of an audio playing device according to the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of a smart headset of an audio playing device according to the present invention.
Fig. 5 is a flowchart of an embodiment of a noise reduction method of an audio playing method according to the present invention.
Fig. 6 is a flowchart illustrating an embodiment of a dialect identifying module initializing method for an audio playing method according to the present invention.
Fig. 7 is a flowchart of an embodiment of a dialect identifying method of an audio playing method according to the present invention.
Detailed Description
The invention is further illustrated with reference to the figures and the specific examples.
Example one
As shown in fig. 1 and fig. 2, step 100 is executed to start the input module 200 (comprising the audio input module 201, the handwriting input module 202 and the keyboard input module 203). Step 110 is executed to determine the input module type. If the input module is the audio input module 201 (comprising a bone conduction microphone and a microphone), step 120 is executed: the audio characteristic detection submodule 211 performs audio characteristic detection (including voice detection, noise detection and correlation feature extraction) on the input audio signals (the audio signal collected from the microphone and the audio signal collected from the bone conduction microphone). The steps of audio characteristic detection are as follows: 1) extract audio data x_i(n) with a frame length of 20 ms and calculate the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k); 2) from the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k), calculate the non-silence probability and the speech probability of the current frame, wherein the normalizing constants are empirical reference values for channel i, one for max(E_i * ZCR_i) and one for max{max[R_i(k)] * max[C_ij(k)]}; 3) from the non-silence probability and the speech probability of the current frame on channel i, judge the type of the current frame, i.e. whether it is a noise frame, a speech frame or a noise-free ambient sound frame, using empirical decision values, where Ambient denotes a noise-free ambient sound frame, Noise denotes a noise frame and Speech denotes a speech frame. In step 121, the main sound source judgment submodule 212 determines from which path the current frame is taken as the main data path of the current position frame. The judgment rules are: 1) when one path is a Speech frame and the other is an Ambient frame or a Noise frame, that path is determined as the main data path of the current position frame; 2) when one path is an Ambient frame and the other is a Noise frame, that path is determined as the main data path of the current position frame; 3) when both paths carry the same kind of frame, the path with the larger probability value is determined as the main data path of the current position frame. Step 122 is executed: since the selected main sound source still contains a small amount of noise, the noise reduction submodule 213 derives the noise spectrum characteristics from the Noise frames adjacent to each Speech frame on the main data path and suppresses the noise spectrum components of that Speech frame in the frequency domain. Step 123 is executed to output the voice operation instruction.
When the input module type is text input, step 130 is executed to determine the text input type. If the input is handwritten, step 131 is executed: the handwritten character determination submodule 215 determines the character type of the handwritten input and recognizes characters and numbers. In step 132, the handwritten character error correction submodule 216 intelligently corrects erroneous characters based on the characters and numbers obtained from the handwritten character determination submodule 215 to obtain a relatively accurate text instruction, and in step 133 a text operation instruction is output. If the input is keyboard input, step 132 is executed: the keyboard character confirmation submodule 218 confirms the input characters and performs intelligent error correction to obtain a relatively accurate text instruction, and step 133 is executed to output a text operation instruction.
In step 140, the operation instruction parsing module 220 parses the obtained voice operation instruction or text operation instruction and generates a control instruction (including a search instruction, a filtering instruction, a caching instruction, a downloading instruction, a storage instruction, a playing instruction, etc.). In step 150, the audio file acquisition module 230 executes the control instruction; the instruction is preferentially executed against the local storage module 231, and when the local storage module 231 cannot fulfil it, the audio file is downloaded through the network module 232. In step 160, the audio file playing module 240 plays the audio file through the audio output device.
Example two
As shown in fig. 3, the housing is denoted by 301, the vibration collector by 302, the pressure sensor by 303, the signal processor by 304, the vibration cavity by 305, the lead by 306, the circuit board by 307, the base by 308, and the signal collection portion by 309.
The bone conduction microphone shown in fig. 3 comprises a housing 301, a vibration collector 302, a pressure sensor 303, a signal processor 304, a lead 306 and a circuit board 307, wherein the housing 301 and the vibration collector 302 are connected to form a closed space. The circuit board 307 is disposed at the bottom of the housing 301 within the closed space, and the signal processor 304 is disposed on and electrically connected to the circuit board 307. The pressure sensor 303 is disposed between the circuit board 307 and the vibration collector 302 within the closed space and is fixedly connected to the housing 301. The pressure sensor 303 is electrically connected to the circuit board 307 by the lead 306. The housing 301 is at least partially made of an elastic material.
The pressure sensor 303 has a downwardly convex curved surface. A non-planar pressure sensor, especially one with a curved surface, perceives sound source vibrations more sensitively, which benefits sound collection.
The pressure sensor end 312 is connected to a connection portion provided on the housing 301. The connection portion is a recess into which the pressure sensor end 312 snaps. Preferably, a sealant is applied between the connection portion and the pressure sensor end 312 to improve the sealing of the vibration cavity 305 and to reduce or avoid sound loss caused by air leakage.
The vibration collector 302 comprises a signal collection portion 309 and a first connection portion 310, and the housing 301 comprises a second connection portion 311. As shown in fig. 3, the first connection portion 310 and the second connection portion 311 are fixedly connected and sealed into a whole by a sealant. The fixed connection is a snap connection: the first connection portion 310 is a concave portion and the second connection portion 311 is a convex portion, or the first connection portion 310 is a convex portion and the second connection portion 311 is a concave portion, the concave portion snapping onto the convex portion.
The vibration collector 302 is made of an elastic material. The signal collection portion 309 consists of a plurality of upwardly protruding bumps. The bumps are connected as a whole, are thin-walled arc surfaces, and are distributed over the surface of the vibration collector 302. A sealed cavity, the vibration cavity 305, is formed between the vibration collector 302 and at least the pressure sensor 303.
In this embodiment, a base 308 is further included, and the base 308 is integrally connected to the housing 301. The circuit board 307 is disposed on the base 308. The signal processor 304 is disposed on the circuit board 307. The pressure sensor 303 is connected to a circuit board 307 by a wire 306.
Example three
As shown in fig. 4, a headset 400 incorporating the sound collection system comprises a left earphone 410 and a right earphone 430. The core components of the sound collection system are concentrated in the left earphone 410, including a 3G/4G network module referenced 420, wifi/Bluetooth referenced 421, an LCD display/touch screen referenced 422, an acceleration sensor/gyroscope referenced 423, a GPS referenced 424, a bone conduction microphone (left) referenced 425, a speaker (left) referenced 426, audio signal processing (DAC) referenced 427, local data storage referenced 428 and a CPU referenced 429. The 3G/4G network module, the wifi/Bluetooth module, the LCD display/touch screen, the acceleration sensor/gyroscope, the GPS, the audio signal processing (DAC) and the local data storage are each connected to the CPU, and the bone conduction microphone (left) and the speaker (left) are connected to the audio signal processing (DAC).
Some auxiliary components are integrated in the right earphone 430, including a speaker (right) referenced 440, sensors referenced 441 and 443, a touch pad for music control referenced 442, a bone conduction microphone (right) referenced 444, and a battery referenced 445. The speaker (right), the sensors, the touch pad and the battery are each connected to the CPU in the left earphone, and the bone conduction microphone (right) is connected to the speaker (right).
Example four
As shown in fig. 5, step 500 is executed to import the main audio data. Step 510 is executed to retrieve the environment determination data stored in the memory. Step 520 is executed to compare the main audio data with the environment determination data and determine the ambient noise environment at the time the main audio was input. Steps 530 and 540 are executed in sequence: the ambient noise data are retrieved from the memory and compared with the main audio data frame by frame. Step 550 is executed to remove, within each frame of the main audio data, the audio data identical to the ambient noise data. Step 560 is executed to generate valid, noise-free audio data.
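Steps 530 to 550 amount to a per-frame comparison against the stored ambient-noise data; the sketch below assumes the comparison is made between magnitude spectra, since the text describes it only as removing "the same" audio data:

```python
import numpy as np

def remove_ambient_noise(main_frames, noise_frame, tol=0.1):
    # noise_frame: the stored ambient-noise data selected in step 520
    # (assumed to have the same length as each main-audio frame).
    noise_mag = np.abs(np.fft.rfft(noise_frame))
    cleaned = []
    for frame in main_frames:                      # steps 530/540: per-frame comparison
        F = np.fft.rfft(frame)
        # Step 550: drop spectral bins whose magnitude matches the noise.
        keep = np.abs(np.abs(F) - noise_mag) > tol * (noise_mag + 1e-12)
        cleaned.append(np.fft.irfft(F * keep, n=len(frame)))
    return cleaned                                 # step 560: valid, noise-free audio
```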
Example five
The audio playing system further comprises a dialect recognition module for recognizing dialect speech collected by the audio input module.
As shown in fig. 6, steps 600 and 610 are executed in sequence: the dialect recognition module starts the initialization process, and the user inputs the corresponding speech according to the prompts. Step 620 is executed: according to the input speech, the system connects to the cloud server through the network module and checks whether the dialect is stored in an existing dialect library. If it is already stored in an existing dialect library, step 630 is executed to retrieve and download that dialect library. Step 640 is executed: the user inputs the corresponding speech according to the prompts, the speech is compared against the dialect library downloaded to local memory for error correction, and the dialect library is fine-tuned to the user's own habits. Step 650 is executed to save it in local memory.
If the dialect does not exist in any existing dialect library, step 621 is executed: the speech is input through the audio input module, and the corresponding words are input through the handwriting input module or the keyboard input module. Step 622 is executed: after all commonly used words have been input and proofread, they are stored in local memory. Step 623 is executed to upload them to the dialect repository of the cloud server.
Example six
As shown in fig. 7, step 600 is executed to input speech through the voice input module. Step 610 is executed to determine whether the dialect corresponding to the speech is stored in local storage. If the dialect is stored in local memory, steps 620 and 650 are executed in sequence: the dialect library in local memory is retrieved and the dialect comparison is performed. Step 660 is executed to generate a control instruction according to the dialect comparison result.
If the dialect is not stored in local storage, step 630 is executed: a dialect search and comparison is performed in the cloud server to determine a suitable dialect library. Steps 640 and 650 are executed in sequence: the corresponding dialect library is downloaded through the network module and the dialect comparison is performed. Step 660 is executed to generate a control instruction according to the dialect comparison result.
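Taken together, Examples five and six describe a local-first dialect-library lookup with a cloud fallback; the following sketch uses hypothetical interfaces, since the patent names only the modules involved:

```python
def dialect_comparison(voice, local_store, cloud_client):
    lib = local_store.get_dialect_library(voice)      # step 610: check local storage
    if lib is None:
        # Steps 630/640: search the cloud server and download a matching library.
        lib = cloud_client.find_dialect_library(voice)
        local_store.save(lib)                         # step 650: cache it locally
    # Step 660 generates the control instruction from this comparison result.
    return lib.compare(voice)
```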
For a better understanding of the present invention, the foregoing detailed description has been given in conjunction with specific embodiments thereof, but not with the intention of limiting the invention thereto. Any simple modifications of the above embodiments according to the technical essence of the present invention still fall within the scope of the technical solution of the present invention. In the present specification, each embodiment is described with emphasis on differences from other embodiments, and the same or similar parts between the respective embodiments may be referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The method, apparatus and system of the present invention may be implemented in a number of ways. For example, the methods and systems of the present invention may be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (43)

  1. An audio playing method comprises an input module and is characterized by further comprising the following steps:
    step 1: acquiring an operation instruction through the input module;
    step 2: analyzing the operation instruction and generating a control instruction;
    step 3: executing the control instruction to obtain an audio file;
    step 4: playing the audio file.
  2. The method of claim 1, wherein: the input module comprises at least one of an audio input module, a text input module and a gesture input module; the audio input module receives an audio instruction signal and generates a valid audio instruction signal; the text input module generates a text instruction signal; and the gesture input module generates a gesture instruction signal.
  3. The method of claim 2, wherein: the audio input module includes at least one bone conduction microphone and at least one microphone.
  4. The method of claim 3, wherein: the audio instruction signal includes a first audio signal and a second audio signal.
  5. The method of claim 4, wherein: the first audio signal refers to the mechanical wave, generated by vibration of the user's body, that is collected by the bone conduction microphone.
  6. The method of claim 5, wherein: the second audio signal refers to the sound wave, generated in the same time range as the mechanical wave, that is collected by the microphone.
  7. The method of claim 6, wherein: the method for acquiring the operation instruction through the audio input module comprises the following substeps:
    step 11: carrying out audio characteristic detection on the collected audio instruction signal;
    step 12: judging a main sound source;
    step 13: eliminating noise;
    step 14: outputting the valid audio instruction signal.
  8. The method of claim 7, wherein: the audio characteristic detection includes at least one of voice detection, noise detection, and correlation feature extraction.
  9. The method of claim 8, wherein: the audio characteristic detection comprises extracting audio data x_i(n) with a frame length of T ms each time and calculating the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k).
  10. The method of claim 9, wherein: the audio characteristic detection further comprises calculating, from the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k), the non-silence probability and the speech probability of the current frame, wherein the normalizing constants are empirical reference values for channel i: one for max(E_i * ZCR_i) and one for max{max[R_i(k)] * max[C_ij(k)]}.
  11. The method of claim 10, wherein: the audio characteristic detection further comprises judging, from the non-silence probability and the speech probability of the current frame on channel i, the type of the current frame, i.e. whether it is a noise frame, a speech frame or a noise-free ambient sound frame, wherein the decision thresholds are empirical values; Ambient denotes a noise-free ambient sound frame, Noise denotes a noise frame and Speech denotes a speech frame.
  12. The method of claim 11, wherein: the step 12 is to determine the main data path according to the principle of determining the main sound source.
  13. The method of claim 12, wherein: the principle of determining the dominant sound source comprises the following steps:
    1) when one path is Speech and the other path is Ambient or Noise, determining the path as the main data path of the current position frame;
    2) when one path is Ambient and the other path is Noise, determining the path as the main data path of the current position frame;
    3) when both paths carry the same kind of frame, the path with the larger probability value is determined as the main data path of the current position frame.
  14. The method of claim 13, wherein: step 13 obtains the noise spectrum characteristics from the Noise frames adjacent to a Speech frame on the main data path and suppresses the noise spectrum components of that Speech frame in the frequency domain.
  15. The method of claim 14, wherein: the operation instruction comprises at least one of the effective audio instruction signal, the text instruction signal or the gesture instruction signal.
  16. The method of claim 1, wherein: the control instruction comprises at least one of a search instruction, a screening instruction, a caching instruction, a downloading instruction, a storing instruction and a playing instruction.
  17. The method of claim 16, wherein: the search instruction searches preferentially in the local storage and, if the file is not found there, searches the cloud through the communication component.
  18. The method of claim 17, wherein: the communication component comprises at least one of wifi, wireless, 2G/3G/4G/5G and GPRS.
  19. The method of claim 18, wherein: obtaining the audio file means executing the caching instruction or the downloading instruction and obtaining the audio file from the cloud through the communication component.
  20. The method of claim 19, wherein: the playing instruction refers to playing the cached audio file or the audio file in the local memory through the playback equipment.
  21. A sound collection system, comprising an input module, characterized by further comprising the following modules:
    an operation instruction acquisition module: acquiring an operation instruction through the input module;
    the operation instruction analysis module: analyzing the operation instruction and generating a control instruction;
    an audio file acquisition module: the control instruction is executed to acquire an audio file;
    the audio file playing module: the system is used for pushing the valid audio data of the audio file to the terminal equipment.
  22. The sound collection system of claim 21, wherein: the input module comprises at least one of an audio input module, a text input module and a gesture input module; the audio input module receives an audio instruction signal and generates a valid audio instruction signal; the text input module generates a text instruction signal; and the gesture input module generates a gesture instruction signal.
  23. The sound collection system of claim 22, wherein: the audio input module includes at least one bone conduction microphone and at least one microphone.
  24. The sound collection system of claim 23, wherein: the audio instruction signal includes a first audio signal and a second audio signal.
  25. The sound collection system of claim 24, wherein: the first audio signal refers to the mechanical wave, generated by vibration of the user's body, that is collected by the bone conduction microphone.
  26. The sound collection system of claim 25, wherein: the second audio signal refers to the sound wave, generated in the same time range as the mechanical wave, that is collected by the microphone.
  27. The sound collection system of claim 26, wherein: the operation instruction acquisition module further comprises the following sub-modules:
    the audio characteristic detection submodule: the audio characteristic detection device is used for detecting the audio characteristics of the acquired audio signals;
    a master sound source judgment submodule: used for judging the main sound source;
    a noise reduction submodule: for eliminating noise;
    the audio instruction output submodule: for outputting the valid audio instruction signal.
  28. The sound collection system of claim 27, wherein: the audio characteristic detection includes at least one of voice detection, noise detection, and correlation feature extraction.
  29. The sound collection system of claim 28, wherein: the audio characteristic detection comprises extracting audio data x_i(n) with a frame length of T ms each time and calculating the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k).
  30. The sound collection system of claim 29, wherein: the audio characteristic detection further comprises calculating, from the average energy E_i, the zero-crossing rate ZCR_i, the short-time correlation R_i(k) and the short-time cross-correlation C_ij(k), the non-silence probability and the speech probability of the current frame, wherein the normalizing constants are empirical reference values for channel i: one for max(E_i * ZCR_i) and one for max{max[R_i(k)] * max[C_ij(k)]}.
  31. The sound collection system of claim 30, wherein: the audio characteristic detection further comprises judging, from the non-silence probability and the speech probability of the current frame on channel i, the type of the current frame, i.e. whether it is a noise frame, a speech frame or a noise-free ambient sound frame, wherein the decision thresholds are empirical values; Ambient denotes a noise-free ambient sound frame, Noise denotes a noise frame and Speech denotes a speech frame.
  32. The sound collection system of claim 31, wherein: the main sound source judgment submodule determines the main data path according to the principle of judging the main sound source.
  33. The sound collection system of claim 32, wherein: the principle of determining the dominant sound source comprises the following steps:
    1) when one path is Speech and the other path is Ambient or Noise, determining the path as the main data path of the current position frame;
    2) when one path is Ambient and the other path is Noise, determining the path as the main data path of the current position frame;
    3) when both paths carry the same kind of frame, the path with the larger probability value is determined as the main data path of the current position frame.
  34. The sound collection system of claim 33, wherein: the Noise reduction submodule is used for obtaining Noise spectrum characteristics according to Noise frames which are related before and after the Speech audio frame of the main data path and suppressing Noise spectrum components of the Speech audio frame in a frequency domain.
  35. The sound collection system of claim 34, wherein: the operation instruction comprises at least one of the effective audio instruction signal, the text instruction signal or the gesture instruction signal.
  36. The sound collection system of claim 21, wherein: the control instruction comprises at least one of a search instruction, a screening instruction, a caching instruction, a downloading instruction, a storing instruction and a playing instruction.
  37. The sound collection system of claim 36, wherein: the search instruction searches preferentially in the local storage and, if the file is not found there, searches the cloud through the communication component.
  38. The sound collection system of claim 37, wherein: the communication component comprises at least one of wifi, wireless, 2G/3G/4G/5G and GPRS.
  39. The sound collection system of claim 38, wherein: obtaining the audio file means executing the caching instruction or the downloading instruction and obtaining the audio file from the cloud through the communication component.
  40. The sound collection system of claim 39, wherein: the playing instruction refers to playing the cached audio file or the audio file in the local memory through the playback equipment.
  41. A sound collection device comprising a housing, and further comprising the system of any of claims 21-40.
  42. The sound collection device of claim 41, wherein: the sound collection device is fixedly installed on the intelligent equipment.
  43. A sound collection device according to claim 42, wherein: the smart device includes: at least one of a smart phone, a smart camera, a smart headset, and other smart devices.
CN201780001736.2A 2016-11-03 2017-06-20 Audio playing method, system and device Active CN108475512B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
PCT/IB2016/001579 WO2018083511A1 (en) 2016-11-03 2016-11-03 Audio playing apparatus and method
IBPCT/IB2016/001579 2016-11-03
PCT/CN2017/089207 WO2018082315A1 (en) 2016-11-03 2017-06-20 Audio playing method, system and apparatus

Publications (2)

Publication Number Publication Date
CN108475512A (en) 2018-08-31
CN108475512B (en) 2023-06-13

Family

Family ID: 62075847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780001736.2A Active CN108475512B (en) 2016-11-03 2017-06-20 Audio playing method, system and device

Country Status (2)

Country Link
CN (1) CN108475512B (en)
WO (2) WO2018083511A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628622A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Voice interaction method and device, electronic equipment and storage medium
CN116318493A (en) * 2023-03-21 2023-06-23 浙江金华市灵声电子股份有限公司 Emergency broadcast control device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120278070A1 (en) * 2011-04-26 2012-11-01 Parrot Combined microphone and earphone audio headset having means for denoising a near speech signal, in particular for a " hands-free" telephony system
CN103208291A (en) * 2013-03-08 2013-07-17 华南理工大学 Speech enhancement method and device applicable to strong noise environments
CN104144377A (en) * 2013-05-09 2014-11-12 Dsp集团有限公司 Low power activation of voice activated device
CN104780486A (en) * 2014-01-13 2015-07-15 Dsp集团有限公司 Use of microphones with vsensors for wearable devices
CN105097001A (en) * 2014-05-13 2015-11-25 北京奇虎科技有限公司 Audio playing method and apparatus

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2448030Y (en) * 2000-08-23 2001-09-12 吴惠琪 Microphone device with hand-free receiver
US7574008B2 (en) * 2004-09-17 2009-08-11 Microsoft Corporation Method and apparatus for multi-sensory speech enhancement
CN101304619A (en) * 2007-05-11 2008-11-12 鸿富锦精密工业(深圳)有限公司 Wireless earphone and audiofrequency apparatus as well as audio frequency play method
EP2458586A1 (en) * 2010-11-24 2012-05-30 Koninklijke Philips Electronics N.V. System and method for producing an audio signal
WO2014017679A1 (en) * 2012-07-26 2014-01-30 Bang Choon Hee Earphone to be supported on ear and length adjustment knot unit capable of adjusting length of ear ring
TWM445353U (en) * 2012-08-16 2013-01-21 Sound Team Entpr Co Ltd Beanie with earphone
CN104618831A (en) * 2015-01-27 2015-05-13 深圳市百泰实业有限公司 Wireless intelligent headphone
US20160302003A1 (en) * 2015-04-08 2016-10-13 Cornell University Sensing non-speech body sounds

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120278070A1 (en) * 2011-04-26 2012-11-01 Parrot Combined microphone and earphone audio headset having means for denoising a near speech signal, in particular for a " hands-free" telephony system
CN103208291A (en) * 2013-03-08 2013-07-17 华南理工大学 Speech enhancement method and device applicable to strong noise environments
CN104144377A (en) * 2013-05-09 2014-11-12 Dsp集团有限公司 Low power activation of voice activated device
CN104780486A (en) * 2014-01-13 2015-07-15 Dsp集团有限公司 Use of microphones with vsensors for wearable devices
CN105097001A (en) * 2014-05-13 2015-11-25 北京奇虎科技有限公司 Audio playing method and apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628622A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Voice interaction method and device, electronic equipment and storage medium
CN116318493A (en) * 2023-03-21 2023-06-23 浙江金华市灵声电子股份有限公司 Emergency broadcast control device
CN116318493B (en) * 2023-03-21 2023-10-24 四川贝能达交通设备有限公司 Emergency broadcast control device

Also Published As

Publication number Publication date
WO2018082315A1 (en) 2018-05-11
WO2018083511A1 (en) 2018-05-11
CN108475512B (en) 2023-06-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant