CN110232919A

CN110232919A - Real-time voice stream extraction and speech recognition system and method

Info

Publication number: CN110232919A
Application number: CN201910533135.7A
Authority: CN
Inventors: 冀瑞国; 孙思明; 秦垠峰; 闫冰
Original assignee: Beijing Zhi He Dafang Technology Co Ltd
Current assignee: Beijing Zhi He Dafang Technology Co Ltd
Priority date: 2019-06-19
Filing date: 2019-06-19
Publication date: 2019-09-13

Abstract

An embodiment of the invention discloses a voice stream extraction and speech recognition system and method. The system comprises an audio collection device and an automatic speech recognition module, and the audio collection device is connected to the automatic speech recognition module. The audio collection device is placed on a first data line and a second data line, wherein the first data line connects a call device and a microphone, and the second data line connects the call device and a loudspeaker. The audio collection device is configured to acquire first audio sent by the microphone to the call device, and is further configured to acquire second audio sent by the call device to the loudspeaker. The audio collection device sends the first audio and the second audio to the automatic speech recognition module. After performing speech recognition on the first audio and the second audio, the automatic speech recognition module sends the recognition result to a user terminal for display. Embodiments of the invention have the following advantages: resources are saved, system integration is simple to perform, and the ASR recognition engine is available at any time.

Description

Real-time voice stream extraction and speech recognition system and method

Technical field

Embodiments of the present invention relate to the technical fields of artificial intelligence and big data analysis, and in particular to a real-time voice stream extraction and speech recognition system and method.

Background art

Automatic speech recognition (ASR) is a technology that converts human speech into text. Speech recognition is a highly interdisciplinary field, closely tied to acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science, and many other disciplines. Because speech signals are diverse and complex, a speech recognition system can achieve satisfactory performance only under certain restrictive conditions, or in other words can be applied only in certain specific settings. The performance of a speech recognition system depends roughly on the following four classes of factors: (1) the size of the recognition vocabulary and the complexity of the speech; (2) the quality of the speech signal; (3) whether there is a single speaker or multiple speakers; (4) the hardware.

An existing ASR speech recognition engine needs to be deployed on a corresponding server, which makes actual integration cumbersome, time-consuming, and laborious.

Summary of the invention

To this end, embodiments of the present invention provide a real-time voice stream extraction and speech recognition system and method, to solve the prior-art problem that an ASR speech recognition engine must be deployed on a corresponding server, which makes operation cumbersome, time-consuming, and laborious.

To achieve the above object, embodiments of the present invention provide the following technical solutions:

An embodiment of the present invention provides a voice stream extraction and speech recognition system, comprising an audio collection device and an automatic speech recognition module;

The audio collection device is connected to the automatic speech recognition module;

The audio collection device is placed on a first data line and a second data line, wherein the first data line connects a call device and a microphone, and the second data line connects the call device and a loudspeaker;

The audio collection device is configured to acquire first audio sent by the microphone to the call device, and is further configured to acquire second audio sent by the call device to the loudspeaker;

The audio collection device sends the first audio and the second audio to the automatic speech recognition module;

After performing speech recognition on the first audio and the second audio, the automatic speech recognition module sends the recognition result to a user terminal for display, wherein speech recognition engine software is embedded in the automatic speech recognition module.

Further, if no audio is generated on the first data line and/or the second data line, the audio collection device does not acquire audio;

If audio is generated on the first data line and/or the second data line, the audio collection device starts acquiring the first audio and/or the second audio.

Further, the audio collection device transcodes the acquired first audio and second audio into audio data in WAV format with an 8 kHz sample rate and a 16-bit sampling depth, and sends the transcoded audio data to the automatic speech recognition module.

Further, the automatic speech recognition module sends the recognition result to a back-end interface; the back-end interface sends the received recognition result to the user terminal for display.

Further, the system also comprises a display terminal for displaying the recognition result.

Further, the call device is at least one of the following: a fixed-line telephone base, a mobile phone, a computer, and a tablet computer.

Further, the microphone is a headset microphone; correspondingly, the loudspeaker is a headset speaker.

An embodiment of the present invention provides a voice stream extraction and speech recognition method, comprising the following steps:

Acquiring first audio sent by a microphone to a call device;

Acquiring second audio sent by the call device to a loudspeaker;

Sending the first audio and the second audio to an automatic speech recognition module;

After the automatic speech recognition module performs speech recognition on the first audio and the second audio, sending the recognition result to a user terminal for display.

An embodiment of the present invention provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the above voice stream extraction and speech recognition method.

An embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above voice stream extraction and speech recognition method.

Embodiments of the present invention provide a voice stream extraction and speech recognition system and method. The system comprises an audio collection device and an automatic speech recognition module, and the audio collection device is connected to the automatic speech recognition module. The audio collection device is placed on a first data line and a second data line, wherein the first data line connects a call device and a microphone, and the second data line connects the call device and a loudspeaker. The audio collection device is configured to acquire first audio sent by the microphone to the call device, and is further configured to acquire second audio sent by the call device to the loudspeaker. The audio collection device sends the first audio and the second audio to the automatic speech recognition module. After performing speech recognition on the first audio and the second audio, the automatic speech recognition module sends the recognition result to a user terminal for display.

Embodiments of the present invention have the following advantages: the speech recognition engine (ASR) is integrated with a specific headset and installed at the front end, so no dedicated ASR server needs to be configured. Resources are saved; system integration is simple to perform, and the ASR recognition engine is available at any time.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely exemplary; a person of ordinary skill in the art can derive other implementation drawings from the provided drawings without creative effort.

The structures, proportions, sizes, and the like depicted in this specification are intended only to accompany the content disclosed in the specification, for the understanding and reading of those skilled in the art, and are not intended to limit the conditions under which the invention can be implemented; they therefore have no substantive technical significance. Any structural modification, change in proportional relationship, or adjustment of size that does not affect the effects the invention can produce or the objects it can achieve shall still fall within the scope covered by the technical content disclosed by the invention.

Fig. 1 is a schematic diagram of the overall structure of a voice stream extraction and speech recognition system provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the audio connecting cable structure in a voice stream extraction and speech recognition system provided by an embodiment of the present invention;

Fig. 3 is a schematic overall flowchart of a voice stream extraction and speech recognition method provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described below by way of specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. Evidently, the described embodiments are some of the embodiments of the present invention, rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

To solve at least one technical problem in the prior art, an embodiment of the present invention provides a voice stream extraction and speech recognition system. As shown in Fig. 1, the voice stream extraction and speech recognition system comprises an audio collection device 11 and an automatic speech recognition module 12;

The audio collection device 11 is connected to the automatic speech recognition module 12;

The audio collection device 11 is placed on a first data line and a second data line, wherein the first data line connects a call device and a microphone, and the second data line connects the call device and a loudspeaker;

The audio collection device 11 is configured to acquire first audio sent by the microphone to the call device, and is further configured to acquire second audio sent by the call device to the loudspeaker;

The audio collection device 11 sends the first audio and the second audio to the automatic speech recognition module 12;

After performing speech recognition on the first audio and the second audio, the automatic speech recognition module 12 sends the recognition result to a user terminal for display, wherein speech recognition engine software is embedded in the automatic speech recognition module.

The automatic speech recognition module is a hardware module into which prior-art speech recognition engine software has been burned; it performs speech recognition on the first audio and the second audio and outputs the text corresponding to the speech.

It should be noted that, as shown in Fig. 2, the data line in this embodiment is a special 3.5 mm data line. The special 3.5 mm data line has two 3.5 mm plugs (P1, P2) and one 3.5 mm socket (P3). P1 connects to a device such as a computer or telephone, P2 connects to the audio collection device, and P3 connects to the headset.

It should be noted that if no audio is generated on the first data line and/or the second data line, the audio collection device does not acquire audio; if audio is generated on the first data line and/or the second data line, the audio collection device starts acquiring the first audio and/or the second audio.
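The specification does not say how the audio collection device decides that audio is present on a line. One possible reading, offered only as a minimal sketch, is an energy gate over 16-bit PCM frames. The function names `rms` and `gate_frames` and the threshold value are hypothetical and do not come from the patent.

```python
import array


def rms(frame: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit signed PCM samples."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def gate_frames(frames, threshold=500.0):
    """Yield only the frames whose level exceeds the threshold.

    Assumption: "audio is generated on the data line" is approximated by
    an RMS level above a fixed, hypothetical threshold.
    """
    for frame in frames:
        if rms(frame) > threshold:
            yield frame
```

Under this assumption, silent frames are dropped before they reach the recognizer, and acquisition resumes as soon as a frame rises above the threshold.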

Further, the audio collection device transcodes the acquired first audio and second audio into audio data in WAV format with an 8 kHz sample rate and a 16-bit sampling depth, and sends the transcoded audio data to the automatic speech recognition module.
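The target format stated above (8 kHz, 16-bit, WAV) can be illustrated with the Python standard library. The following is a sketch under stated assumptions: the input is mono 16-bit PCM at a known source rate, and resampling uses plain linear interpolation with no anti-aliasing filter, which a production transcoder would normally add. The names `resample_linear` and `transcode_to_wav` are hypothetical.

```python
import array
import io
import wave

TARGET_RATE = 8000  # 8 kHz sample rate, as stated in the description
SAMPLE_WIDTH = 2    # 16-bit sampling depth = 2 bytes per sample


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample 16-bit samples by linear interpolation (sketch only)."""
    if src_rate == dst_rate or len(samples) == 0:
        return array.array("h", samples)
    n_out = len(samples) * dst_rate // src_rate
    out = array.array("h")
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(int(round(samples[j] * (1 - frac) + nxt * frac)))
    return out


def transcode_to_wav(pcm: bytes, src_rate: int) -> bytes:
    """Return mono 8 kHz / 16-bit WAV bytes for mono 16-bit PCM input."""
    samples = array.array("h")
    samples.frombytes(pcm)
    resampled = resample_linear(samples, src_rate)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(TARGET_RATE)
        w.writeframes(resampled.tobytes())
    return buf.getvalue()
```

For example, one tenth of a second of 16 kHz input (1600 samples) becomes 800 frames in the resulting WAV data.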

Still further, the automatic speech recognition module sends the recognition result to a back-end interface; the back-end interface sends the received recognition result to the user terminal for display.
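The relay role of the back-end interface can be pictured as a small forwarding object. This is a hypothetical sketch: the patent does not define the interface's protocol or data model, so the `RecognitionResult` fields, the `BackEndInterface` class, and its method names are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RecognitionResult:
    channel: str  # "first" (microphone side) or "second" (loudspeaker side)
    text: str     # text produced by the recognizer


class BackEndInterface:
    """Hypothetical relay between the ASR module and user terminals."""

    def __init__(self) -> None:
        self._terminals: List[Callable[[RecognitionResult], None]] = []

    def register_terminal(self, show: Callable[[RecognitionResult], None]) -> None:
        """Register a callback that displays a result on a user terminal."""
        self._terminals.append(show)

    def receive(self, result: RecognitionResult) -> None:
        """Accept a result from the ASR module and forward it to every terminal."""
        for show in self._terminals:
            show(result)
```

In this sketch a terminal is just a callable, so a test can register `list.append` while a real deployment might register a function that pushes the text to a screen.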

Further, the system also comprises a display terminal for displaying the recognition result.

An embodiment of the present invention provides a voice stream extraction and speech recognition system. The system comprises an audio collection device and an automatic speech recognition module, and the audio collection device is connected to the automatic speech recognition module. The audio collection device is placed on a first data line and a second data line, wherein the first data line connects a call device and a microphone, and the second data line connects the call device and a loudspeaker. The audio collection device is configured to acquire first audio sent by the microphone to the call device, and is further configured to acquire second audio sent by the call device to the loudspeaker. The audio collection device sends the first audio and the second audio to the automatic speech recognition module. After performing speech recognition on the first audio and the second audio, the automatic speech recognition module sends the recognition result to a user terminal for display. Embodiments of the present invention have the following advantages: the speech recognition engine (ASR) is integrated with a specific headset and installed at the front end, so no dedicated ASR server needs to be configured. Resources are saved; system integration is simple to perform, and the ASR recognition engine is available at any time.

On the basis of the above embodiment of the present invention, a voice stream extraction and speech recognition system is provided in which, if no audio is generated on the first data line and/or the second data line, the audio collection device does not acquire audio.

On the basis of the above embodiment of the present invention, a voice stream extraction and speech recognition system is provided in which the audio collection device transcodes the acquired first audio and second audio into audio data in WAV format with an 8 kHz sample rate and a 16-bit sampling depth, and sends the transcoded audio data to the automatic speech recognition module.

On the basis of the above embodiment of the present invention, a voice stream extraction and speech recognition system is provided in which the automatic speech recognition module sends the recognition result to a back-end interface, and the back-end interface sends the received recognition result to the user terminal for display.

On the basis of the above embodiment of the present invention, a voice stream extraction and speech recognition system is provided in which the system further comprises a display terminal for displaying the recognition result.

On the basis of the above embodiment of the present invention, a voice stream extraction and speech recognition system is provided in which the call device is at least one of the following: a fixed-line telephone base, a mobile phone, a computer, and a tablet computer.

On the basis of the above embodiment of the present invention, a voice stream extraction and speech recognition system is provided in which the microphone is a headset microphone; correspondingly, the loudspeaker is a headset speaker.

To solve at least one technical problem in the prior art, as shown in Fig. 1, an embodiment of the present invention provides a voice stream extraction and speech recognition method, comprising the following steps:

Step S1: acquire first audio sent by the microphone to the call device.

Step S1': acquire second audio sent by the call device to the loudspeaker.

Step S2: send the first audio and the second audio to the automatic speech recognition module.

Step S3: after the automatic speech recognition module performs speech recognition on the first audio and the second audio, send the recognition result to a user terminal for display.
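Steps S1 through S3 can be sketched as a short pipeline. The sketch below is illustrative only: `run_pipeline` is a hypothetical name, and the recognizer and the display function are passed in as stand-ins because the patent does not specify the recognition engine or the terminal interface.

```python
from typing import Callable, Iterable, List, Tuple


def run_pipeline(
    mic_frames: Iterable[bytes],
    speaker_frames: Iterable[bytes],
    recognize: Callable[[bytes], str],
    display: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Run steps S1, S1', S2 and S3 once over two captured streams."""
    # S1: first audio, microphone -> call device
    first_audio = b"".join(mic_frames)
    # S1': second audio, call device -> loudspeaker
    second_audio = b"".join(speaker_frames)

    results = []
    for label, audio in (("first", first_audio), ("second", second_audio)):
        # S2: hand the audio to the (stand-in) recognition module
        text = recognize(audio)
        # S3: send the recognition result on for display
        display(label, text)
        results.append((label, text))
    return results
```

Keeping the two streams labelled separately lets the displayed text distinguish what the local speaker said from what the remote party said, which is one plausible reason for tapping both data lines.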

An embodiment of the present invention provides a voice stream extraction and speech recognition method. The method comprises: acquiring first audio sent by the microphone to the call device; acquiring second audio sent by the call device to the loudspeaker; sending the first audio and the second audio to the automatic speech recognition module; and, after the automatic speech recognition module performs speech recognition on the first audio and the second audio, sending the recognition result to a user terminal for display. Embodiments of the present invention have the following advantages: the speech recognition engine (ASR) is integrated with a specific headset and installed at the front end, so no dedicated ASR server needs to be configured. Resources are saved; system integration is simple to perform, and the ASR recognition engine is available at any time.

Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with one another via the communication bus 440. The processor 410 can call logic instructions in the memory 430 to execute the following method: acquiring first audio sent by the microphone to the call device; acquiring second audio sent by the call device to the loudspeaker; sending the first audio and the second audio to the automatic speech recognition module; and, after the automatic speech recognition module performs speech recognition on the first audio and the second audio, sending the recognition result to a user terminal for display.

In addition, when the logic instructions in the memory 430 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, an embodiment of the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program carries out the transmission method provided by the above embodiments, for example including: acquiring first audio sent by the microphone to the call device; acquiring second audio sent by the call device to the loudspeaker; sending the first audio and the second audio to the automatic speech recognition module; and, after the automatic speech recognition module performs speech recognition on the first audio and the second audio, sending the recognition result to a user terminal for display.

The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Although the present invention has been described in detail above with a general description and specific embodiments, it will be apparent to those skilled in the art that modifications or improvements can be made on the basis of the present invention. Therefore, such modifications or improvements made without departing from the spirit of the present invention fall within the scope of protection claimed by the invention.

Claims

1. A voice stream extraction and speech recognition system, characterized by comprising an audio collection device and an automatic speech recognition module;

The audio collection device is placed on a first data line and a second data line, wherein the first data line connects a call device and a microphone, and the second data line connects the call device and a loudspeaker;

The audio collection device is configured to acquire first audio sent by the microphone to the call device, and is further configured to acquire second audio sent by the call device to the loudspeaker;

After performing speech recognition on the first audio and the second audio, the automatic speech recognition module sends the recognition result to a user terminal for display, wherein speech recognition engine software is embedded in the automatic speech recognition module.

2. The voice stream extraction and speech recognition system according to claim 1, characterized in that if no audio is generated on the first data line and/or the second data line, the audio collection device does not acquire audio.

3. The voice stream extraction and speech recognition system according to claim 1, characterized in that the audio collection device transcodes the acquired first audio and second audio into audio data in WAV format with an 8 kHz sample rate and a 16-bit sampling depth, and sends the transcoded audio data to the automatic speech recognition module.

4. The voice stream extraction and speech recognition system according to claim 1, characterized in that the automatic speech recognition module sends the recognition result to a back-end interface; the back-end interface sends the received recognition result to the user terminal for display.

5. The voice stream extraction and speech recognition system according to claim 1, characterized in that the system further comprises a display terminal for displaying the recognition result.

6. The voice stream extraction and speech recognition system according to claim 1, characterized in that the call device is at least one of the following: a fixed-line telephone base, a mobile phone, a computer, and a tablet computer.

7. The voice stream extraction and speech recognition system according to claim 1, characterized in that the microphone is a headset microphone; correspondingly, the loudspeaker is a headset speaker.

8. A voice stream extraction and speech recognition method, characterized by comprising the following steps:

Acquiring first audio sent by a microphone to a call device;

Acquiring second audio sent by the call device to a loudspeaker;

After an automatic speech recognition module performs speech recognition on the first audio and the second audio, sending the recognition result to a user terminal for display.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the voice stream extraction and speech recognition method according to claim 8.

10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the voice stream extraction and speech recognition method according to claim 8.