CN109189365A

CN109189365A - Speech recognition method, storage medium and terminal device

Info

Publication number: CN109189365A
Application number: CN201810941587.4A
Authority: CN
Inventors: 肖伟平
Original assignee: Ping An Puhui Enterprise Management Co Ltd
Current assignee: Ping An Puhui Enterprise Management Co Ltd
Priority date: 2018-08-17
Filing date: 2018-08-17
Publication date: 2019-01-11

Abstract

The present invention relates to the field of communication technology and proposes a speech recognition method, a storage medium and a terminal device. The speech recognition method includes: obtaining permission to use the microphone for voice input by means of an HTML5 function; collecting the voice signal that the user inputs through the microphone to obtain a target audio file; sending the target audio file to a third-party interface that provides a speech recognition service; and, after the third party completes natural language recognition processing, receiving the speech recognition result of the target audio file from the third-party interface. With this method, the user can complete voice collection and speech recognition directly in a browser without installing an APP client, which reduces the product's dependence on a host APP and broadens the channels through which it can be applied.

Description

Speech recognition method, storage medium and terminal device

Technical field

The present invention relates to the field of communication technology, and in particular to a speech recognition method, a storage medium and a terminal device.

Background art

At present, speech recognition technology has achieved remarkable results and is widely used in many fields such as household appliances, communications, automotive electronics, medical care, home services and consumer electronics. Speech recognition is the process by which a machine converts a voice signal into the corresponding text or command through recognition and understanding. It generally includes steps such as voice collection, front-end processing, acoustic feature extraction, building a speech recognition model, and performing recognition.

For the convenience of users, voice collection and speech recognition functions are usually integrated into an APP (Application). However, this requires the user to install an APP client on the terminal, so the dependence on the APP is too strong.

Summary of the invention

In view of this, embodiments of the present invention provide a speech recognition method, a storage medium and a terminal device that can collect and recognize voice without depending on an APP.

In a first aspect of the embodiments of the present invention, a speech recognition method is provided, comprising:

Obtaining permission to use the microphone for voice input by means of an HTML5 function;

Collecting the voice signal input by the user through the microphone to obtain a target audio file;

Sending the target audio file to a third-party interface that provides a speech recognition service;

Receiving the speech recognition result of the target audio file from the third-party interface.

In a second aspect of the embodiments of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the speech recognition method proposed in the first aspect of the embodiments of the present invention.

In a third aspect of the embodiments of the present invention, a terminal device is provided, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

The speech recognition method proposed by the embodiments of the present invention includes: obtaining permission to use the microphone for voice input by means of an HTML5 function; collecting the voice signal input by the user through the microphone to obtain a target audio file; sending the target audio file to a third-party interface that provides a speech recognition service; and receiving the speech recognition result of the target audio file from the third-party interface. With this method, the user only needs to open a browser to complete voice collection and speech recognition with the help of HTML5 functions. No APP client needs to be installed, which reduces the product's dependence on a host APP and broadens the channels through which it can be applied.

Brief description of the drawings

In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a first embodiment of a speech recognition method provided by an embodiment of the present invention;

Fig. 2 is a flow chart of a second embodiment of a speech recognition method provided by an embodiment of the present invention;

Fig. 3 is a structural diagram of one embodiment of a speech recognition apparatus provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of a terminal device provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention provide a speech recognition method, a storage medium and a terminal device that can collect and recognize voice without depending on an APP.

In order to make the purpose, features and advantages of the present invention more obvious and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the embodiments described below are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, a first embodiment of a speech recognition method in the embodiments of the present invention includes:

101, obtaining permission to use the microphone for voice input by means of an HTML5 function;

First, the browser of the terminal device is opened and microphone permission is obtained by means of an HTML5 function. Specifically, the HTML5 function navigator.getUserMedia({ audio: true }, function () {}) can be used to prompt the user that the system needs permission to use the microphone for voice input. Once the user grants permission, the system can operate the microphone of the terminal device.

102, collecting the voice signal input by the user through the microphone to obtain a target audio file;

After microphone permission has been obtained on the terminal device, the voice signal uttered by the user can be collected through the microphone to obtain the target audio file. Specifically, the system can obtain the target audio file through the HTML5 function window.audioContext.createMediaStreamSource().

103, sending the target audio file to a third-party interface that provides a speech recognition service;

After the target audio file is obtained, it is sent to a third-party interface that provides a speech recognition service, and the third-party plug-in performs the speech recognition.

Further, before step 103, the method may also include:

(1) extracting the duration, pitch and volume of the target audio file;

(2) if the duration, pitch or volume does not meet its corresponding condition, prompting the user to re-enter the voice signal;

(3) if the duration, pitch and volume all meet their corresponding conditions, executing step 103.

To improve the accuracy of speech recognition, certain audio parameters of the target audio file can first be constrained so that a high-quality audio file is obtained. For example, parameters such as the duration, pitch and volume of the target audio file can be constrained. If the current duration, pitch or volume does not meet the preset condition (for example, the volume is too low or the duration is too short), the current target audio file is discarded and the user is prompted to re-enter the voice signal, until a high-quality audio file that meets the requirements is obtained.

Further, whether the duration, pitch or volume meets its corresponding condition can be determined as follows:

If the duration is less than a preset lower limit on recording duration, the duration is determined not to meet its condition;

If the pitch falls outside a preset pitch range, the pitch is determined not to meet its condition;

If the volume is less than a preset lower limit on volume, the volume is determined not to meet its condition.

In this way, a target audio file whose recording duration, pitch and volume all meet the requirements can be obtained, which effectively improves the accuracy of the subsequent speech recognition.
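The three checks above on duration, tone (pitch) and volume can be sketched as follows. This is a minimal illustration, not the patent's implementation: the threshold values, the units (seconds, hertz, dBFS) and the names `AudioParams` and `failed_checks` are all assumptions, since the patent only states that the limits are preset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioParams:
    duration_s: float  # recording duration in seconds
    pitch_hz: float    # estimated pitch in hertz
    volume_db: float   # average level in dBFS (0 is full scale)


# Hypothetical thresholds; the patent only says they are "preset".
MIN_DURATION_S = 2.0
PITCH_RANGE_HZ = (80.0, 400.0)
MIN_VOLUME_DB = -40.0


def failed_checks(p: AudioParams) -> List[str]:
    """Return the names of the parameters that fail their condition."""
    failures = []
    if p.duration_s < MIN_DURATION_S:
        failures.append("duration")
    low, high = PITCH_RANGE_HZ
    if not (low <= p.pitch_hz <= high):
        failures.append("pitch")
    if p.volume_db < MIN_VOLUME_DB:
        failures.append("volume")
    return failures
```

An empty list means the file may be sent on to the third-party interface (step 103); a non-empty list means the file should be discarded and the user prompted to record again.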

104, receiving the speech recognition result of the target audio file from the third-party interface.

After the target audio file is sent to the third-party interface that provides the speech recognition service, the third-party plug-in performs speech recognition on the target audio file and produces a speech recognition result. The terminal system then receives the speech recognition result of the target audio file from the third-party interface, which completes one round of voice collection and speech recognition.

Referring to Fig. 2, a second embodiment of a speech recognition method in the embodiments of the present invention includes:

201, receiving a target text sent by a server;

Specifically, the terminal device can connect to the server through the relevant HTML5 interface to obtain the target text. The target text may be content that the user needs to read aloud in a given business process, for example an agreement consenting to a company providing a certain service, a disclaimer, and so on.

202, displaying the target text on the terminal interface, and outputting prompt information instructing the user to read the target text aloud within a first duration;

After the target text is received, it is displayed on the terminal interface, and prompt information is output instructing the user to read the target text aloud within the first duration. For example, the terminal interface displays the target text "I, XXX (name), have read the service agreement of Company A, understand its contents, and agree to all terms of the service agreement", and the instruction "Please read the above content aloud within 20 seconds" is then given by voice or text.

203, obtaining permission to use the microphone for voice input by means of an HTML5 function;

For details of step 203, refer to step 101.

204, after a signal to wake up the microphone is detected, turning on the microphone, starting a timer, and collecting the voice signal input by the user, the timing period of the timer being the first duration;

When the user is ready, the user wakes up the microphone and starts reading the target text aloud. After detecting the signal to wake up the microphone, the terminal system turns on the microphone and starts the timer, whose timing period is the first duration. In addition, a countdown can be displayed on the terminal interface so that the user can pace the reading.

Further, the first duration can be determined as follows:

(1) counting the number of characters in the target text;

(2) calculating the reference time needed to read the target text aloud from the number of characters and a preset reference speech rate;

(3) multiplying the reference time by a preset proportionality coefficient to obtain the first duration.

For example, if the target text has 200 characters and the reference speech rate is 2 characters per second, the reference time is 200 / 2 = 100 seconds. To leave some margin, the reference time is then multiplied by a proportionality coefficient (for example 1.1) to obtain the first duration, here 110 seconds.
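The calculation of the first duration can be written as a short function. This is a sketch under stated assumptions: the function name `first_duration` and its parameter names are hypothetical, every character of the text (including punctuation and spaces) is counted, and the default values simply mirror the example figures of 2 characters per second and a coefficient of 1.1.

```python
def first_duration(text: str,
                   chars_per_second: float = 2.0,
                   margin: float = 1.1) -> float:
    """Return the time limit, in seconds, for reading `text` aloud."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    # Reference time: character count divided by the reference speech rate.
    reference_time = len(text) / chars_per_second
    # Scale by the proportionality coefficient to leave some slack.
    return reference_time * margin
```

For a 200-character text with the default arguments, the reference time is 100 seconds and the returned limit is approximately 110 seconds, subject to floating-point rounding.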

205, after the timing period elapses, turning off the microphone and obtaining the target audio file;

After the timing period elapses, the system turns off the microphone automatically and obtains the target audio file, which records the voice signal uttered by the user while the microphone was on.

206, sending the target audio file to a third-party interface that provides a speech recognition service;

After the target audio file is obtained, it is sent to a third-party interface that provides a speech recognition service. The third-party plug-ins currently on the market that provide speech recognition services each have their own strengths: some focus on improving recognition accuracy, some on recognition in specific semantic scenarios, and some on speech recognition in high-noise environments. Therefore, to improve the accuracy of speech recognition, the most suitable third-party plug-in can be selected on the basis of certain attribute parameters of the target audio file. Specifically, step 206 may include:

(1) determining the language category, application scenario and number of rare characters of the target audio file from the target text;

(2) extracting the noise intensity of the target audio file;

(3) choosing a target third-party interface from multiple third-party interfaces according to the language category, application scenario, number of rare characters and noise intensity;

(4) sending the target audio file to the target third-party interface.

First, the language category, application scenario and number of rare characters of the target audio file are determined from the target text. Language categories include Chinese, English, etc.; application scenarios include financial business, smart home, education, etc.; the number of rare characters is the count of rare characters in the target text. In practice, these attributes can be set when the target text is generated. Specifically, they can be represented by a string of information with multiple fields, so that the attribute information is obtained together with the target text. Next, the noise intensity of the target audio file is extracted; noise intensity is also an important parameter affecting the recognition performance of a third-party plug-in. Then, a target third-party interface is chosen from multiple third-party interfaces according to the language category, application scenario, number of rare characters and noise intensity. That is, the optimal target third-party interface is selected for the target audio file according to the characteristics of each third-party plug-in. For example, suppose there are third-party plug-ins A and B, where A is suited to recognizing Chinese speech and B is suited to recognizing foreign-language speech; if the language category of the target audio file is English, the interface of B is chosen as the target third-party interface. As another example, suppose there are third-party plug-ins C and D, where C is suited to speech recognition in high-noise environments; if the noise intensity of the target audio file exceeds a certain threshold, the interface of C is chosen as the target third-party interface.
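The patent does not say how the four attributes are combined into a choice, so the following is only one possible sketch of the selection in step 206. It assumes that language support is a hard requirement and that the remaining attributes contribute to a simple score. The types `TextAttributes` and `Provider`, the function `select_provider`, the 0 to 1 noise scale and both thresholds are hypothetical; "rare_char_count" stands for the count of rare (uncommon) characters in the target text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class TextAttributes:
    language: str         # e.g. "zh" or "en"
    scenario: str         # e.g. "finance", "smart_home", "education"
    rare_char_count: int  # rare characters in the target text


@dataclass(frozen=True)
class Provider:
    name: str
    languages: FrozenSet[str]
    scenarios: FrozenSet[str]
    noise_robust: bool = False
    handles_rare_chars: bool = False


# Hypothetical thresholds; the patent only speaks of "a certain threshold".
NOISE_THRESHOLD = 0.5     # noise intensity on an assumed 0..1 scale
RARE_CHAR_THRESHOLD = 3


def select_provider(providers: List[Provider], attrs: TextAttributes,
                    noise_intensity: float) -> Optional[Provider]:
    """Pick the provider that best fits the text attributes and noise level."""
    # Language support is treated as a hard requirement.
    candidates = [p for p in providers if attrs.language in p.languages]
    if not candidates:
        return None

    def score(p: Provider) -> int:
        s = 0
        if attrs.scenario in p.scenarios:
            s += 1
        if noise_intensity > NOISE_THRESHOLD and p.noise_robust:
            s += 1
        if attrs.rare_char_count > RARE_CHAR_THRESHOLD and p.handles_rare_chars:
            s += 1
        return s

    # max() keeps the first provider on ties, so list order is the tie-breaker.
    return max(candidates, key=score)
```

With a Chinese-only provider A, an English-only provider B and a noise-robust provider C, an English text recorded in a quiet room selects B when B also covers the scenario, while a noisy recording can tip the choice to C, which mirrors the two examples in the text.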

207, receiving the speech recognition result of the target audio file from the third-party interface;

After the target audio file is sent to the third-party interface that provides the speech recognition service, the third-party plug-in performs speech recognition on the target audio file and produces a speech recognition result.

208, matching the speech recognition result against the target text to obtain a matching degree;

The received speech recognition result is matched against the target text to obtain a matching degree. The speech recognition result is itself a text file; it is matched against the target text, and the higher the matching degree, the more accurate the speech recognition. For the matching, a prior-art dynamic programming algorithm can be combined with a keyword matching algorithm.
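The patent does not specify how the matching degree is computed. One possible sketch, assuming a character-level edit distance (a classic dynamic-programming algorithm) normalized by the length of the longer string, is shown below. The function names `edit_distance` and `matching_degree` and the normalization are assumptions made for illustration, not the patent's method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def matching_degree(recognized: str, target: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two texts are identical."""
    longest = max(len(recognized), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(recognized, target) / longest
```

A real system would probably normalize punctuation and whitespace before comparing, because a recognizer's output rarely reproduces the punctuation of the displayed text.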

209, judging whether the matching degree is greater than a first threshold;

After the matching degree is obtained, it is compared with the first threshold (for example 95%). If the matching degree is greater than the first threshold, the speech recognition is considered accurate and step 210 is executed. If the matching degree is less than the first threshold, the recognition quality is poor; in that case the method returns to step 202 and the user reads the target text aloud again, until a target audio file whose recognition quality meets the requirement is obtained.

Further, after the matching degree is determined to be greater than the first threshold and before step 210 is executed, the method may also include:

Judging whether all keywords in the target text have been recognized correctly;

If all of them have been recognized correctly, executing step 210; otherwise, returning to step 202.

The keywords in the target text are specified in advance, for example "the user's name", "time" and "agree". All of these keywords must be recognized correctly.
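The threshold test of step 209 and the keyword check can be combined into a single decision, sketched below. The function name `accept_recording` is hypothetical, and treating a keyword as "recognized correctly" when it appears verbatim in the recognized text is an assumption. The text does not say what happens when the matching degree equals the threshold exactly; this sketch treats that case as a failure.

```python
from typing import Iterable


def accept_recording(degree: float,
                     recognized: str,
                     keywords: Iterable[str],
                     first_threshold: float = 0.95) -> bool:
    """Return True if the recording may be submitted (step 210)."""
    # Step 209: the matching degree must exceed the first threshold.
    # Equality is unspecified in the text and is rejected here.
    if degree <= first_threshold:
        return False
    # Every pre-specified keyword must appear in the recognized text.
    return all(keyword in recognized for keyword in keywords)
```

A return value of False corresponds to going back to step 202 and asking the user to read the text again.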

210, submitting the target audio file to the server.

A matching degree greater than the first threshold shows that the target audio file meets the quality requirement. The target audio file is then submitted to the server so that the subsequent business process can proceed, such as signing the agreement and the formalities for bringing it into effect.

The speech recognition method proposed by this embodiment of the present invention includes: receiving a target text sent by a server; displaying the target text on the terminal interface, and outputting prompt information instructing the user to read the target text aloud within a first duration; obtaining permission to use the microphone for voice input by means of an HTML5 function; after a signal to wake up the microphone is detected, turning on the microphone and starting a timer whose timing period is the first duration; after the timing period elapses, turning off the microphone and obtaining the target audio file; sending the target audio file to a third-party interface that provides a speech recognition service; receiving the speech recognition result of the target audio file from the third-party interface; matching the speech recognition result against the target text to obtain a matching degree; and judging whether the matching degree is greater than a first threshold and, if so, submitting the target audio file to the server. This embodiment proposes a method of collecting the user's voice information as a voucher for business handling. The user can complete voice collection and speech recognition directly in a browser without installing an APP client, which reduces the product's dependence on a host APP and broadens the channels through which it can be applied.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.

The above mainly describes a speech recognition method; a speech recognition apparatus is described in detail below.

Referring to Fig. 3, one embodiment of a speech recognition apparatus in the embodiments of the present invention includes:

A microphone permission acquisition module 301, configured to obtain permission to use the microphone for voice input by means of an HTML5 function;

A voice signal collection module 302, configured to collect the voice signal input by the user through the microphone to obtain a target audio file;

An audio file sending module 303, configured to send the target audio file to a third-party interface that provides a speech recognition service;

A recognition result receiving module 304, configured to receive the speech recognition result of the target audio file from the third-party interface.

Further, the speech recognition apparatus may also include:

A target text receiving module, configured to receive a target text sent by a server;

A target text display module, configured to display the target text on the terminal interface and to output prompt information instructing the user to read the target text aloud within a first duration;

A matching module, configured to match the speech recognition result against the target text to obtain a matching degree;

An audio file submission module, configured to submit the target audio file to the server if the matching degree is greater than a first threshold.

Further, the voice signal collection module may include:

A timing start unit, configured to turn on the microphone, start a timer and collect the voice signal input by the user after a signal to wake up the microphone is detected, the timing period of the timer being the first duration;

A microphone closing unit, configured to turn off the microphone after the timing period elapses.

The voice signal collection module may also include:

A character counting unit, configured to count the number of characters in the target text;

A reference time calculation unit, configured to calculate the reference time needed to read the target text aloud from the number of characters and a preset reference speech rate;

A timing period calculation unit, configured to multiply the reference time by a preset proportionality coefficient to obtain the first duration.

Further, the audio file sending module may include:

An audio parameter determination unit, configured to determine the language category, application scenario and number of rare characters of the target audio file from the target text;

A noise intensity extraction unit, configured to extract the noise intensity of the target audio file;

A third-party interface selection unit, configured to choose a target third-party interface from multiple third-party interfaces according to the language category, application scenario, number of rare characters and noise intensity;

An audio file sending unit, configured to send the target audio file to the target third-party interface.

Further, the speech recognition apparatus may also include:

An audio parameter extraction module, configured to extract the duration, pitch and volume of the target audio file;

A first condition judgment module, configured to prompt the user to re-enter the voice signal if the duration, pitch or volume does not meet its corresponding condition;

A second condition judgment module, configured to execute the step of sending the target audio file to the third-party interface that provides the speech recognition service, and the subsequent steps, if the duration, pitch and volume all meet their corresponding conditions.

An embodiment of the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of any one of the speech recognition methods shown in Fig. 1 or Fig. 2.

An embodiment of the present invention also provides a terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, it implements the steps of any one of the speech recognition methods shown in Fig. 1 or Fig. 2.

Fig. 4 is a schematic diagram of a terminal device provided by an embodiment of the present invention. As shown in Fig. 4, the terminal device 4 of this embodiment includes a processor 40, a memory 41, and computer-readable instructions 42 stored in the memory 41 and executable on the processor 40. When the processor 40 executes the computer-readable instructions 42, it implements the steps of the speech recognition method embodiments described above, for example steps 101 to 104 shown in Fig. 1. Alternatively, when the processor 40 executes the computer-readable instructions 42, it implements the functions of the modules/units in the apparatus embodiments described above, for example the functions of modules 301 to 304 shown in Fig. 3.

For example, the computer-readable instructions 42 can be divided into one or more modules/units, which are stored in the memory 41 and executed by the processor 40 to carry out the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and these instruction segments describe the execution of the computer-readable instructions 42 in the terminal device 4.

The terminal device 4 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud terminal device. The terminal device 4 may include, but is not limited to, the processor 40 and the memory 41. Those skilled in the art will understand that Fig. 4 is only an example of the terminal device 4 and does not limit it; the terminal device 4 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the terminal device 4 may also include input/output devices, network access devices, a bus, and so on.

The processor 40 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

The memory 41 may be an internal storage unit of the terminal device 4, such as a hard disk or internal memory of the terminal device 4. The memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) fitted to the terminal device 4. Further, the memory 41 may include both an internal storage unit of the terminal device 4 and an external storage device. The memory 41 is used to store the computer-readable instructions and other programs and data required by the terminal device. The memory 41 may also be used to temporarily store data that has been output or is about to be output.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a terminal device, a network device, etc.) to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A speech recognition method, characterized by comprising:

2. The speech recognition method according to claim 1, characterized in that, before obtaining permission to use the microphone for voice input by means of an HTML5 function, the method further comprises:

Receiving a target text sent by a server;

Displaying the target text on the terminal interface, and outputting prompt information instructing the user to read the target text aloud within a first duration;

After receiving the speech recognition result of the target audio file from the third-party interface, the method further comprises:

Matching the speech recognition result against the target text to obtain a matching degree;

If the matching degree is greater than a first threshold, submitting the target audio file to the server.

3. The speech recognition method according to claim 2, characterized in that collecting the voice signal input by the user through the microphone comprises:

After a signal to wake up the microphone is detected, turning on the microphone, starting a timer and collecting the voice signal input by the user, the timing period of the timer being the first duration;

After the timing period elapses, turning off the microphone;

Wherein the first duration is determined by the following steps:

Counting the number of characters in the target text;

Calculating the reference time needed to read the target text aloud from the number of characters and a preset reference speech rate;

Multiplying the reference time by a preset proportionality coefficient to obtain the first duration.

4. The speech recognition method according to claim 2, characterized in that sending the target audio file to a third-party interface that provides a speech recognition service comprises:

Determining the language category, application scenario and number of rare characters of the target audio file from the target text;

Extracting the noise intensity of the target audio file;

Choosing a target third-party interface from multiple third-party interfaces according to the language category, application scenario, number of rare characters and noise intensity;

Sending the target audio file to the target third-party interface.

5. The speech recognition method according to any one of claims 1 to 4, characterized in that, before sending the target audio file to the third-party interface that provides the speech recognition service, the method further comprises:

Extracting the duration, pitch and volume of the target audio file;

If the duration, pitch or volume does not meet its corresponding condition, prompting the user to re-enter the voice signal;

If the duration, pitch and volume all meet their corresponding conditions, executing the step of sending the target audio file to the third-party interface that provides the speech recognition service, and the subsequent steps.

6. The speech recognition method according to claim 5, characterized in that whether the duration, pitch or volume meets its corresponding condition is determined by the following steps:

7. A computer-readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the steps of the speech recognition method according to any one of claims 1 to 6.

8. A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:

9. The terminal device according to claim 8, characterized in that, before obtaining permission to use the microphone for voice input by means of an HTML5 function, the steps further comprise:

Receiving a target text sent by a server;

10. The terminal device according to claim 9, characterized in that sending the target audio file to a third-party interface that provides a speech recognition service comprises:

Extracting the noise intensity of the target audio file;

Sending the target audio file to the target third-party interface.