CN109300472A

CN109300472A - Speech recognition method, apparatus, device, and medium

Info

Publication number: CN109300472A
Application number: CN201811572238.6A
Authority: CN
Inventors: 许辉福; 李峰攀; 袁建强; 伍以文
Original assignee: Shenzhen Skyworth RGB Electronics Co Ltd
Current assignee: Shenzhen Skyworth RGB Electronics Co Ltd
Priority date: 2018-12-21
Filing date: 2018-12-21
Publication date: 2019-02-01

Abstract

Embodiments of the invention disclose a speech recognition method, apparatus, device, and medium. The method includes: obtaining a voice request; and performing speech recognition on the voice request based on a pre-trained speech recognition system to obtain the intent information corresponding to the voice request, wherein the pre-trained speech recognition system is obtained by real-time training based on training metadata acquired in real time. Using this speech recognition method can improve the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device, and medium

Technical field

Embodiments of the present invention relate to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, device, and medium.

Background technique

With the development of Internet of Things technology, intelligent control has become a direction of future development, and voice control is the most important aspect of intelligent control. With continued research and development of the related technologies, voice control has been applied to various electronic devices and has achieved initial results.

However, the recognition accuracy of current speech recognition technology is still not high.

Summary of the invention

Embodiments of the present invention provide a speech recognition method, apparatus, device, and medium that can improve the accuracy of speech recognition.

In a first aspect, an embodiment of the present invention provides a speech recognition method, the method comprising:

obtaining a voice request;

performing speech recognition on the voice request based on a pre-trained speech recognition system to obtain the intent information corresponding to the voice request;

wherein the pre-trained speech recognition system is obtained by real-time training based on training metadata acquired in real time.

In a second aspect, an embodiment of the present invention further provides a speech recognition apparatus, the apparatus comprising:

an acquisition module, configured to obtain a voice request;

a recognition module, configured to perform speech recognition on the voice request based on a pre-trained speech recognition system to obtain the intent information corresponding to the voice request.

In a third aspect, an embodiment of the present invention further provides an electronic device, the electronic device comprising:

one or more processors;

a storage device, configured to store multiple programs;

wherein, when at least one of the multiple programs is executed by the one or more processors, the one or more processors implement the speech recognition method provided in the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the program implementing the speech recognition method of the first aspect when executed by a processor.

In the speech recognition method provided by embodiments of the present invention, the speech recognition system is trained on training metadata acquired in real time to obtain a pre-trained speech recognition system. Because a system obtained in this way learns each newly emerging item of metadata promptly, it has higher recognition accuracy. When a voice request is obtained, speech recognition is performed on it based on the pre-trained speech recognition system to obtain the corresponding intent information. This scheme improves the accuracy of speech recognition.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a speech recognition method provided by Embodiment one of the present invention;

Fig. 2 is a schematic diagram of a training metadata collection process provided by Embodiment one of the present invention;

Fig. 3 is a schematic structural diagram of a speech recognition apparatus provided by Embodiment two of the present invention;

Fig. 4 is a schematic hardware structure diagram of an electronic device provided by Embodiment three of the present invention.

Detailed description of embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are used only to explain the present invention and do not limit it.

It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content. Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential, many of them can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process may correspond to a method, function, procedure, subroutine, subprogram, and so on.

Embodiment one

Fig. 1 is a schematic flowchart of a speech recognition method provided by Embodiment one of the present invention. The speech recognition method provided in this embodiment is applicable to controlling smart devices by voice; the smart device is, for example, a smart speaker, smart TV, smartphone, or in-vehicle smart device. The method is executed by a speech recognition apparatus, which is implemented in software and/or hardware and is generally integrated in a terminal such as a smart speaker, smart TV, smartphone, or in-vehicle smart device. Referring to Fig. 1, the speech recognition method includes the following steps:

Step 110: obtain a voice request.

Specifically, the voice request can be obtained through a voice pickup input device. For example, a voice request issued by the user to a TV is obtained through a voice remote control, or a voice request issued by the user to a smart speaker is obtained through the speaker's voice-receiving microphone.

The voice request varies with the device being controlled. When the controlled device is a smart TV, the voice request can be a request for a TV function, such as "please turn the volume up/down", "please switch to the sports channel", or "please play a spy film starring a certain movie star". When the controlled device is a smart speaker, the voice request can be a request for a speaker function, such as an on-demand request for music, video, or encyclopedia content, for example "please play the Happy Birthday song".

Step 120: perform speech recognition on the voice request based on a pre-trained speech recognition system to obtain the intent information corresponding to the voice request.

The pre-trained speech recognition system is obtained by real-time training based on training metadata acquired in real time. As times change, many topical loanwords and buzzwords (such as "workplace novice" and "rookie") come into wide use; and as film and television content changes rapidly, a large amount of new video data is produced every day. Therefore, to improve recognition accuracy, the speech recognition system needs to prioritize learning new content such as these loanwords, buzzwords, and new video data. Accordingly, obtaining rich training metadata in a timely manner is the key to improving the recognition accuracy of the speech recognition system.

Further, the method also includes acquiring training metadata in real time, which specifically includes:

building a training metadata collection platform on target network infrastructure based on a web service framework;

calling an application programming interface (API) through the training metadata collection platform to access OTT websites in real time, so as to acquire the training metadata.

The target network infrastructure includes network infrastructure provided by Amazon, Alibaba, and the like. The web service framework can be, for example, Spring Cloud. OTT websites provide data services covering the latest videos, buzzwords, loanwords, and so on. Building the training metadata collection platform on the target network infrastructure therefore allows the latest training metadata to be collected promptly and in a targeted way, yielding a rich set of training metadata. The fields of the training metadata include: program title (for example, Home with Kids, Story of Yanxi Palace, Happy Camp), plot description, genre (for example, variety show, romance, costume drama), cast, crew, studio information, poster image, user rating information, version information, and release date. The training metadata covers one or more languages and multiple dialects of the same language.
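The field list above can be pictured as a simple record type. The following is a minimal sketch only: the class name, attribute names, defaults, and sample values are illustrative assumptions, since the text names the fields but does not define a schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class TrainingMetadata:
    """One training-metadata record; attribute names are hypothetical."""
    title: str                                       # program title
    synopsis: str = ""                               # plot description
    genres: List[str] = field(default_factory=list)  # e.g. variety show, romance
    cast: List[str] = field(default_factory=list)
    crew: List[str] = field(default_factory=list)
    studio: str = ""
    poster_url: str = ""
    user_rating: Optional[float] = None
    version: str = ""
    release_date: str = ""                           # ISO 8601 date string
    language: str = "zh"                             # language of the text fields
    dialect: Optional[str] = None                    # set for dialect variants


# Hypothetical sample record.
record = TrainingMetadata(
    title="Example Drama",
    genres=["costume drama"],
    language="en",
)
record_dict = asdict(record)  # plain dict, ready to serialize
```

Keeping language and dialect on each record is one way to support the per-language and per-dialect output that the method describes later.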

Further, calling the API through the training metadata collection platform to access OTT websites in real time, so as to acquire the training metadata, includes:

calling the API through the training metadata collection platform to access OTT websites in real time, and obtaining video tag information based on preset rules;

parsing the metadata of the video according to the video tag information.

The video service providers on OTT websites emphasize different types of video: some focus on films, TV series, and variety shows; others focus on original content, news, and documentaries. The video tag information includes basic information such as the video name, video type, and video link.
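The two-stage acquisition (first obtain video tags, then resolve each tag to full metadata) can be sketched as follows. This is an illustration under assumptions: the OTT API is not specified in the text, so the two fetch functions are injected as callables and replaced here by hypothetical stand-ins with made-up data.

```python
from typing import Callable, Dict, Iterable, List

VideoTag = Dict[str, str]  # expected keys: "name", "type", "link"


def collect_metadata(
    fetch_tags: Callable[[], Iterable[VideoTag]],
    fetch_detail: Callable[[str], Dict[str, object]],
) -> List[Dict[str, object]]:
    """Stage 1: get video tags. Stage 2: resolve each tag's link to metadata."""
    records = []
    for tag in fetch_tags():
        detail = dict(fetch_detail(tag["link"]))
        # Carry the tag's basic info over when the detail lacks it.
        detail.setdefault("title", tag["name"])
        detail.setdefault("type", tag["type"])
        records.append(detail)
    return records


# Hypothetical stand-ins for a real OTT API client.
def fake_fetch_tags():
    return [{"name": "Example Film", "type": "movie", "link": "ott://video/1"}]


def fake_fetch_detail(link):
    return {"cast": ["Actor A"], "release_date": "2018-12-01", "source": link}


records = collect_metadata(fake_fetch_tags, fake_fetch_detail)
```

Injecting the fetchers keeps the parsing logic independent of any particular provider's API.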

Further, obtaining video tag information based on preset rules includes:

obtaining a set quantity of video tag information;

or, obtaining video tag information according to the update time of the videos, for example obtaining each time the tag information of videos updated within one day of the current time.
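The two preset rules can be sketched as small filter functions. The function names, the `updated_at` key, and the sample data are assumptions made for illustration; only the rules themselves (a set quantity, or an update time within one day) come from the text.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_by_count(tags: List[Dict], limit: int) -> List[Dict]:
    """Rule 1: take at most `limit` tags, in the order they were returned."""
    return tags[:limit]


def select_by_update_time(
    tags: List[Dict], now: datetime, window: timedelta = timedelta(days=1)
) -> List[Dict]:
    """Rule 2: keep tags whose video was updated within `window` before `now`."""
    return [t for t in tags if timedelta(0) <= now - t["updated_at"] <= window]


# Hypothetical sample data with a fixed "current time".
now = datetime(2018, 12, 21, 12, 0)
tags = [
    {"name": "A", "updated_at": datetime(2018, 12, 21, 8, 0)},
    {"name": "B", "updated_at": datetime(2018, 12, 19, 8, 0)},
    {"name": "C", "updated_at": datetime(2018, 12, 20, 13, 0)},
]
recent = [t["name"] for t in select_by_update_time(tags, now)]
first_two = [t["name"] for t in select_by_count(tags, 2)]
```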

Further, the method also includes:

generating target files from the training metadata acquired in real time based on different languages; or generating target files from the training metadata acquired in real time based on regional dialects; and uploading the target files to a speech recognition platform in real time, so that the speech recognition system is trained in real time on the training metadata acquired in real time.

The target files include files based on at least two languages, for example one target file generated based on Chinese and another generated based on English; the target files can also include files generated based on regional dialects. A target file can be an XML file or a JSON file.
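One possible way to produce per-language and per-dialect target files in JSON is sketched below. The file naming scheme, the payload layout, and the `language`/`dialect` keys are assumptions; the text only states that files are generated per language or dialect, may be XML or JSON, and carry version information.

```python
import json
from collections import defaultdict
from typing import Dict, List


def build_target_files(records: List[Dict], version: str) -> Dict[str, str]:
    """Group records by language (and dialect, if any); emit one JSON file each."""
    grouped = defaultdict(list)
    for rec in records:
        # Group key: language plus optional dialect, e.g. "zh" or "zh-yue".
        key = rec["language"]
        if rec.get("dialect"):
            key = f'{key}-{rec["dialect"]}'
        grouped[key].append(rec)

    files = {}
    for key, items in grouped.items():
        payload = {"version": version, "locale": key, "items": items}
        files[f"metadata_{key}.json"] = json.dumps(
            payload, ensure_ascii=False, sort_keys=True
        )
    return files


# Hypothetical sample records.
records = [
    {"title": "Example Show", "language": "en"},
    {"title": "示例节目", "language": "zh"},
    {"title": "示例節目", "language": "zh", "dialect": "yue"},
]
files = build_target_files(records, version="2018-12-21.1")
```

Passing `ensure_ascii=False` keeps Chinese titles readable in the generated files instead of escaping them.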

Uploading the target file to the speech recognition platform in real time can specifically be:

uploading the full version of the target file to the speech recognition platform in real time;

or, uploading to the speech recognition platform the difference metadata between the current version of the target file and the previous version, which avoids re-uploading duplicate metadata, saves upload traffic, and improves upload speed; wherein the target file includes file version information.
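The differential upload option can be illustrated by computing a delta between two versions of a target file. This is a sketch under assumptions: records are assumed to be keyed by a stable identifier, and reporting removed records is an added convenience that the text does not mention.

```python
from typing import Dict


def diff_metadata(previous: Dict[str, dict], current: Dict[str, dict]) -> dict:
    """Return only what changed between two versions, keyed by record id."""
    added = {k: v for k, v in current.items() if k not in previous}
    changed = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    removed = sorted(k for k in previous if k not in current)
    return {"added": added, "changed": changed, "removed": removed}


# Hypothetical previous and current versions of a target file's records.
previous = {"v1": {"title": "Film A"}, "v2": {"title": "Film B"}}
current = {
    "v1": {"title": "Film A"},
    "v2": {"title": "Film B (HD)"},
    "v3": {"title": "Film C"},
}
delta = diff_metadata(previous, current)
```

Only `delta` would then need to be uploaded, together with the version information of both files so the platform can apply it to the right baseline.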

Further, when the speech recognition platform receives the target file, it can verify the validity of the metadata in the file and generate a corresponding report indicating whether verification passed or failed, so as to improve metadata quality. The verification can use a CRC algorithm. The speech recognition platform can further build a knowledge graph from the uploaded metadata and incorporate it into the training of the speech recognition system. A knowledge graph introduces more contextual relations into the speech recognition process: for example, the director and lead actors of a film can be inferred from its title, and other films featuring a lead actor can in turn be inferred from that actor. For example, if the user requests "play Liu Dehua's (Andy Lau's) spy film", then films such as "Infernal Affairs" and "Switch" are proactively recommended to the user according to the knowledge graph of the metadata.
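The two platform-side ideas above, a CRC check on the received file and title-to-actor-to-film inference over a knowledge graph, can be sketched together. Everything here is an assumption for illustration: the text does not say which CRC variant is used or how the graph is stored, so the sketch uses CRC-32 from Python's standard `zlib` module and a pair of in-memory indexes with placeholder names.

```python
import zlib
from collections import defaultdict


def verify_crc32(payload: bytes, expected_crc: int) -> bool:
    """Integrity check: does the payload match the checksum sent with it?"""
    return zlib.crc32(payload) == expected_crc


class KnowledgeGraph:
    """Tiny film/actor graph stored as two inverted indexes."""

    def __init__(self):
        self.film_to_stars = defaultdict(set)
        self.star_to_films = defaultdict(set)

    def add_film(self, title, stars):
        for star in stars:
            self.film_to_stars[title].add(star)
            self.star_to_films[star].add(title)

    def films_starring(self, star):
        return sorted(self.star_to_films.get(star, ()))

    def related_films(self, title):
        """Other films that share at least one lead actor with `title`."""
        related = set()
        for star in self.film_to_stars.get(title, ()):
            related |= self.star_to_films[star]
        related.discard(title)
        return sorted(related)


# Hypothetical data.
payload = b'{"title": "Film A"}'
crc = zlib.crc32(payload)

kg = KnowledgeGraph()
kg.add_film("Film A", ["Actor X", "Actor Y"])
kg.add_film("Film B", ["Actor X"])
kg.add_film("Film C", ["Actor Z"])
```

A request naming an actor could then be answered with `kg.films_starring(...)`, while a request naming a film could be extended with `kg.related_films(...)`.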

By collecting training metadata such as loanwords and buzzwords, and converting the collected metadata for each regional dialect and each language, the recognition capability of the speech recognition system keeps pace with the times and user voice requests are fully understood, reflecting a high degree of intelligence.

In smart-device scenarios there are often program names or input-source names that cannot be expressed in ordinary language, for example input sources such as HDMI1, YPbPr, Component, AV1, or Composite1; program names such as SBS1 and Channel7; and website names such as www.sohu.com, www.zaobao.com, and Wealth net www.18.com.cn. These metadata are aggregated on the data collection platform and uploaded to the speech recognition platform for intelligent understanding and machine training, which improves the recognition accuracy of the speech recognition system. Through automatic language recognition, natural language understanding, and intent output, this helps improve voice control of TV functions and achieves "direct voice access to functions".

In the speech recognition method provided by this embodiment, the speech recognition system is trained on training metadata acquired in real time to obtain a pre-trained speech recognition system. Because a system obtained in this way learns each newly emerging item of metadata promptly, it has higher recognition accuracy. When a voice request is obtained, speech recognition is performed on it based on the pre-trained speech recognition system to obtain the corresponding intent information. This scheme improves the accuracy of speech recognition.

Further, on the basis of the above embodiment, refer to the schematic diagram of a training metadata collection process shown in Fig. 2. The metadata collection platform 210 collects video metadata from multiple film and television content service providers 200; collects loanword metadata (such as "mini" and "taxi") through the loanword collection program 201; collects buzzword metadata (such as "workplace rookie") through the buzzword collection program 202; and collects smart-device term metadata (such as input sources HDMI1, YPbPr, Component, AV1, or Composite1, and program names such as SBS1 and Channel7) through the smart-device term collection program 203. The collected metadata is exported in a set format to generate metadata files 220 (including multilingual metadata files and multi-dialect metadata files). The metadata files are then uploaded to the speech recognition platform 300 and stored in the designated storage space 230 of the platform. The speech recognition platform 300 builds a knowledge graph 240 from the stored metadata and trains the speech recognition system 250 in real time in combination with the knowledge graph 240. When a user's voice request 260 is received, it is recognized by the trained speech recognition system 250 to obtain the corresponding intent information 270.

A metadata collection platform is built to collect, in real time, the video metadata updated by video service providers and to output metadata files by region (that is, by dialect) and by language for the speech recognition platform to use in natural language recognition and natural language understanding. The speech recognition system can thus be trained on new metadata promptly and can interpret requests with the help of the metadata knowledge graph, so it identifies the intent of a user's voice request more accurately, makes it easier for users to access programs and video resources in their local language, raises the intelligence of the speech recognition system, and improves the user experience. Adding the collection of loanwords, buzzwords, websites, and smart-device local control metadata increases the variety of training metadata and broadens the range of applications of the speech recognition system.

Embodiment two

Fig. 3 is a schematic structural diagram of a speech recognition apparatus provided by Embodiment two of the present invention. Referring to Fig. 3, the apparatus includes an acquisition module 310 and a recognition module 320.

The acquisition module 310 is configured to obtain a voice request. The recognition module 320 is configured to perform speech recognition on the voice request based on a pre-trained speech recognition system to obtain the intent information corresponding to the voice request, wherein the pre-trained speech recognition system is obtained by real-time training based on training metadata acquired in real time.
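The two-module structure can be pictured with the following sketch. It is illustrative only: the class names, the `Intent` type, and the stub recognizer are assumptions, and the actual pre-trained speech recognition system is represented by an injected callable rather than implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Intent:
    """Intent information for a voice request (hypothetical shape)."""
    action: str
    slots: Dict[str, str] = field(default_factory=dict)


class AcquisitionModule:
    """Counterpart of module 310: obtains the voice request."""

    def __init__(self, source: Callable[[], bytes]):
        self._source = source

    def get_voice_request(self) -> bytes:
        return self._source()


class RecognitionModule:
    """Counterpart of module 320: maps a voice request to intent information."""

    def __init__(self, recognizer: Callable[[bytes], Intent]):
        self._recognizer = recognizer  # stands in for the pre-trained system

    def recognize(self, audio: bytes) -> Intent:
        return self._recognizer(audio)


class SpeechRecognitionApparatus:
    def __init__(self, acquisition: AcquisitionModule, recognition: RecognitionModule):
        self.acquisition = acquisition
        self.recognition = recognition

    def handle_request(self) -> Intent:
        return self.recognition.recognize(self.acquisition.get_voice_request())


# Stub microphone and stub recognizer, for illustration only.
apparatus = SpeechRecognitionApparatus(
    AcquisitionModule(lambda: b"fake-audio"),
    RecognitionModule(lambda audio: Intent("switch_channel", {"channel": "sports"})),
)
intent = apparatus.handle_request()
```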

Further, the apparatus also includes:

a metadata acquisition module, configured to acquire training metadata in real time;

wherein the training metadata includes at least one of video metadata, buzzword metadata, loanword metadata, and smart-device local control metadata.

Further, the metadata acquisition module includes:

a building unit, configured to build a training metadata collection platform on target network infrastructure based on Spring Cloud;

an acquisition unit, configured to call an application programming interface (API) through the training metadata collection platform to access OTT websites in real time, so as to acquire the training metadata.

Further, the acquisition unit includes:

an acquisition subunit, configured to call the API through the training metadata collection platform to access OTT websites in real time and to obtain video tag information based on preset rules;

a parsing subunit, configured to parse the metadata of the video according to the video tag information.

Further, the acquisition subunit is specifically configured to:

obtain a set quantity of video tag information;

or, obtain video tag information according to the update time of the videos.

Further, the apparatus also includes:

a generation module, configured to generate target files from the training metadata acquired in real time based on different languages, or to generate target files from the training metadata acquired in real time based on regional dialects;

an upload module, configured to upload the target files to the speech recognition platform in real time, so that the speech recognition system is trained in real time on the training metadata acquired in real time.

Further, the upload module is specifically configured to:

upload the full version of the target file to the speech recognition platform in real time;

or, upload to the speech recognition platform the difference metadata between the current version of the target file and the previous version;

wherein the target file includes file version information.

In the speech recognition apparatus provided by this embodiment, the speech recognition system is trained on training metadata acquired in real time to obtain a pre-trained speech recognition system. Because a system obtained in this way promptly learns each newly emerging item of metadata, as well as metadata in different languages and regional dialects, it has higher recognition accuracy. When a voice request is obtained, speech recognition is performed on it based on the pre-trained speech recognition system to obtain the corresponding intent information. This scheme improves the accuracy of speech recognition.

Embodiment three

Fig. 4 is a schematic structural diagram of an electronic device provided by Embodiment three of the present invention. Fig. 4 shows a block diagram of an exemplary electronic device 12 suitable for implementing embodiments of the present invention. The electronic device 12 shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of embodiments of the present invention.

As shown in Fig. 4, the electronic device 12 takes the form of a general-purpose computing device. The components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The electronic device 12 typically includes a variety of computer-system-readable media. These media can be any available media accessible to the electronic device 12, including volatile and non-volatile media and removable and non-removable media.

The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 can be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly called a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical media) can be provided. In these cases, each drive can be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of program modules (such as the acquisition module 310 and the recognition module 320 of the speech recognition apparatus) configured to perform the functions of the embodiments of the present invention.

A program/utility 40 having a set of program modules 42 (such as the acquisition module 310 and the recognition module 320 of the speech recognition apparatus) can be stored in, for example, the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.

The electronic device 12 can also communicate with one or more external devices 14 (such as a keyboard, a pointing device, or a display 24), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (such as a network card or modem) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication can take place through input/output (I/O) interfaces 22. The electronic device 12 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processing unit 16 runs the programs stored in the system memory 28 to execute various functional applications and data processing, for example implementing the speech recognition method provided by embodiments of the present invention, the method including:

obtaining a voice request;

Of course, those skilled in the art will understand that the processor can also implement the technical solution of the speech recognition method provided by any embodiment of the present invention.

Embodiment four

Embodiment four of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the speech recognition method provided by embodiments of the present invention, the method including:

obtaining a voice request;

Of course, the computer program stored on the computer-readable storage medium provided by embodiments of the present invention is not limited to the method operations described above, and can also perform related operations in the speech recognition method provided by any embodiment of the present invention.

The computer storage medium of embodiments of the present invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.

The program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.

Computer program code for carrying out the operations of the present invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the invention. Therefore, although the invention has been described in some detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims

1. A speech recognition method, characterized by comprising:

obtaining a voice request;

performing speech recognition on the voice request based on a pre-trained speech recognition system to obtain the intent information corresponding to the voice request;

wherein the pre-trained speech recognition system is obtained by real-time training based on training metadata acquired in real time.

2. The method according to claim 1, characterized in that the method further comprises:

acquiring training metadata in real time;

wherein the training metadata includes at least one of video metadata, buzzword metadata, loanword metadata, and smart-device local control metadata.

3. The method according to claim 2, characterized in that acquiring training metadata in real time comprises:

calling an application programming interface (API) through a training metadata collection platform to access OTT websites in real time, so as to acquire the training metadata.

4. The method according to claim 3, characterized in that calling the API through the training metadata collection platform to access OTT websites in real time, so as to acquire the training metadata, comprises:

calling the API through the training metadata collection platform to access OTT websites in real time, and obtaining video tag information based on preset rules;

parsing the metadata of the video according to the video tag information.

5. The method according to claim 4, characterized in that obtaining video tag information based on preset rules comprises:

obtaining a set quantity of video tag information;

6. The method according to claim 2, characterized by further comprising:

generating target files from the training metadata acquired in real time based on different languages; or generating target files from the training metadata acquired in real time based on regional dialects; and uploading the target files to a speech recognition platform in real time, so that the speech recognition system is trained in real time on the training metadata acquired in real time.

7. The method according to claim 6, characterized in that uploading the target file to the speech recognition platform in real time comprises:

or, uploading to the speech recognition platform the difference metadata between the current version of the target file and the previous version;

wherein the target file includes file version information.

8. A speech recognition apparatus, characterized by comprising:

an acquisition module, configured to obtain a voice request;

a recognition module, configured to perform speech recognition on the voice request based on a pre-trained speech recognition system to obtain the intent information corresponding to the voice request;

9. An electronic device, characterized in that the electronic device comprises:

one or more processors;

a storage device, configured to store one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method according to any one of claims 1-7.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech recognition method according to any one of claims 1-7.