CN107786549B - Method, apparatus, system, and computer-readable medium for adding an audio file - Google Patents


Info

Publication number
CN107786549B
CN107786549B (application CN201710958076.9A)
Authority
CN
China
Prior art keywords
gesture
information
audio file
image
image information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710958076.9A
Other languages
Chinese (zh)
Other versions
CN107786549A (en)
Inventor
Cong Yao (姚聪)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Beijing Maigewei Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Beijing Maigewei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd, Beijing Maigewei Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN201710958076.9A priority Critical patent/CN107786549B/en
Publication of CN107786549A publication Critical patent/CN107786549A/en
Application granted granted Critical
Publication of CN107786549B publication Critical patent/CN107786549B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/61Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • G06F16/636Filtering based on additional data, e.g. user or group profiles by using biological or physiological data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/75Media network packet handling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention provides a method, apparatus, system, and computer-readable medium for adding an audio file, relating to the technical field of multimedia information. The method comprises: obtaining image information to be recognized; performing gesture detection and recognition on the image information to obtain gesture information of a gesture, wherein the gesture information includes at least one of: the position information of the gesture, the type information of the gesture, and the real-time duration of the gesture; and determining, based on the gesture information, an audio file that matches the gesture, so as to add the audio file to the image information. The present invention alleviates the technical problem that live video or short video in the prior art cannot have audio effects added on the basis of gesture recognition.

Description

Method, apparatus, system, and computer-readable medium for adding an audio file
Technical field
The present invention relates to the technical field of multimedia information, and in particular to a method, apparatus, and system for adding an audio file, and a computer-readable medium.
Background technique
With the rapid development of the Internet, users increasingly prefer to communicate over networks. In recent years, one popular form of exchange has been live network broadcasting: a network anchor can broadcast to an audience in real time through a live-streaming platform. Besides live broadcasting, users can also upload prerecorded short videos to the network so that viewers can select and browse them on demand. In live-streaming and short-video scenarios, facial expressions and gestures are, apart from voice and text, among the most common forms of exchange and interaction. For example, an anchor may show various expressions such as happiness, anger, or surprise, and may also make various gestures such as a heart sign, OK, a victory sign, or a thumbs-up. Such expressions and gestures often make live streams and short videos more interesting and attractive, but the prior art has not yet proposed a scheme for adding special effects (including audio and animation) to live video on the basis of gesture recognition.
No effective solution to the above problem has yet been proposed.
Summary of the invention
In view of this, an object of the present invention is to provide a method, apparatus, and system for adding an audio file, and a computer-readable medium, so as to alleviate the technical problem that live video or short video in the prior art cannot have audio effects added on the basis of gesture recognition.
In a first aspect, an embodiment of the present invention provides a method for adding an audio file, comprising: obtaining image information to be recognized; performing gesture detection and recognition on the image information to obtain gesture information of a gesture, wherein the gesture information includes at least one of: the position information of the gesture, the type information of the gesture, and the real-time duration of the gesture; and determining, based on the gesture information, an audio file that matches the gesture, so as to add the audio file to the image information.
Further, performing gesture detection and recognition on the image information to obtain the gesture information of the gesture includes: detecting and recognizing, in the image information, the position information of the gesture and the type information of the gesture; and determining the real-time duration of the gesture based on the position information and/or the type information.
Further, detecting and recognizing the position information of the gesture and the type information of the gesture in the image information includes: detecting the gesture in each image frame of the image information by means of a first neural network model, so as to detect the position information of the gesture; and recognizing, by means of a second neural network model, the type information of the gesture detected in each image frame.
Further, detecting and recognizing the position information of the gesture and the type information of the gesture in the image information includes: detecting the position information of the gesture every N frames among all the image frames of the image information by means of a third neural network model; and recognizing, by means of a fourth neural network model, the type information of each detected gesture.
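The skip-frame scheme above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `detect` and `classify` are hypothetical callables standing in for the third and fourth network models, and reusing the last detected boxes for the frames in between is only one plausible reading, since the patent does not fix what happens between detections.

```python
def skip_frame_detect(frames, detect, classify, n=5):
    """Skip-frame mode: run the (heavier) detection model only on every
    N-th frame and reuse its boxes for the frames in between, while the
    recognition model still classifies the gesture in every frame."""
    results, boxes = [], []
    for i, frame in enumerate(frames):
        if i % n == 0:          # detection runs on frames 0, n, 2n, ...
            boxes = detect(frame)
        results.append([classify(frame, box) for box in boxes])
    return results
```

The design trade-off is the usual one: larger N lowers compute cost per frame but lets the cached box drift away from a fast-moving hand.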
Further, the method also includes: obtaining a training sample, wherein the training sample contains images of multiple gestures, the images being acquired for each kind of gesture under different conditions, with the position and/or type of the gesture annotated in advance in each image; and training the neural network model based on the training sample to obtain the trained neural network model.
Further, determining the real-time duration of the gesture based on the position information and/or the type information includes: counting, in real time, the target image frames in the image information that satisfy an association relation, wherein the association relation is a relation determined based on the position information and/or the type information; and determining the duration of the target image frames as the real-time duration of the gesture.
Further, the association relation includes: the gesture type of the gesture in a first image frame is the same as the gesture type of the gesture in a second image frame; and/or the overlap between the gesture position in the first image frame and the gesture position in the second image frame is greater than a preset overlap; wherein the first image frame and the second image frame are any two adjacent image frames among the target image frames.
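The overlap ("registration") between gesture positions in adjacent frames can be measured as intersection-over-union. A minimal sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the 0.5 threshold is illustrative, and requiring both criteria in `frames_associated` is just one reading of the claim's "and/or".

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def frames_associated(frame_a, frame_b, min_overlap=0.5):
    """Adjacent frames belong to the same gesture run when the gesture
    type matches and the boxes overlap sufficiently (one reading of the
    claim's 'and/or')."""
    same_type = frame_a["type"] == frame_b["type"]
    overlapping = iou(frame_a["box"], frame_b["box"]) > min_overlap
    return same_type and overlapping
```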
Further, determining, based on the gesture information, the audio file that matches the gesture includes: in the case where the real-time duration is greater than a preset time, searching a database for the audio file that matches the type information and/or the position information.
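The duration-gated lookup can be sketched as follows. The gesture names, file names, and one-second threshold are hypothetical placeholders; the patent only requires that each audio file correspond to one or more kinds of gesture information and that matching happen once the preset time is exceeded.

```python
# Hypothetical in-memory stand-in for the audio database described above.
AUDIO_LIBRARY = {
    "thumbs_up": "applause.mp3",
    "heart": "cheer.mp3",
    "ok": "ding.mp3",
}

def match_audio(gesture_type, duration_s, min_duration_s=1.0):
    """Return the matching audio file only once the gesture has persisted
    longer than the preset time; otherwise return None."""
    if duration_s <= min_duration_s:
        return None
    return AUDIO_LIBRARY.get(gesture_type)
```

Gating on duration avoids firing a sound effect on a single spurious detection.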
Further, adding the audio file to the image information includes: in the case where the image information is image information of a live video, adding the audio file to the image information; and in the case where the image information is image information of a non-live video, embedding the audio file at a target position of the image information for playback, wherein the target position is the moment at which the gesture triggers the playing of the audio file.
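The live/non-live distinction above can be expressed as a small placement routine. A sketch only, with hypothetical field names: for a live stream the audio is simply added, while for a recorded video it is embedded at the triggering moment.

```python
def place_audio(image_info, audio_file, is_live, trigger_time_s=None):
    """Live video: mix the audio straight into the stream. Non-live
    (recorded) video: embed the audio at the target position, i.e. the
    moment the gesture triggered it."""
    if is_live:
        return {"video": image_info, "audio": audio_file}
    return {"video": image_info, "audio": audio_file,
            "embed_at_s": trigger_time_s}
```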
In a second aspect, an embodiment of the present invention also provides an apparatus for adding an audio file, comprising: an obtaining unit for obtaining image information to be recognized; a tracking-and-capture unit for performing gesture detection and recognition on the image information to obtain the gesture information of the gesture, wherein the gesture information includes at least one of: the position information of the gesture, the type information of the gesture, and the real-time duration of the gesture; and a determination unit for determining, based on the gesture information, an audio file that matches the gesture, so as to add the audio file to the image information.
In a third aspect, an embodiment of the present invention also provides a system for adding an audio file, the system comprising: an image acquisition device, a processor, and a storage device; the image acquisition device is configured to acquire image information to be recognized; the storage device stores a computer program which, when run by the processor, executes the method described above.
In a fourth aspect, an embodiment of the present invention also provides a computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to execute the method described above.
In embodiments of the present invention, image information to be recognized is obtained first; then gesture detection and recognition are performed on the image information to obtain the gesture information of the gesture; and next, based on the gesture information, an audio file that matches the gesture is determined, so as to add the audio file to the image information. In embodiments of the present invention, the addition of the audio file is achieved by recognizing gestures; the method provided by the present invention is simple to implement and highly interactive, and thus alleviates the technical problem that live video or short video in the prior art cannot have audio effects added on the basis of gesture recognition.
Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood through practice of the invention. The objectives and other advantages of the invention are achieved and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
To make the above objects, features, and advantages of the present invention clearer and more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Detailed description of the invention
To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the specific embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an electronic device according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for adding an audio file according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a training sample according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an apparatus for adding an audio file according to an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiment one:
First, an example electronic device 100 for implementing the method and apparatus for adding an audio file according to embodiments of the present invention is described with reference to Fig. 1.
As shown in Fig. 1, the electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, and an image acquisition device 110, which are interconnected by a bus system 112 and/or another form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in Fig. 1 are only exemplary, not restrictive; the electronic device may also have other components and structures as needed.
The processor 102 may be a central processing unit (CPU) or a processing unit of another form having data-processing capability and/or instruction-execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to realize the client functionality (implemented by the processor) in the embodiments of the present invention described below and/or other desired functions. Various applications and various data, such as the data used and/or generated by the applications, may also be stored on the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (for example, images or sounds) to the outside (for example, to a user), and may include one or more of a display, a loudspeaker, and the like.
The image acquisition device 110 may acquire image information to be recognized and store the acquired image information in the storage device 104 for use by other components.
Illustratively, the example electronic device for implementing the method and apparatus for adding an audio file according to embodiments of the present invention may be implemented on a mobile terminal such as a smartphone or a tablet computer.
Embodiment two:
According to an embodiment of the present invention, an embodiment of a method for adding an audio file is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
Fig. 2 is a flowchart of a method for adding an audio file according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step S102: obtain image information to be recognized;
In embodiments of the present invention, the image information to be recognized may be image information of a network anchor during a live broadcast, or image information in a prerecorded short video.
Step S104: perform gesture detection and recognition on the image information to obtain the gesture information of a gesture, wherein the gesture information includes at least one of: the position information of the gesture, the type information of the gesture, and the real-time duration of the gesture;
In embodiments of the present invention, after the image information is obtained, tracking and capture can be performed on it to capture the gesture and obtain its gesture information.
Here, the position information denotes the position of the gesture in the current image information to be recognized; the type information is the type of the gesture — for example, a heart sign, OK, a victory sign, and a thumbs-up belong to different types; and the real-time duration denotes the continuous duration of any gesture — for example, if a heart sign lasts continuously for 3 seconds in the image information, those 3 seconds are its real-time duration, and likewise for the continuous duration of a victory sign in the image information.
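The three fields of the gesture information described above can be collected in a small record type. A sketch only; the field and gesture names are illustrative, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class GestureInfo:
    """The three items the method may report for a recognized gesture."""
    box: tuple          # position in the current frame, e.g. (x1, y1, x2, y2)
    gesture_type: str   # e.g. "heart", "ok", "victory", "thumbs_up"
    duration_s: float   # how long the gesture has persisted, in seconds
```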
Step S106: determine, based on the gesture information, an audio file that matches the gesture, so as to add the audio file to the image information.
In embodiments of the present invention, an audio library is established in advance for storing prerecorded audio files, such as applause, sounds of praise, whistles, and cheers, with each kind of audio file corresponding to one or more kinds of gesture information.
In embodiments of the present invention, image information to be recognized is obtained first; then gesture detection and recognition are performed on the image information to obtain the gesture information of the gesture; and next, based on the gesture information, an audio file that matches the gesture is determined, so as to add the audio file to the image information. In embodiments of the present invention, the addition of the audio file is achieved by recognizing gestures; the method provided by the present invention is simple to implement and highly interactive, and thus alleviates the technical problem that live video or short video in the prior art cannot have audio effects added on the basis of gesture recognition.
It should be noted that, in embodiments of the present invention, if the image information is image information of a live video, two kinds of terminals are mainly involved: an anchor-side terminal and a viewer-side terminal. In embodiments of the present invention, the method described in steps S102 to S106 can be initiated from either the anchor-side terminal or the viewer-side terminal.
Anchor-side terminal:
The anchor can enable a gesture automatic-recognition mode on the anchor-side terminal. After this mode is enabled, the anchor-side server obtains the live video and performs tracking and capture on it to obtain the gesture information of the gesture. Then the anchor-side server determines, based on the gesture information, the audio file that matches the gesture, so as to add the audio file to the image information.
Viewer-side terminal:
When a viewer watches the anchor's live content, the gesture automatic-recognition mode can be enabled on the viewer-side terminal. After this mode is enabled, the viewer-side server obtains the live video and performs tracking and capture on it to obtain the gesture information of the gesture. Then the viewer-side server determines, based on the gesture information, the audio file that matches the gesture, and sends the audio file to the anchor-side server, so that the anchor-side server adds the audio file to the image information.
In embodiments of the present invention, to further enhance the interest and attraction of live streams and short videos, various gestures can be recognized automatically by an algorithm and the corresponding audio added. On the one hand, with existing image techniques, a gesture is a kind of information that is direct and easy to capture; on the other hand, gesture-based interaction is very natural and, combined with specific audio, can achieve a very good effect. The method provided by the embodiments of the present invention is introduced in detail below with reference to specific embodiments.
In embodiments of the present invention, one or more neural network models need to be constructed in advance for detecting and recognizing gestures, where detection means detecting the position information of the gesture and recognition means recognizing the type information of the gesture. After the neural network models are constructed, they need to be trained, and for this purpose a training sample needs to be constructed.
In embodiments of the present invention, when constructing the training sample, images containing various gestures are collected (one such training sample is shown in Fig. 3), and images of each kind of gesture under different conditions are acquired — for example, under different illumination and different processing modes, where the processing modes include operations such as translation, scaling, and rotation. For each image, the position and/or category of the gesture is annotated in advance. After the positions and categories of the gestures are annotated, the training sample can be constructed from the annotated images. In embodiments of the present invention, no fewer than 1000 images may be chosen to compose the training sample.
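The translation, scaling, and rotation operations named above must be applied consistently to the annotations as well as the pixels. A minimal sketch of the geometry for a single labelled point (such as a gesture-box corner); the function name, parameter defaults, and transform order (rotate, then scale, then translate) are illustrative assumptions, not details fixed by the patent.

```python
import math

def augment_point(x, y, dx=0.0, dy=0.0, scale=1.0, angle_deg=0.0):
    """Apply rotation, scaling, then translation to a labelled point, so
    one annotated image yields several training samples whose labels stay
    consistent with the transformed pixels."""
    a = math.radians(angle_deg)
    rx = x * math.cos(a) - y * math.sin(a)   # rotate about the origin
    ry = x * math.sin(a) + y * math.cos(a)
    return rx * scale + dx, ry * scale + dy  # then scale and translate
```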
After the training sample is constructed, the one or more neural network models can be trained with it, the output of the one or more neural network models being the position information of the gesture and the type information of the gesture.
In embodiments of the present invention, selectable detection and recognition algorithms include the Faster-RCNN object-detection algorithm, the SSD object-detection algorithm (single shot multibox detector), and the YOLO object-detection algorithm. These algorithms can accomplish object detection and recognition at the same time and support targets of different categories; in embodiments of the present invention, the targets of different categories are gestures of different categories. It should be noted that, in addition to the above object-detection algorithms, other object-detection algorithms may also be used, and this is not specifically limited here.
In embodiments of the present invention, after the one or more neural network models are trained and the gesture automatic-recognition mode is enabled, the image information to be recognized is obtained, and then gesture detection and recognition are performed on the image information to obtain the gesture information of the gesture.
In an optional implementation of the embodiments of the present invention, performing gesture detection and recognition on the image information to obtain the gesture information of the gesture includes the following steps:
Step S1: detect and recognize, in the image information, the position information of the gesture and the type information of the gesture;
Step S2: determine the real-time duration of the gesture based on the position information and/or the type information.
In embodiments of the present invention, the gesture is first detected in the image information, and its position information and type information are recognized, where the position information is the position of the gesture in the image and the type information is the meaning represented by the gesture — for example, victory or applause.
After the position information and type information of the gesture are determined, the real-time duration of the gesture can be determined based on at least one of the position information and the type information. In embodiments of the present invention, the real-time duration is the continuous duration of the gesture in the image information.
In another optional implementation of the embodiments of the present invention, the position information and the type information of the gesture can be detected and recognized in the image information in two ways, namely a single-frame processing mode and a skip-frame processing mode. Both processing modes are introduced below.
1. Single-frame processing mode
When the single-frame processing mode is used to detect and recognize the position information and type information of the gesture, the specific detection process is as follows:
Step S11: detect the gesture in each image frame of the image information by means of the first neural network model, and detect the position information of the gesture;
Step S12: recognize, by means of the second neural network model, the type information of the gesture detected in each image frame.
It should be noted that the single-frame processing mode means that every image frame in the image information is detected and recognized in the manner described in steps S11 and S12.
In embodiments of the present invention, the position information of the gesture is first detected in each image frame by the pre-trained first neural network model, and the type information of the gesture is recognized by the second neural network model; the position information and the type information are the outputs of these models, and the position information of the gesture can be represented in the form of a rectangular box.
For example, picture frame 2 ..., picture frame n is all made of at aforesaid way for the picture frame 1 in image information Reason.Specifically, the picture frame 1 that can be will test is input in first nerves network model, so that first nerves network model The location information of detection gesture in picture frame 1, and pass through the type information of nervus opticus network model identification gesture.It can be with The picture frame 2 that will test is input in first nerves network model, so that first nerves network model detects in picture frame 2 The location information of gesture, and pass through the type information of nervus opticus network model identification gesture.And so on, for image information In other picture frames, be all made of aforesaid way and handled, details are not described herein again.
It should be noted that first nerves network model and nervus opticus network model can be identical model, may be used also To be different model, and first nerves network model and nervus opticus network model are the neural network model of foregoing description In all or part of model.
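As a rough illustration of the single-frame processing mode, the sketch below runs a stand-in detector and classifier on every frame. The dictionary frames and the `detect_gesture`/`classify_gesture` helpers are hypothetical placeholders, not the first and second neural network models of this embodiment:

```python
def detect_gesture(frame):
    # Stand-in for the first neural network model: return the gesture's
    # rectangular box (x, y, w, h) in this frame, or None if absent.
    return frame.get("box")

def classify_gesture(frame):
    # Stand-in for the second neural network model: return the gesture
    # type, e.g. "victory" or "applause".
    return frame.get("type")

def process_single_frame(frames):
    # Single-frame mode: every image frame is detected and classified.
    results = []
    for frame in frames:
        box = detect_gesture(frame)
        gesture_type = classify_gesture(frame) if box is not None else None
        results.append((box, gesture_type))
    return results

frames = [{"box": (10, 10, 50, 50), "type": "victory"},
          {"box": (12, 11, 50, 50), "type": "victory"}]
print(process_single_frame(frames))
```

The point of the sketch is only the control flow: in single-frame mode both models are applied to every frame without exception.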
After the location information and type information of the gesture are obtained through the above single-frame processing mode, the real-time duration of the gesture can be determined based on the location information and/or the type information.
In this case, the process of determining the real-time duration of the gesture based on the location information and/or the type information is as follows:
First, the target image frames that satisfy an association relation in the image information are counted in real time, where the association relation is a relation determined based on the location information and/or the type information;
Then, the duration of the target image frames is determined as the real-time duration of the gesture.
The association relation includes:
the gesture type of the gesture in a first image frame is identical to the gesture type of the gesture in a second image frame; and/or
the overlap between the gesture position in the first image frame and the gesture position in the second image frame is greater than a preset overlap;
where the first image frame and the second image frame are any two adjacent image frames among the target image frames.
Assume that image frame 1 to image frame 100 of the image information contain the same gesture, that is, the gesture of victory; and image frame 101 to image frame 200 of the image information likewise contain one identical gesture, for example, the gesture of applause.
In embodiments of the present invention, first, the location information of the gesture is detected in each image frame of the image information through the first neural network model, and the type information of the gesture is identified through the second neural network model.
For example, location information 1 and type information 1 of the gesture are detected and identified in the current image frame 75 (that is, the 75th image frame in the image information); at this point, it can be judged whether location information 1 and type information 1 satisfy the association relation with location information 2 and type information 2 detected and identified in the previous image frame 74 (that is, the 74th image frame in the image information). For example, it is judged whether the overlap between location information 1 and location information 2 is greater than the preset overlap; or whether type information 1 is identical to type information 2; or both judgments are made. If the judgment shows that location information 1, type information 1, location information 2 and type information 2 satisfy any one of the above association relations, it is determined that the current image frame 75 and the previous image frame 74 are target image frames satisfying the association relation. At this point, the statistics show that, when the 75th image frame is determined to be a target image frame, the playback time of these 75 image frames is 3 seconds, that is, the real-time duration of the gesture is 3 seconds.
It should be noted that, when the current image frame 75 and the previous image frame 74 are respectively the above first image frame and second image frame, location information 1 and type information 1 are the location information and type information of the first image frame, and location information 2 and type information 2 are the location information and type information of the second image frame.
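A minimal sketch of the association relation between two adjacent target image frames might look as follows. The box format `(x, y, w, h)`, the intersection-over-union measure of overlap and the threshold value are illustrative assumptions, since the embodiment only requires that the position overlap exceed a preset overlap and/or that the gesture types be identical:

```python
def overlap(a, b):
    # Intersection-over-union of two rectangular boxes (x, y, w, h);
    # used here as the measure of coincidence between gesture positions.
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def associated(prev, cur, preset_overlap=0.5):
    # The association relation holds when the gesture types are identical
    # and/or the position overlap exceeds the preset overlap.
    return (prev["type"] == cur["type"]
            or overlap(prev["box"], cur["box"]) > preset_overlap)

frame_74 = {"box": (10, 10, 50, 50), "type": "victory"}
frame_75 = {"box": (12, 11, 50, 50), "type": "victory"}
print(associated(frame_74, frame_75))  # True
```

Counting consecutive frames for which `associated` holds, multiplied by the frame interval, would give the real-time duration of the gesture.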
In embodiments of the present invention, when the real-time duration is greater than a preset time, the audio file matching the gesture type is searched for in the database.
Specifically, assume that the preset time is 3 seconds. When the real-time duration of the gesture exceeds 3 seconds, the addition of an audio file is triggered; at this point, the audio file matching the gesture is searched for in the audio library (that is, the database), and the audio file is added to the image information. That is, in the above embodiment, when the first neural network model and the second neural network model have detected and recognized the 75th image frame, the addition of an audio file is triggered, and the audio file matching the gesture is searched for in the audio library.
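The trigger condition and the lookup can be sketched as below; the audio library contents and the strict "greater than the preset time" comparison are illustrative assumptions:

```python
# Hypothetical audio library (the database): gesture type -> audio file.
AUDIO_LIBRARY = {"victory": "cheer.mp3", "applause": "clap.mp3"}

def match_audio(gesture_type, real_time_duration, preset_time=3.0):
    # Trigger the addition of an audio file only once the gesture's
    # real-time duration exceeds the preset time, then look up the
    # matching file in the audio library.
    if real_time_duration > preset_time:
        return AUDIO_LIBRARY.get(gesture_type)
    return None

print(match_audio("victory", 3.5))  # cheer.mp3
print(match_audio("victory", 2.0))  # None
```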
In embodiments of the present invention, after the addition of the audio file is completed, the subsequent image frames in the image information continue to be identified, so that the gesture information of the gesture continues to be detected and identified in the image information.
2. Frame-skipping processing mode
In embodiments of the present invention, in order to save time and improve efficiency, a frame-skipping processing mode can also be used. When the location information and type information of the gesture are detected using the frame-skipping processing mode, the detection process is as follows:
Step S21: detect the gesture every N frames in all the image frames of the image information through a third neural network model, and detect the location information of the gesture;
Step S22: identify, through a fourth neural network model, the type information of the detected gesture.
It should be noted that the frame-skipping processing mode means that only some of the image frames in the image information are detected and identified in the manner described in step S21 and step S22.
For example, the location information of the gesture is first detected in the 1st image frame through a pre-trained third neural network model, and the type information of the gesture is identified through a fourth neural network model, where the location information and the type information are the outputs of the third and fourth neural network models respectively, and the location information of the gesture can be represented in the form of a rectangular box. Then, the gesture and its location information and type information are detected in the 5th image frame, where N is 3. Under normal circumstances, the value of N lies in the interval [1, 3]. By analogy, the next detection is performed in the 9th image frame. The other image frames in the image information are processed in the same way, which is not repeated here.
It should be noted that the third neural network model and the fourth neural network model can be the same model or different models, and both are all or part of the neural network models described above.
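The frame-skipping schedule can be sketched as follows; the per-frame stand-ins mirror the single-frame sketch and are again hypothetical placeholders for the third and fourth neural network models:

```python
def process_with_frame_skipping(frames, n=3):
    # Frame-skipping mode: only every (n+1)-th frame is detected and
    # classified -- for n == 3 that is the 1st, 5th, 9th, ... frame
    # (indices 0, 4, 8, ...).
    results = []
    for i in range(0, len(frames), n + 1):
        frame = frames[i]
        # Stand-ins for the third (box) and fourth (type) models.
        results.append((i, frame.get("box"), frame.get("type")))
    return results

frames = [{"box": (0, 0, 5, 5), "type": "victory"} for _ in range(9)]
print([i for i, _, _ in process_with_frame_skipping(frames)])  # [0, 4, 8]
```

Compared with the single-frame mode, only a fraction of the frames run through the models, which is the source of the time saving claimed in this embodiment.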
After the location information and type information of the gesture are obtained through the above frame-skipping processing mode, the real-time duration of the gesture can be determined based on the location information and/or the type information.
In this case, the process of determining the real-time duration of the gesture based on the location information and/or the type information is as follows:
First, the target image frames that satisfy an association relation in the image information are determined, where the association relation is a relation determined based on the location information and/or the type information;
Then, the duration of the target image frames is determined as the real-time duration of the gesture.
The association relation includes:
the gesture type of the gesture in a first image frame is identical to the gesture type of the gesture in a second image frame; and/or
the overlap between the gesture position in the first image frame and the gesture position in the second image frame is greater than a preset overlap;
where the first image frame and the second image frame are any two adjacent image frames among the target image frames.
Assume that image frame 1 to image frame 100 of the image information contain the same gesture, that is, the gesture of victory; and image frame 101 to image frame 200 of the image information likewise contain one identical gesture, for example, the gesture of applause.
In embodiments of the present invention, first, the location information of the gesture is detected every N frames in all the image frames of the image information through the third neural network model, and the type information of the gesture is identified through the fourth neural network model.
For example, location information 1 and type information 1 of the gesture are detected and identified in the currently pending image frame 75 (that is, the 75th image frame in the image information); at this point, it can be judged whether location information 1 and type information 1 satisfy the association relation with location information 3 and type information 3 detected and identified in the previously processed image frame 71 (that is, the 71st image frame in the image information). For example, it is judged whether the overlap between location information 1 and location information 3 is greater than the preset overlap; or whether type information 1 is identical to type information 3; or both judgments are made. If the judgment shows that location information 1, type information 1, location information 3 and type information 3 satisfy any one of the above association relations, it is determined that the currently pending image frame 75 and the previously processed image frame 71 are target image frames satisfying the association relation. At this point, the statistics show that, when the 75th image frame is determined to be a target image frame, the playback time of these 75 image frames is 3 seconds, that is, the real-time duration of the gesture is 3 seconds.
It should be noted that, when the currently pending image frame 75 and the previously processed image frame 71 are respectively the above first image frame and second image frame, location information 1 and type information 1 are the location information and type information of the first image frame, and location information 3 and type information 3 are the location information and type information of the second image frame.
In embodiments of the present invention, when the real-time duration is greater than a preset time, the audio file matching the gesture type and/or the location information is searched for in the database.
Specifically, assume that the preset time is 3 seconds. When the real-time duration of the gesture exceeds 3 seconds, the addition of an audio file is triggered; at this point, the audio file matching the gesture is searched for in the audio library (that is, the database), and the audio file is added to the image information. That is, in the above embodiment, when the third neural network model and the fourth neural network model have respectively detected and recognized the 75th image frame, the addition of an audio file is triggered, and the audio file matching the gesture is searched for in the audio library.
In embodiments of the present invention, after the addition of the audio file is completed, the subsequent image frames in the image information continue to be identified, so that the gesture information of the gesture continues to be detected in the image information.
In another optional embodiment of the present invention, adding the audio file to the image information includes the following steps:
Step S1061: when the image information is the image information of a live video, add the audio file to the image information;
Step S1062: when the image information is the image information of a non-live video, embed the audio file at a target position of the image information for playback, where the target position is the moment at which the gesture triggers the playback of the audio file.
In embodiments of the present invention, when the real-time duration of the gesture exceeds the preset time (for example, 3 seconds), the corresponding audio is added. For example, if the "victory" gesture lasts for 3 seconds, a cheering sound is added. For a live video in a live-streaming scenario, after the corresponding audio information is obtained according to the gesture information, the corresponding audio information only needs to be played directly; for a non-live video in a short-video scenario, the audio file needs to be embedded at the target position of the original image information for playback.
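The branch between the two playback scenarios might be sketched like this; the dictionary representation of the image information and the returned action descriptors are illustrative assumptions:

```python
def add_audio_file(image_info, audio_file, trigger_time):
    # Live video: the matched audio only needs to be played directly.
    # Non-live video: the audio file is embedded at the target position,
    # i.e. the moment at which the gesture triggered the playback.
    if image_info.get("is_live"):
        return {"action": "play_directly", "file": audio_file}
    return {"action": "embed", "file": audio_file,
            "target_position": trigger_time}

print(add_audio_file({"is_live": True}, "cheer.mp3", 3.0))
print(add_audio_file({"is_live": False}, "cheer.mp3", 3.0))
```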
In embodiments of the present invention, gestures are automatically identified through computer vision technology without manual intervention, which is simple and easy to use; the gestures in the image information can be identified in real time, and audio files are then automatically added according to the type and duration of the gesture. Through this interactive and customized service, the interest and attractiveness of the video can be enhanced, and the business value of live videos and short videos can be increased.
Embodiment 2:
An embodiment of the present invention also provides an adding apparatus for audio files, which is mainly used to execute the adding method for audio files provided in the above content of the embodiments of the present invention. The adding apparatus for audio files provided in the embodiments of the present invention is described in detail below.
Fig. 4 is a schematic diagram of an adding apparatus for audio files according to an embodiment of the present invention. As shown in Fig. 4, the adding apparatus for audio files mainly includes: an acquiring unit 10, a tracking and capturing unit 20 and a determination unit 30, in which:
the acquiring unit 10 is configured to obtain the image information to be identified;
the tracking and capturing unit 20 is configured to carry out gesture detection and identification in the image information, so as to obtain the gesture information of the gesture, where the gesture information includes at least one of: the location information of the gesture, the type information of the gesture, and the real-time duration of the gesture; and
the determination unit 30 is configured to determine, based on the gesture information, the audio file matching the gesture, so as to add the audio file to the image information.
In embodiments of the present invention, the image information to be identified is obtained first; then, gesture detection and identification are carried out in the image information to obtain the gesture information of the gesture; next, the audio file matching the gesture is determined based on the gesture information, and the audio file is added to the image information. In embodiments of the present invention, the addition of the audio file is realized by identifying the gesture. The method provided by the embodiments of the present invention is simple to implement and highly interactive, thereby alleviating the technical problem in the existing technology that a live video or short video cannot add an audio special effect based on gesture identification.
Optionally, the tracking and capturing unit includes: a detection module, configured to detect and identify the location information of the gesture and the type information of the gesture in the image information; and a determining module, configured to determine the real-time duration of the gesture based on the location information and/or the type information.
Optionally, the detection module is configured to: detect the gesture in each image frame of the image information through a first neural network model, and detect the location information of the gesture; and identify, through a second neural network model, the type information of the gesture detected in each image frame.
Optionally, the detection module is also configured to: detect the location information of the gesture every N frames in all the image frames of the image information through a third neural network model; and identify, through a fourth neural network model, the type information of the detected gesture.
Optionally, the apparatus is also configured to: obtain training samples, where the training samples include images containing multiple gestures, the images of the multiple gestures are images of each kind of gesture acquired under different conditions, and the position and/or type of the gesture is marked in each image in advance; and train the neural network model based on the training samples to obtain the trained neural network model.
Optionally, the determining module is configured to: count in real time the target image frames that satisfy an association relation in the image information, where the association relation is a relation determined based on the location information and/or the type information; and determine the duration of the target image frames as the real-time duration of the gesture.
Optionally, the association relation includes: the gesture type of the gesture in a first image frame is identical to the gesture type of the gesture in a second image frame; and/or the overlap between the gesture position in the first image frame and the gesture position in the second image frame is greater than a preset overlap; where the first image frame and the second image frame are any two adjacent image frames among the target image frames.
Optionally, the determination unit is configured to: when the real-time duration is greater than a preset time, search in the database for the audio file matching the type information and/or the location information.
Optionally, the determination unit is also configured to: when the image information is the image information of a live video, add the audio file to the image information; and when the image information is the image information of a non-live video, embed the audio file at a target position of the image information for playback, where the target position is the moment at which the gesture triggers the playback of the audio file.
The technical effect and realization principle of the apparatus provided by the embodiments of the present invention are identical to those of the preceding method embodiments; for brevity, where the apparatus embodiments do not mention something, reference can be made to the corresponding content of the preceding method embodiments.
In another embodiment of the present invention, a tracking system for facial feature information is also provided, the system including: an image collecting apparatus, a processor and a storage apparatus;
the image collecting apparatus is configured to collect the image information to be identified;
a computer program is stored on the storage apparatus, and when the computer program is run by the processor, the method described in the above embodiments is executed.
It is apparent to those skilled in the art that, for convenience and brevity of description, for the specific working process of the system described above, reference can be made to the corresponding process in the preceding method embodiments, which is not repeated here.
In another embodiment of the present invention, a computer-readable medium having non-volatile program code executable by a processor is also provided, the program code causing the processor to execute the method described in the preceding method embodiments.
In addition, in the description of the embodiments of the present invention, unless otherwise specifically defined or limited, the terms "installation", "connected" and "connection" should be understood in a broad sense; for example, a connection may be a fixed connection, a detachable connection or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediary, or an internal connection between two elements. For those of ordinary skill in the art, the specific meaning of the above terms in the present invention can be understood according to the concrete situation.
In the description of the present invention, it should be noted that the orientation or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientation or positional relationships shown in the drawings, and are merely for convenience of describing the present invention and simplifying the description, rather than indicating or implying that the apparatus or element referred to must have a particular orientation or be constructed and operated in a particular orientation; therefore, they are not to be considered as limiting the present invention. In addition, the terms "first", "second" and "third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
It is apparent to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the system, apparatus and units described above, reference can be made to the corresponding processes in the preceding method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods can be realized in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other division manners in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate members may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to realize the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the function is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the existing technology, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that the embodiments described above are only specific embodiments of the present invention, intended to illustrate the technical solution of the present invention rather than to limit it, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the preceding embodiments, those skilled in the art should understand that anyone familiar with the technical field can still, within the technical scope disclosed by the present invention, modify the technical solutions recorded in the preceding embodiments, or readily conceive of variations, or make equivalent replacements of some of the technical features; and these modifications, variations or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An adding method for audio files, characterized by comprising:
obtaining image information to be identified;
carrying out gesture detection and identification in the image information to obtain gesture information of the gesture, wherein the gesture information includes: the location information of the gesture, the type information of the gesture, and the real-time duration of the gesture;
determining, based on the gesture information, the audio file matching the gesture, so as to add the audio file to the image information;
wherein carrying out gesture detection and identification in the image information to obtain the gesture information of the gesture includes:
detecting and identifying the location information of the gesture and the type information of the gesture in the image information;
determining the real-time duration of the gesture based on the location information and the type information;
and determining, based on the gesture information, the audio file matching the gesture includes:
when the real-time duration is greater than a preset time, searching in a database for the audio file matching the type information and the location information.
2. The method according to claim 1, wherein detecting and identifying the location information of the gesture and the type information of the gesture in the image information includes:
detecting the location information of the gesture in each image frame of the image information through a first neural network model;
identifying, through a second neural network model, the type information of the gesture detected in each image frame.
3. The method according to claim 1, wherein detecting and identifying the location information of the gesture and the type information of the gesture in the image information includes:
detecting the location information of the gesture every N frames in all the image frames of the image information through a third neural network model;
identifying, through a fourth neural network model, the type information of the detected gesture.
4. The method according to claim 2 or 3, characterized in that the method further includes:
obtaining training samples, wherein the training samples include images containing multiple gestures, the images of the multiple gestures are images of each kind of gesture acquired under different conditions, and the position and type of the gesture are marked in each image in advance;
training a neural network model based on the training samples to obtain the trained first neural network model, second neural network model, third neural network model and fourth neural network model.
5. The method according to any one of claims 1 to 2, characterized in that determining the real-time duration of the gesture based on the location information and the type information includes:
counting in real time the target image frames that satisfy an association relation in the image information, wherein the association relation is a relation determined based on the location information and the type information;
determining the duration of the target image frames as the real-time duration of the gesture.
6. The method according to claim 5, characterized in that the association relation includes:
the gesture type of the gesture in a first image frame is identical to the gesture type of the gesture in a second image frame;
the overlap between the gesture position in the first image frame and the gesture position in the second image frame is greater than a preset overlap;
wherein the first image frame and the second image frame are any two adjacent image frames among the target image frames.
7. The method according to claim 1, wherein adding the audio file to the image information includes:
when the image information is the image information of a live video, adding the audio file to the image information;
when the image information is the image information of a non-live video, embedding the audio file at a target position of the image information for playback, wherein the target position is the moment at which the gesture triggers the playback of the audio file.
8. a kind of adding set of audio file characterized by comprising
Acquiring unit, for obtaining image information to be identified;
Trace trap unit, for carrying out gestures detection and identification in described image information, to obtain the hand of the gesture Gesture information, wherein the gesture information includes: the location information of the gesture, the type information of the gesture, the gesture Real time duration;
Determination unit, the audio file for being matched based on gesture information determination with the gesture, in described image The audio file is added in information;
Wherein, the trace trap unit is used for: detecting and identify the location information of the gesture in described image information With the type information of the gesture;
The real time duration of the gesture is determined based on the location information and the type information;
The determination unit is used for: the real time duration be greater than preset time in the case where, in the database search with The audio file that the type information and location information match.
9. A system for adding an audio file, wherein the system comprises: an image acquisition device, a processor, and a storage device;
the image acquisition device is configured to acquire image information to be recognized;
and a computer program is stored on the storage device, the computer program, when run by the processor, executing the method according to any one of claims 1 to 7.
10. A computer-readable medium storing non-volatile program code executable by a processor, wherein the program code causes the processor to execute the method according to any one of claims 1 to 7.
CN201710958076.9A 2017-10-16 2017-10-16 Method, apparatus, system and computer-readable medium for adding an audio file Active CN107786549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710958076.9A CN107786549B (en) 2017-10-16 2017-10-16 Method, apparatus, system and computer-readable medium for adding an audio file

Publications (2)

Publication Number Publication Date
CN107786549A CN107786549A (en) 2018-03-09
CN107786549B true CN107786549B (en) 2019-10-29

Family

ID=61433752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710958076.9A Active CN107786549B (en) 2017-10-16 2017-10-16 Method, apparatus, system and computer-readable medium for adding an audio file

Country Status (1)

Country Link
CN (1) CN107786549B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108614995A (en) * 2018-03-27 2018-10-02 深圳市智能机器人研究院 Gesture dataset acquisition method, gesture recognition method and device for YOLO networks
CN110611776B (en) * 2018-05-28 2022-05-24 腾讯科技(深圳)有限公司 Special effect processing method, computer device and computer storage medium
CN109117746A (en) * 2018-07-23 2019-01-01 北京华捷艾米科技有限公司 Hand detection method and machine readable storage medium
CN109344755B (en) * 2018-09-21 2024-02-13 广州市百果园信息技术有限公司 Video action recognition method, device, equipment and storage medium
CN109492577B (en) * 2018-11-08 2020-09-18 北京奇艺世纪科技有限公司 Gesture recognition method and device and electronic equipment
CN109660739A (en) * 2018-11-13 2019-04-19 深圳艺达文化传媒有限公司 Superimposition method for short-video special effects and related product
CN111382644A (en) * 2018-12-29 2020-07-07 Tcl集团股份有限公司 Gesture recognition method and device, terminal equipment and computer readable storage medium
CN109828741A (en) * 2019-01-29 2019-05-31 北京字节跳动网络技术有限公司 Method and apparatus for playing audio
CN109889892A (en) * 2019-04-16 2019-06-14 北京字节跳动网络技术有限公司 Video effect adding method, device, equipment and storage medium
CN110059678A (en) * 2019-04-17 2019-07-26 上海肇观电子科技有限公司 Detection method, device and computer-readable storage medium
CN113840152A (en) * 2021-07-15 2021-12-24 阿里巴巴达摩院(杭州)科技有限公司 Live broadcast key point processing method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317398A (en) * 2014-10-15 2015-01-28 天津三星电子有限公司 Gesture control method, wearable equipment and electronic equipment
CN105451029A (en) * 2015-12-02 2016-03-30 广州华多网络科技有限公司 Video image processing method and device
CN106705837A (en) * 2015-11-17 2017-05-24 华为技术有限公司 Gesture-based object measurement method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8136724B1 (en) * 2011-06-24 2012-03-20 American Express Travel Related Services Company, Inc. Systems and methods for gesture-based interaction with computer systems
US9436382B2 (en) * 2012-09-18 2016-09-06 Adobe Systems Incorporated Natural language image editing
CN105094298B (en) * 2014-05-13 2018-06-26 华为技术有限公司 Terminal and gesture recognition method based on the terminal
CN106303662B (en) * 2016-08-29 2019-09-20 网易(杭州)网络有限公司 Image processing method and device in net cast

Also Published As

Publication number Publication date
CN107786549A (en) 2018-03-09

Similar Documents

Publication Publication Date Title
CN107786549B (en) Method, apparatus, system and computer-readable medium for adding an audio file
CN109819342B (en) Barrage content control method and device, computer equipment and storage medium
CN108012162B (en) Content recommendation method and device
CN110119700B (en) Avatar control method, avatar control device and electronic equipment
CN108525305B (en) Image processing method, image processing device, storage medium and electronic equipment
CN109688451B (en) Method and system for providing camera effect
US11520824B2 (en) Method for displaying information, electronic device and system
WO2017185630A1 (en) Emotion recognition-based information recommendation method and apparatus, and electronic device
CN104113784B (en) Intelligent television system and its method
US20220214797A1 (en) Virtual image control method, apparatus, electronic device and storage medium
KR101978299B1 (en) Apparatus for service contents in contents service system
CN110166842B (en) Video file operation method and device and storage medium
CN110705383A (en) Smoking behavior detection method and device, terminal and readable storage medium
CN107004122A (en) The instruction based on screenshot capture of side information
JP7231638B2 (en) Image-based information acquisition method and apparatus
CN108875667B (en) Target identification method and device, terminal equipment and storage medium
KR102365431B1 (en) Electronic device for providing target video in sports play video and operating method thereof
CN111862341A (en) Virtual object driving method and device, display equipment and computer storage medium
CN114513694A (en) Scoring determination method and device, electronic equipment and storage medium
CN111586432A (en) Method and device for determining air-broadcast live broadcast room, server and storage medium
CN113329260B (en) Live broadcast processing method and device, storage medium and electronic equipment
CN112235635B (en) Animation display method, animation display device, electronic equipment and storage medium
CN115390678B (en) Virtual human interaction method and device, electronic equipment and storage medium
CN111586329A (en) Information display method and device and electronic equipment
CN108874141B (en) Somatosensory browsing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant