CN110276349A - Video processing method and apparatus, electronic device, and storage medium - Google Patents

Video processing method and apparatus, electronic device, and storage medium

Info

Publication number
CN110276349A
Authority
CN
China
Prior art keywords
text
translated
video frame
translation
target area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910551225.9A
Other languages
Chinese (zh)
Other versions
CN110276349B (en)
Inventor
刘宁
王一棋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910551225.9A
Publication of CN110276349A
Application granted
Publication of CN110276349B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application relates to the field of data processing and discloses a video processing method and apparatus, an electronic device, and a storage medium. The method comprises: obtaining the current video frame of a video to be translated that is captured in real time; determining the target area of the text to be translated contained in the current video frame; determining the translation region that corresponds to the target area within the translation of the text to be translated; and covering the target area with the translation region and then displaying the current video frame. The video processing method provided by the embodiments of this application can translate the text in the video to be translated in real time, and the entire translation process requires no user operation, thereby improving the user experience.

Description

Video processing method and apparatus, electronic device, and storage medium
Technical field
This application relates to the field of data processing, and in particular to a video processing method and apparatus, an electronic device, and a storage medium.
Background
In the traditional picture translation approach, the user photographs the text to be translated with a mobile terminal, and the text in the photo is then recognized and translated. With this approach, the user must first tap to take a photo and then tap to translate before the translation of the text in the photo can be obtained. When switching to a new scene, what is displayed is still the previously taken photo; the user must tap to photograph and translate again before the text in the new scene can be recognized and translated. Existing picture translation approaches therefore cannot translate in real time as the captured scene changes; the operation is cumbersome and the user experience is poor.
Summary of the invention
The embodiments of this application provide a video processing method and apparatus, an electronic device, and a storage medium, which can translate the text in a captured video in real time as the captured scene changes.
In a first aspect, an embodiment of this application provides a video processing method, comprising:
obtaining the current video frame of a video to be translated that is captured in real time;
determining the target area of the text to be translated contained in the current video frame;
determining the translation region that corresponds to the target area within the translation of the text to be translated; and
covering the target area with the translation region and then displaying the current video frame.
Optionally, the method further comprises:
extracting feature points from the reference video frame used to determine the text to be translated, the reference video frame being any video frame that precedes the current video frame in the video to be translated; and
determining the mapping relations between the region corresponding to the text to be translated in the reference video frame and the positions of the feature points in the reference video frame.
Determining the target area of the text to be translated contained in the current video frame specifically comprises:
determining the positions of the feature points in the current video frame; and
determining the target area of the text to be translated in the current video frame according to the mapping relations corresponding to the text to be translated and the positions of the feature points in the current video frame.
Optionally, covering the target area with the translation region specifically comprises:
adjusting the size and shape of the translation region according to the target area; and
covering the target area with the adjusted translation region.
Optionally, the method further comprises:
extracting the background texture of the region corresponding to the text to be translated in the reference video frame; and
using the extracted background texture as the background texture of the translation.
Optionally, the method further comprises:
re-determining the text to be translated in response to a specified operation input by the user, or upon detecting that the mobile terminal capturing the video to be translated is in a specified motion state, or upon detecting that the picture change in the video to be translated satisfies a first preset condition.
Optionally, re-determining the text to be translated specifically comprises:
re-obtaining a reference video frame from the video to be translated, sending a translation request carrying the reference video frame to a server, and obtaining the text to be translated, determined from the reference video frame, that the server returns;
or, re-obtaining a reference video frame from the video to be translated and determining the text to be translated from the reference video frame.
Optionally, the text to be translated is determined from the reference video frame as follows:
if multiple text paragraphs are recognized in the reference video frame, merging the multiple text paragraphs according to a preset strategy and using the merged text paragraph as the text to be translated.
Optionally, merging the multiple text paragraphs according to the preset strategy and using the merged text paragraph as the text to be translated specifically comprises:
if the text parameters corresponding to adjacent text paragraphs in the reference video frame satisfy a second preset condition, merging the adjacent text paragraphs into one text paragraph and using the merged text paragraph as the text to be translated;
wherein the text parameters include at least one of the following: the position of the text paragraph in the reference video frame, the language corresponding to the text paragraph, the font corresponding to the text paragraph, the text size corresponding to the text paragraph, and the text color corresponding to the text paragraph.
Optionally, the text to be translated is determined from the reference video frame as follows:
recognizing at least one text paragraph in the reference video frame;
determining the language corresponding to each text paragraph; and
determining at least one text to be translated based on the text paragraphs whose language is not the target language, the target language being the language of the translation.
Optionally, determining the language corresponding to a text paragraph specifically comprises:
performing N-gram feature extraction on the text paragraph to obtain several text segments; and
inputting the several text segments corresponding to the text paragraph into a pre-trained language identification model to obtain the language corresponding to the text paragraph.
In a second aspect, an embodiment of this application provides a video processing apparatus, comprising:
an obtaining module, configured to obtain the current video frame of a video to be translated that is captured in real time;
a target area determining module, configured to determine the target area of the text to be translated contained in the current video frame;
a translation region determining module, configured to determine the translation region that corresponds to the target area within the translation of the text to be translated; and
a fusion module, configured to cover the target area with the translation region and then display the current video frame.
Optionally, the apparatus further comprises a mapping module, configured to:
extract feature points from the reference video frame used to determine the text to be translated, the reference video frame being any video frame that precedes the current video frame in the video to be translated; and
determine the mapping relations between the region corresponding to the text to be translated in the reference video frame and the positions of the feature points in the reference video frame.
The target area determining module is specifically configured to:
determine the positions of the feature points in the current video frame; and
determine the target area of the text to be translated in the current video frame according to the mapping relations corresponding to the text to be translated and the positions of the feature points in the current video frame.
Optionally, the fusion module is specifically configured to:
adjust the size and shape of the translation region according to the target area; and
cover the target area with the adjusted translation region.
Optionally, the apparatus further comprises a texture processing module, configured to:
extract the background texture of the region corresponding to the text to be translated in the reference video frame; and
use the extracted background texture as the background texture of the translation.
Optionally, the apparatus further comprises a text recognition module, configured to:
re-determine the text to be translated in response to a specified operation input by the user, or upon detecting that the mobile terminal capturing the video to be translated is in a specified motion state, or upon detecting that the picture change in the video to be translated satisfies a first preset condition.
Optionally, the text recognition module is specifically configured to:
re-obtain a reference video frame from the video to be translated, send a translation request carrying the reference video frame to a server, and obtain the text to be translated, determined from the reference video frame, that the server returns;
or, re-obtain a reference video frame from the video to be translated and determine the text to be translated from the reference video frame.
Optionally, the text recognition module is specifically configured to:
if multiple text paragraphs are recognized in the reference video frame, merge the multiple text paragraphs according to a preset strategy and use the merged text paragraph as the text to be translated.
Optionally, the text recognition module is specifically configured to:
if the text parameters corresponding to adjacent text paragraphs in the reference video frame satisfy a second preset condition, merge the adjacent text paragraphs into one text paragraph and use the merged text paragraph as the text to be translated;
wherein the text parameters include at least one of the following: the position of the text paragraph in the reference video frame, the language corresponding to the text paragraph, the font corresponding to the text paragraph, the text size corresponding to the text paragraph, and the text color corresponding to the text paragraph.
Optionally, the text recognition module is specifically configured to:
recognize at least one text paragraph in the reference video frame;
determine the language corresponding to each text paragraph; and
determine at least one text to be translated based on the text paragraphs whose language is not the target language, the target language being the language of the translation.
Optionally, the apparatus further comprises a language identification module, configured to:
perform N-gram feature extraction on the text paragraph to obtain several text segments; and
input the several text segments corresponding to the text paragraph into a pre-trained language identification model to obtain the language corresponding to the text paragraph.
In a third aspect, an embodiment of this application provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any of the above methods.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of any of the above methods.
In a fifth aspect, an embodiment of this application provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, implement the steps of any of the above methods.
The technical solution provided by the embodiments of this application runs while the mobile terminal is shooting video. During this process, the user only needs to bring the text to be translated into the capture area of the mobile terminal's video capture device to see the corresponding translation in the video; the entire translation process requires no user operation, which improves the user experience. After the mobile terminal has obtained the text to be translated and the corresponding translation, it keeps tracking the text to be translated, which lets it determine the target area of the text in each subsequent video frame and cover that target area with the corresponding translation region, without repeated recognition and translation, thereby improving translation efficiency. Moreover, even if the mobile terminal moves during translation, the translation follows the position of the text to be translated in the video and is displayed at the corresponding position in real time, ensuring that the translation blends well with the real scene and making the display effect more lifelike.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of this application more clearly, the drawings needed in the embodiments of this application are briefly described below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the video processing method provided by the embodiments of this application;
Fig. 2 is a schematic flowchart of a video processing method provided by an embodiment of this application;
Fig. 3 is a schematic flowchart of a video processing method provided by an embodiment of this application;
Fig. 4 is a schematic flowchart, provided by an embodiment of this application, of determining the target area of the text to be translated in the current video frame;
Fig. 5A is a schematic diagram of the positions of the feature points and the text to be translated in the reference video frame;
Fig. 5B is a schematic diagram of the positions of the feature points and the text to be translated in a certain video frame;
Fig. 5C is a schematic diagram of the positions of the feature points and the text to be translated in a certain video frame;
Fig. 6 is a schematic diagram of determining the translation region that corresponds to the target area within the translation of the text to be translated;
Fig. 7 is a schematic diagram of covering the target area with the translation region and then displaying the current video frame;
Fig. 8A is a schematic diagram of determining multiple text paragraphs from the current video frame;
Fig. 8B is a schematic diagram of a video frame fused with the translation;
Fig. 9 is a schematic structural diagram of the video processing apparatus provided by an embodiment of this application;
Fig. 10 is a schematic structural diagram of the electronic device provided by an embodiment of this application.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application.
For ease of understanding, the terms involved in the embodiments of this application are explained below:
Augmented reality (AR) is a technology that calculates the position and angle of the camera image in real time and adds corresponding images, videos, or 3D models; its goal is to place the virtual world into the real world on the screen and enable interaction between the two.
Virtual reality (VR) technology comprehensively uses computer graphics systems and various display and control interface devices to provide an interactive, immersive experience in an interactive three-dimensional environment generated on a computer.
OCR (Optical Character Recognition) refers to the process of converting the text in an image into computer text by character recognition methods.
ORB (Oriented FAST and Rotated BRIEF) is a feature detection operator proposed on the basis of the well-known FAST keypoint detector and the BRIEF feature descriptor. Its running time is far better than that of SIFT and SURF, so it can be applied to real-time feature detection. ORB feature detection is scale- and rotation-invariant and is also invariant to noise and perspective transforms; it performs well, and it is very widely used for feature description. ORB feature detection mainly consists of two steps: oriented FAST keypoint detection and BRIEF feature description.
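As an illustration beyond the patent text, the following minimal Python sketch shows how ORB feature points could be extracted from a reference video frame with OpenCV; the file path is a hypothetical placeholder.

```python
import cv2

# Load the reference video frame in grayscale (hypothetical path).
ref_frame = cv2.imread("reference_frame.png", cv2.IMREAD_GRAYSCALE)

# ORB = oriented FAST keypoint detection + rotated BRIEF description.
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(ref_frame, None)

# Each keypoint carries its (x, y) coordinates in the reference frame,
# from which the mapping relations to the text region can be built.
points = [kp.pt for kp in keypoints]
```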
N-gram is an algorithm based on statistical language models, also known as a first-order Markov chain. Its basic idea is to slide a window of size N over the content of a text, byte by byte, forming a sequence of byte segments of length N. Each byte segment is called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form a list of key grams, which constitutes the feature vector space of the text. Each gram in the list is one feature vector dimension.
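Purely as an illustration, a minimal sketch of the sliding-window gram counting described above (the window size and frequency threshold are assumed example values, not values from the patent):

```python
from collections import Counter

def ngram_features(text: str, n: int = 2, min_count: int = 1) -> dict:
    """Slide a window of size n over the text and count each gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    # Filter against a preset threshold; the surviving key grams form
    # the feature vector dimensions of this text.
    return {g: c for g, c in counts.items() if c >= min_count}

print(ngram_features("video translation", n=2))
```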
Perspective transformation is the projection of a picture onto a new viewing plane, and is also called projective mapping.
The number of any elements in the drawings is illustrative rather than limiting, and any naming is used only for distinction and has no limiting meaning.
In practice, the user points the camera of the mobile terminal at the text to be translated and taps to take a photo; the text in the photo is then recognized, and the user taps to translate. Thus, in the process of translating with the mobile terminal, the user has to perform a corresponding operation at each stage, for example tapping the corresponding button, and when switching to a new scene, the user has to perform the above operations once again before the text in the new scene can be recognized and translated. Existing picture translation approaches therefore cannot translate in real time as the captured scene changes, which makes the translation workflow cumbersome and the user experience poor.
To this end, the inventors of this application considered first determining the text to be translated from the video to be translated captured by the mobile terminal and determining its corresponding translation, then tracking the text to be translated in the video: the region of the text to be translated in the current video frame is determined as the target area; the translation region that corresponds to the target area within the translation of the text, i.e., the translation region that needs to be displayed in the current video frame, is then determined; and finally the target area in the current video frame is covered with the translation region and the current video frame is displayed. The video processing method in the embodiments of this application runs while the mobile terminal is shooting video; during this process, the user only needs to bring the text to be translated into the capture area of the video capture device of the mobile terminal to see the corresponding translation in the video, and the entire translation process requires no user operation, improving the user experience. In addition, with this translation approach, the text to be translated in the video only needs to be recognized and translated once; the recognized text is then tracked to determine its target area in the current video frame, and the target area is covered with the corresponding translation region, without repeated recognition and translation, which improves translation efficiency. Moreover, even if the mobile terminal moves during translation, the translation follows the position change of the text to be translated in the video and is displayed at the corresponding position in real time, ensuring that the translation blends well with the real scene and making the display effect more lifelike.
In addition, the inventors of this application also found that, when recognizing the text to be translated in a picture, existing picture translation approaches treat each line of text in the picture as one text to be translated and then translate each text separately to obtain its translation. In practice, owing to the layout of the text, the same sentence or even the same word may not lie on the same line in the picture; in that case the same sentence or word is recognized as different texts to be translated, the context cannot be used during translation, and the semantics are misunderstood, reducing translation accuracy or even causing translation to fail.
To this end, the inventors of this application considered merging the multiple text paragraphs recognized from the video to be translated before translation: the text paragraphs that satisfy a preset strategy are merged into one text paragraph, which is taken as one text to be translated, and the determined text to be translated is then translated. In this way the translation can make full use of the context, improving translation accuracy.
Having introduced the design concept of the embodiments of this application, the application scenarios to which the technical solutions of the embodiments of this application are applicable are briefly introduced below. Note that the application scenarios described below are merely illustrative and not limiting; in specific implementation, the technical solutions provided by the embodiments of this application can be applied flexibly according to actual needs.
Fig. 1 is a schematic diagram of an application scenario of the video processing method provided by the embodiments of this application. As shown in Fig. 1, the scenario may include a mobile terminal 101 and a server 102, where the server 102 can be regarded as a background server providing a picture translation service; the server 102 may be a single server, a server cluster composed of several servers, or a cloud computing center. The mobile terminal 101 is installed with an application program supporting any of the video processing methods in the embodiments of this application; the server 102 can communicate with the mobile terminal over the Internet, and the application program can interact with the server 102 over the Internet to obtain the picture translation service of the server 102. The user starts the application program on the mobile terminal 101, enters the translation interface, selects the target language, and shoots the text the user wants translated by invoking the video capture device of the mobile terminal 101, obtaining the video to be translated. The application program extracts a frame from the video to be translated as the reference video frame and sends a translation request carrying the reference video frame to the server 102. The server 102 determines the text to be translated from the reference video frame, translates the text to obtain the corresponding translation, and returns the text to be translated and the translation to the mobile terminal 101. After receiving the text to be translated and the translation, the mobile terminal 101 tracks the text to be translated in the video to be translated, determines the target area of the text in the current video frame, determines from the target area the translation region that needs to be displayed in the current video frame, covers the target area in the current video frame with the translation region, and then displays the current video frame. In this way, the user can see, on the display screen of the mobile terminal, a video containing the translation corresponding to the text to be translated.
In addition, the mobile terminal 101 can also implement offline translation. The user starts the application program on the mobile terminal 101, enters the translation interface, selects the target language, and shoots the text to be translated by invoking the video capture device of the mobile terminal 101, obtaining the video to be translated. The application program extracts a frame from the video to be translated as the reference video frame, determines the text to be translated from the reference video frame, and translates the text to obtain the corresponding translation. It then tracks the text to be translated in the video to be translated, determines the target area of the text in the current video frame, determines from the target area the translation region that needs to be displayed in the current video frame, covers the target area in the current video frame with the translation region, and displays the current video frame on the display screen of the mobile terminal. In this way, offline translation is achieved even without a network.
Note that the video to be translated mentioned in the embodiments of this application may be the video being obtained while the mobile terminal is shooting, or a video that the mobile terminal has finished shooting. Specifically, a device with a video capture function in the mobile terminal, for example a camera or another image sensing device, may be invoked to obtain the video.
In the application scenario shown in Fig. 1, the mobile terminal may be a device equipped with a video capture function, for example an electronic device such as a smartphone, tablet computer, e-book reader, MP4 (Moving Picture Experts Group Audio Layer IV) player, smart TV, smartwatch, smart glasses, smart bracelet, or laptop computer equipped with a camera or similar; the camera here is understood as a device with a video capture function, such as a camera module with a lens.
The method provided by the embodiments of this application can also be applied to AR devices, such as AR glasses. An AR device can install an application program implementing any of the video processing methods in the embodiments of this application. When the user uses the AR device, its video capture apparatus captures, in real time, the video corresponding to the real scene around the user; a frame is extracted from the video as the reference video frame, the text to be translated contained in the real scene is determined from the reference video frame, and the text is translated to obtain the corresponding translation. The text to be translated is then tracked in the video, its target area in the current video frame is determined, the translation region that needs to be displayed in the current video frame is determined from the target area, and the target area in the current video frame is covered with the translation region; finally, the current video frame fused with the translation is shown to the user through the display apparatus of the AR device. In this way, while using the AR device, the user is given a translation experience that is real-time and closely fused with the real scene.
The method provided by the embodiments of this application can also be applied to VR devices, such as VR glasses or VR headsets. When the user uses a VR device, before a VR picture is shown to the user, the VR picture currently to be displayed is first processed with any of the video processing methods in the embodiments of this application. The specific process includes: tracking the text to be translated in the current VR picture to be displayed, determining the target area of the text in that picture, determining from the target area the translation region that needs to be displayed in that picture, covering the target area in the picture with the translation region, and finally showing the user, through the display apparatus of the VR device, the VR picture fused with the translation. In this way, while using the VR device, the user is given a translation experience that is real-time and closely fused with the scene. For example, some VR videos or VR games are only available in English, so users who do not understand English cannot understand the English in the video or game and cannot fully enjoy the immersive experience the VR device provides; with the video processing method of the embodiments of this application, VR videos or VR games in various languages can be translated in real time, which is convenient for users from different countries.
Of course, the method provided by the embodiments of this application is not limited to the application scenarios listed above and can also be used in other possible application scenarios; the embodiments of this application impose no limitation. The functions that each device in the application scenario shown in Fig. 1 can implement are described together in the subsequent method embodiments and are not elaborated here.
To further illustrate the technical solutions provided by the embodiments of this application, they are described in detail below with reference to the drawings and specific embodiments. Although the embodiments of this application provide the method operation steps shown in the following embodiments or drawings, the method may, by routine practice and without creative effort, include more or fewer operation steps. For steps that have no necessary causal relationship logically, the execution order is not limited to the execution order provided by the embodiments of this application.
The technical solutions provided by the embodiments of this application are described below with reference to the application scenario shown in Fig. 1.
Referring to Fig. 2, an embodiment of this application provides a video processing method, comprising the following steps:
S201: in response to a specified operation input by the user, the mobile terminal obtains the video to be translated captured in real time.
In the embodiments of this application, the specified operation is the operation that starts translation, for example tapping the "start translation" button on the application interface. After the user inputs the specified operation, the mobile terminal immediately invokes the video capture device and captures the video to be translated. The user can also set the target language, i.e., the language of the translation, in advance through the application interface, and the application program stores the target language set by the user.
In the embodiments of this application, the entire region that the video capture device of the mobile terminal can capture may serve as the viewfinder area; that is, the picture of the obtained video to be translated may be the entire capture area of the video capture device, equivalent to using the whole display screen of the mobile terminal as the viewfinder. In this way, the text appearing anywhere in the picture shown on the display screen of the mobile terminal can be translated: all of these texts to be translated can be translated at once, each obtaining its corresponding translation.
S202: the mobile terminal obtains a reference video frame from the video to be translated and sends a translation request carrying the reference video frame to the server.
In specific implementation, if the mobile terminal moves quickly, the video to be translated captured by its video capture device is relatively blurry, which makes recognizing the text to be translated in the video harder and affects the speed and accuracy of recognition. For this reason, the motion state of the mobile terminal can be obtained first: for example, the motion data of the mobile terminal can be acquired through the terminal's built-in sensors and then analyzed and processed to determine the motion state of the terminal. For example, when the acquired motion data varies only slightly (i.e., the mobile terminal moves slowly), the mobile terminal can be considered to be in a stable motion state; conversely, when the acquired motion data varies greatly (i.e., the mobile terminal moves quickly), it can be considered to be in an unstable motion state. After the mobile terminal is determined to be in a stable motion state, a video frame is obtained from the video to be translated as the reference video frame.
In specific implementation, the sharpness of the video frames in the video to be translated can also be determined first, and a frame with higher sharpness selected as the reference video frame. The picture stability of the video to be translated can also be determined from the differences among multiple consecutive video frames: when the differences among multiple consecutive frames are small, the picture of the video to be translated can be considered more stable (the mobile terminal is moving slowly); conversely, when the differences among multiple consecutive frames are large, the picture stability can be considered lower (the mobile terminal is moving quickly). A frame is then selected as the reference video frame from the multiple consecutive frames with small differences, as in the sketch below.
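A minimal sketch of this frame-stability check, assuming the mean absolute inter-frame difference as the measure and an arbitrary example threshold (neither is specified in the patent):

```python
import cv2
import numpy as np

def pick_reference_frame(frames, diff_threshold=4.0):
    """Return a frame from a run of consecutive frames whose mean
    inter-frame difference is small, i.e. the picture is stable."""
    for prev, curr in zip(frames, frames[1:]):
        g0 = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
        mean_diff = float(np.mean(cv2.absdiff(g0, g1)))
        if mean_diff < diff_threshold:
            return curr  # stable enough to serve as the reference frame
    return None  # no stable run found; keep sampling
```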
To guarantee both the speed of picture transmission and the quality of translation, the picture sent by the mobile terminal is sampled and compressed according to the OCR engine's limits on recognizable pictures, so that as little data as possible is transmitted over the network while the picture translation quality is still guaranteed.
S203: after receiving the translation request, the server determines the text to be translated and the translation corresponding to the text to be translated from the reference video frame, and sends the text to be translated and the corresponding translation to the mobile terminal.
The target language in the embodiments of this application can be the text of any language, for example Chinese, English, or Japanese. The text to be translated mentioned in the embodiments of this application can be text in any language other than the target language.
In specific implementation, the server can recognize the text contained in the reference video frame through an OCR engine as the text to be translated, and then translate the text to be translated into text in the target language through a translation engine to obtain the corresponding translation. The OCR engine and the translation engine can be implemented with existing techniques, and the detailed process is not repeated here. The reference video frame may contain one or more texts to be translated; when multiple texts to be translated are recognized, all of them can be translated at once, each obtaining its corresponding translation.
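For illustration only, a sketch of this step using the open-source Tesseract OCR engine via pytesseract; the patent does not name any particular engine, and translate_text below is a hypothetical stand-in for whatever translation engine is used, not a real library call:

```python
import pytesseract
from PIL import Image

def recognize_and_translate(frame_path: str, target_lang: str = "zh"):
    """OCR the reference frame, then translate the recognized text."""
    image = Image.open(frame_path)
    source_text = pytesseract.image_to_string(image)
    # Hypothetical translation engine call (not a real API).
    translation = translate_text(source_text, target=target_lang)
    return source_text, translation
```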
In specific implementation, step S203 can also be executed by the mobile terminal: an OCR engine and a translation engine are embedded in the mobile terminal, and after obtaining the reference video frame from the video to be translated, the terminal directly invokes the local OCR engine to recognize the text to be translated from the reference video frame, and then invokes the local translation engine to obtain the corresponding translation. In this way offline translation can be realized, ensuring that the user still receives the translation service in time, through offline translation, when the network between the mobile terminal and the server is abnormal. When the network is normal, the user can also choose the offline translation mode to obtain translation results quickly and save network traffic. Since the computing power of the mobile terminal is lower than that of the server, the accuracy of the translations obtained in offline translation mode is lower; to guarantee translation accuracy, in offline translation mode the terminal can periodically send translation requests to the server and update the results obtained in offline translation mode after obtaining the translation results returned by the server.
S204: after receiving the text to be translated and the corresponding translation, the mobile terminal obtains the current video frame of the video to be translated captured in real time.
In the embodiments of this application, the current video frame is the video frame currently captured by the video capture device; after being processed according to the method of steps S205-S207, it is shown to the user in real time on the display screen of the mobile terminal.
S205: the mobile terminal determines the target area of the text to be translated contained in the current video frame.
S206: the mobile terminal determines the translation region that corresponds to the target area within the translation of the text to be translated.
S207: the mobile terminal covers the target area with the translation region and then displays the current video frame.
According to the method of steps S204-S207, the mobile terminal processes, frame by frame, the video frames captured after the text to be translated is received, fuses the translation corresponding to the text to be translated into each video frame containing the text, and displays the fused video frames.
The video processing method in the embodiments of this application runs while the mobile terminal is shooting video. During this process, the user only needs to bring the text to be translated into the capture area of the video capture device of the mobile terminal to see the corresponding translation in the video; the entire translation process requires no user operation, which improves the user experience. After the mobile terminal has obtained the text to be translated and the corresponding translation, it keeps tracking the text to be translated, which lets it determine the target area of the text in each subsequent video frame and cover that target area with the corresponding translation region, without repeated recognition and translation, improving translation efficiency. Moreover, even if the mobile terminal moves during translation, the translation follows the position of the text to be translated in the video and is displayed at the corresponding position in real time, ensuring that the translation blends well with the real scene and making the display effect more lifelike.
Building on the method shown in Fig. 2, the method executed on the mobile terminal side is described in detail below.
Referring to Fig. 3, an embodiment of this application provides a video processing method, applied to the mobile terminal shown in Fig. 1, comprising the following steps:
S301: obtain the current video frame of the video to be translated captured in real time.
S302: determine the target area of the text to be translated contained in the current video frame.
The text to be translated in this step can be recognized by the OCR engine before step S301; for the specific method, refer to the specific implementation of step S203, which is not repeated here. In addition, the position of the text to be translated in the reference video frame, for example its coordinate values in the reference video frame, can also be obtained with the OCR engine, and the region corresponding to the text to be translated in the reference video frame is determined from the coordinate values obtained by the OCR engine. The region corresponding to a text to be translated recognized with an OCR engine is generally a rectangle, so the embodiments of this application are described mainly with the rectangle as an example; this does not exclude recognized regions of other shapes, such as trapezoids, and the processing of other shapes is not described one by one here. The reference video frame is a video frame chosen from the video frames preceding the current video frame in the video to be translated; refer to the specific implementation of step S202.
In specific implementation, feature points can be extracted in advance from the reference video frame used to determine the text to be translated, and the mapping relations between the region corresponding to the text to be translated in the reference video frame and the positions of the feature points in the reference video frame can be determined. For convenience of description, the region corresponding to the text to be translated in the reference video frame is hereinafter referred to as the first region. The mapping relations can be the relative positional relationships between the coordinate values of the feature points and the coordinate values that determine the first region; if the objects in the captured scene do not move, these relative positional relationships do not change. When the coordinate values of the feature points change in a video frame, the corresponding coordinate values of the first region in that video frame can be uniquely determined from the mapping relations, and hence the region corresponding to the first region in that video frame. Each text to be translated determined from the reference video frame uniquely corresponds to one set of mapping relations.
In practical application, the feature points can be extracted from the reference video frame by any existing feature point extraction algorithm; the embodiments of this application impose no limitation on this. Examples include the SIFT (Scale-Invariant Feature Transform) algorithm and the Oriented FAST and Rotated BRIEF (ORB) algorithm.
As shown in Fig. 4, based on the feature points extracted in advance and the determined mapping relations, step S302 specifically comprises the following steps:
S3021: determine the positions of the feature points in the current video frame.
In practical application, any existing tracking algorithm can be used to track the feature points, for example an optical flow tracking algorithm or a template matching algorithm, so as to determine the position, i.e., the coordinate values, of each feature point of the reference video frame in any video frame; a sketch is given below. Specific implementations are not limited to the tracking algorithms listed above.
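A minimal sketch of the optical flow option, using OpenCV's pyramidal Lucas-Kanade tracker (function and variable names are illustrative):

```python
import cv2
import numpy as np

def track_points(ref_gray, curr_gray, ref_points):
    """Track reference-frame feature points into the current frame.

    ref_points: float32 array of shape (N, 1, 2) holding the feature
    point coordinates in the reference frame.
    """
    curr_points, status, _err = cv2.calcOpticalFlowPyrLK(
        ref_gray, curr_gray, ref_points, None)
    ok = status.flatten() == 1  # keep only successfully tracked points
    return ref_points[ok], curr_points[ok]
```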
S3022: determine the target area of the text to be translated in the current video frame according to the mapping relations corresponding to the text to be translated and the positions of the feature points in the current video frame.
In this step, based on the positions of the feature points in the current video frame and the mapping relations corresponding to the text to be translated, the coordinate values of the text to be translated in the current video frame can be uniquely determined, and from these coordinate values the region of the text in the current video frame, i.e., the target area, can be determined. When there are multiple texts to be translated, the target area corresponding to each text is determined based on the mapping relations corresponding to that text.
As shown in Fig. 5A, the reference video frame P0 contains the feature points X1, X2, ..., X6 and the first region 501 corresponding to the text to be translated. The first region obtained with an OCR engine is generally a rectangle, so the position of the first region in the reference video frame P0 can be uniquely determined by the coordinate values of the four vertices Q1, Q2, Q3, and Q4 of the first region 501. The mapping relations corresponding to the text to be translated, i.e., the mapping relations between the first region and the coordinate values of the feature points, can specifically include: the coordinate relationships between Q1 and X1, X2, ..., X6; between Q2 and X1, X2, ..., X6; between Q3 and X1, X2, ..., X6; and between Q4 and X1, X2, ..., X6.
Fig. 5B shows a certain video frame P1 after the reference video frame P0. Here the mobile terminal has moved, so the position of the text to be translated in frame P1 has moved relative to the reference frame P0. The tracking algorithm determines the coordinate values of X1, X2, ..., X6 in frame P1; then, according to the mapping relations corresponding to the text to be translated and the coordinate values of X1, X2, ..., X6 in frame P1, the coordinate values of Q1, Q2, Q3, and Q4 in frame P1 are determined, and the region they enclose in frame P1 is the target area. When the shooting angle of the video capture device of the mobile terminal changes, causing the text to be translated to move, rotate, and deform in the video frame, the tracking algorithm likewise determines the coordinate values of X1, X2, ..., X6 in that frame; then, according to the mapping relations corresponding to the text to be translated and those coordinate values, the coordinate values of Q1, Q2, Q3, and Q4 in that frame are determined, and the region enclosed by them is the target area.
When the mobile terminal moves over a larger range, part of the text to be translated may fall outside the capture area of the video capture device. Fig. 5C shows a certain video frame P2 after the reference video frame P0; frame P2 contains only part of the text to be translated and some of the feature points (including X1, X2, and X6). The tracking algorithm can still determine the coordinate values of X1, X2, and X6 in frame P2; then, according to the mapping relations corresponding to the text to be translated and the coordinate values of X1, X2, and X6 in frame P2, the coordinate values of Q1, Q2, Q3, and Q4 in frame P2 are determined. Note that the coordinate values of Q2 and Q4 determined here exceed the display range of frame P2, so the intersection of the region enclosed by Q1, Q2, Q3, and Q4 in frame P2 with frame P2 is taken as the target area.
In practical application, whether the mobile terminal is moved, rotated, or tilted, the target area of the text to be translated in the current video frame can be determined in the above manner; one concrete realization is sketched below.
The above is only an illustrative example; in practical application, to improve tracking accuracy, the number of feature points extracted from the reference video frame is far greater than in the above example.
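One concrete way to realize the mapping relations is a homography estimated from the tracked point pairs, which maps the corners Q1..Q4 of the first region into the current frame. This is an assumption on our part, since the patent only speaks of relative positional relationships between coordinate values:

```python
import cv2
import numpy as np

def locate_target_area(ref_pts, curr_pts, region_corners):
    """Map the first region's corners into the current video frame.

    ref_pts / curr_pts: matched feature points in the reference and
    current frames, shape (N, 1, 2), float32 (e.g. from track_points).
    region_corners: the four corners Q1..Q4 of the first region in the
    reference frame, shape (4, 1, 2), float32.
    """
    H, _mask = cv2.findHomography(ref_pts, curr_pts, cv2.RANSAC, 3.0)
    # Applying H to the corners gives the target area in the current
    # frame; clipping it to the frame bounds handles the Fig. 5C case.
    return cv2.perspectiveTransform(region_corners, H)
```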
S303: determine the translation region that corresponds to the target area within the translation of the text to be translated.
In specific implementation, after the translation of the text to be translated is obtained, a texture patch corresponding to the translation can be generated to ease subsequent processing; the corresponding translation region can then be cut directly from the patch according to the target area, and the cut translation region used to cover the target area in the current video frame. Specifically, the size and shape of the patch corresponding to the translation of the text to be translated can be determined from the first region of the text in the reference video frame; that is, the size and shape of the patch corresponding to the translation are consistent with those of the first region, and the translation is added to the patch.
In specific implementation, referring to the situation shown in Fig. 5B, if the region enclosed by the coordinate values of the text to be translated in the current video frame, determined from the mapping relations corresponding to the text and the positions of the feature points in the current frame, lies within the display range of the current video frame, the translation region is the entire patch corresponding to the translation. If part of the region enclosed by those coordinate values lies outside the display range of the current video frame, the corresponding translation region is cut from the patch according to the part of the text to be translated within the display range of the current frame. Referring to the situation shown in Fig. 6, the vertices Q1, Q2, Q3, Q4 of the translation patch 601 correspond to the points Q1', Q2', Q3', Q4' in the current video frame 602, where Q3' and Q4' exceed the display range of the current frame 602. The target area is therefore the intersection of the region enclosed by Q1', Q2', Q3', Q4' with the current frame 602, i.e., the region enclosed by Q1', Q5', Q6', Q4'. Then, based on the geometric relationship between the patch 601 and the target area, the points Q5 and Q6 in the patch 601 corresponding to Q5' and Q6' are determined, and the region in the patch 601 enclosed by Q1, Q5, Q6, Q4 is the translation region. In Fig. 6, to show the relationship between the patch 601 and the target area clearly, the translation in the patch 601 and the text to be translated in the target area are omitted.
S304: cover the target area with the translation region and then display the current video frame.
In specific implementation, the target area is covered with the translation region in the following manner: adjust the size and shape of the translation region according to the target area, and cover the target area with the adjusted translation region.
Specifically, a perspective transformation matrix can be determined from the coordinate values of the target area and the coordinate values of the translation region; the translation region is then perspective-transformed according to the perspective transformation matrix, i.e., scaled, rotated, deformed, and so on, so that the size and shape of the perspective-transformed translation region are consistent with those of the target area, and the target area in the current video frame is covered with the perspective-transformed translation region.
Referring to Fig. 7, after the target area 702 in the current video frame 701 is determined, the translation region 703 corresponding to the target area 702 is determined; the perspective transformation matrix is then obtained from the target area 702 and the translation region 703, and the translation region 703 is deformed, rotated, and scaled based on the perspective transformation matrix to obtain the transformed translation region 704, whose size and shape are consistent with those of the target area 702. Finally, the target area 702 is covered with the transformed translation region 704, yielding the current video frame 705 containing the translation.
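An illustrative sketch of this overlay step with OpenCV, under the assumption that the target area is given as four corner points matched to the patch corners:

```python
import cv2
import numpy as np

def overlay_translation(frame, patch, target_corners):
    """Warp the translation patch onto the target area of the frame.

    target_corners: float32 array of shape (4, 2), the target area
    corners in the current frame, ordered like the patch corners.
    """
    h, w = patch.shape[:2]
    patch_corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(patch_corners, target_corners)
    size = (frame.shape[1], frame.shape[0])
    warped = cv2.warpPerspective(patch, M, size)
    # Warp an all-white mask the same way, then copy the warped patch
    # pixels over the frame wherever the mask is set.
    mask = cv2.warpPerspective(np.full((h, w), 255, np.uint8), M, size)
    frame[mask > 0] = warped[mask > 0]
    return frame
```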
According to the method corresponding to steps S301 to S304, the mobile terminal processes, frame by frame, the video frames collected after the text to be translated is determined, fuses the translation corresponding to the text to be translated into each video frame containing the text to be translated, and displays the fused video frames, so that the user can watch, in real time, a video picture containing the translation on the display screen of the mobile terminal.
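A sketch of the resulting per-frame loop on the terminal side; the `tracker` object and its `locate_target`/`patch_region` methods are hypothetical placeholders for the feature-point tracking described above, and `overlay_translation` refers to the helper sketched earlier.

```python
import cv2

def run_pipeline(capture, tracker, patch, display):
    """Per-frame loop: track the feature points, locate the target area,
    overlay the translation patch, and display the fused frame."""
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        target_quad = tracker.locate_target(frame)   # hypothetical tracker API
        if target_quad is not None:
            patch_quad = tracker.patch_region(target_quad)
            frame = overlay_translation(frame, patch, patch_quad, target_quad)
        display(frame)

# e.g.: run_pipeline(cv2.VideoCapture(0), tracker, patch,
#                    lambda f: (cv2.imshow("translated", f), cv2.waitKey(1)))
```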
Further, in order to make the translation blend seamlessly with the original background in the video to be translated, so that the final display effect is more lifelike, the video processing method of the embodiment of the present application further includes the following steps: extracting the background texture of the region corresponding to the text to be translated in the reference video frame; and determining the extracted background texture as the background texture corresponding to the translation.
In specific implementation, the text part and the background part in the first region can be recognized by means of an OCR engine, the text part corresponding to the text to be translated. The color or pattern of the background part is extracted, the texture corresponding to the translation is filled with the extracted color or pattern, and the translation is then rendered into the texture to obtain the texture corresponding to the translation. Based on the above method, the finally obtained video frame fused with the translation better retains the background texture features of the text to be translated in the original video frame, so that the translation blends seamlessly with the original background of the video, improving the fidelity of the display effect.
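For illustration, a minimal sketch of building such a texture, under the simplifying assumption that the background is approximated by the median color of the non-text pixels (rather than a full pattern extraction) and that a boolean text mask from the OCR engine is available; the font file path is an assumption.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def build_translation_patch(region_bgr, text_mask, translation,
                            text_color=(255, 0, 0),
                            font_path="NotoSans-Regular.ttf"):  # assumed font
    """region_bgr: HxWx3 crop of the text region in the reference frame.
    text_mask:  HxW bool array, True on text pixels (from the OCR engine)."""
    h, w = region_bgr.shape[:2]
    # Median color of the non-text pixels approximates the background.
    bg = np.median(region_bgr[~text_mask], axis=0).astype(np.uint8)
    patch = Image.new("RGB", (w, h), tuple(int(c) for c in bg[::-1]))  # BGR->RGB
    draw = ImageDraw.Draw(patch)
    font = ImageFont.truetype(font_path, size=max(int(h * 0.8), 8))
    draw.text((2, 0), translation, fill=text_color, font=font)
    return np.array(patch)[:, :, ::-1].copy()  # back to BGR for OpenCV use
```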
Further, parameters such as the color and font of the text corresponding to the text to be translated can also be identified from the reference video frame; when the translation corresponding to the text to be translated is rendered into the texture, the color, font and other parameters of the translation displayed in the texture can be determined based on the color, font and other parameters of the text corresponding to the text to be translated. For example, if the text corresponding to the text to be translated is red and set in the Song typeface, the translation rendered into the texture is also red and set in the Song typeface.
Based on any of the above embodiments, the video processing method of the embodiment of the present application further includes the following step: when it is detected that the picture change in the video to be translated meets a first preset condition, the text to be translated is redetermined.
In specific implementation, the picture change in the video to be translated can be detected by counting the feature points contained in the current video frame: the fewer the feature points contained in the current video frame, the smaller the overlap between the current video frame and the reference video frame, and the current video frame may then contain newly appearing text that needs to be translated. At this point, a video frame can be redetermined from the currently acquired video to be translated as a new reference video frame, and the text to be translated is determined from the new reference video frame; the specific method can refer to steps S202 and S203.
Specifically, assuming that the number of feature points extracted from the reference video frame is a first value, the number of feature points contained in the current video frame is detected to obtain a second value; if the ratio of the second value to the first value is less than a preset threshold, the text to be translated is redetermined. The preset threshold can be determined according to application requirements, for example 40% or 50%, which is not limited by the embodiment of the present application.
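A sketch of this trigger condition, choosing ORB features and brute-force matching as one concrete possibility (the embodiment does not fix the feature type); surviving matches stand in for the feature points of the reference frame still contained in the current frame, and 0.4 reflects the 40% example above.

```python
import cv2

orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def needs_new_reference(ref_gray, cur_gray, threshold=0.4):
    """True when too few of the reference frame's feature points survive in
    the current frame, i.e. the picture change meets the first condition."""
    kp_ref, des_ref = orb.detectAndCompute(ref_gray, None)
    kp_cur, des_cur = orb.detectAndCompute(cur_gray, None)
    if des_ref is None or des_cur is None or len(kp_ref) == 0:
        return True
    matches = matcher.match(des_ref, des_cur)       # surviving feature points
    return len(matches) / len(kp_ref) < threshold   # second value / first value
```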
In practical application, when the user needs to translate text in another scene, the user only needs to move the mobile terminal so that the pickup area of the video capture device is aimed at the text to be translated. At this point, the mobile terminal detects that the ratio of the number of feature points contained in the current video frame to the number of feature points contained in the reference video frame is less than the preset threshold, redetermines the text to be translated, then executes steps S301 to S304 based on the redetermined text to be translated, and fuses the translation corresponding to the redetermined text to be translated into the current video frame for display to the user.
The video processing method of the embodiment of the present application can determine by itself when the text to be translated should be redetermined, and completes the tracking and translation of the new text to be translated based on the redetermined text. Therefore, when the shooting scene is switched, the user only needs to place the new text to be translated within the pickup area of the video capture device of the mobile terminal to see the translation corresponding to the new text in the video; the entire translation process requires no user operation, thereby improving the user experience.
Based on any of the above embodiments, the video processing method of the embodiment of the present application further includes the following step: if it is detected that the mobile terminal acquiring the video to be translated is in a designated motion state, the text to be translated is redetermined.
In specific implementation, a required action used to indicate that translation should be re-performed can be preset, such as shaking the mobile terminal. When the user shakes the mobile terminal, the text to be translated is redetermined, steps S301 to S304 are executed based on the redetermined text to be translated, and the translation corresponding to the redetermined text to be translated is fused into the current video frame and displayed to the user. The designated motion state is the motion state of the mobile terminal when the user performs the required action; for example, if the required action is shaking the mobile terminal, the text to be translated is redetermined when the detected shake frequency of the mobile terminal exceeds a preset value. Specifically, the motion data of the mobile terminal can be obtained through its built-in sensors, and the obtained motion data is analyzed and processed to determine the shake frequency of the mobile terminal.
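For illustration, a sketch that estimates the shake frequency from buffered accelerometer samples by counting sign changes on the dominant axis; the sampling interface and the 3 Hz trigger value are assumptions.

```python
import numpy as np

def shake_frequency(samples, sample_rate_hz):
    """samples: Nx3 accelerometer readings. The shake frequency is estimated
    as half the sign-change rate of the gravity-removed dominant axis."""
    a = np.asarray(samples, dtype=float)
    a = a - a.mean(axis=0)                     # remove gravity / sensor bias
    s = a[:, np.argmax(a.var(axis=0))]         # dominant shaking direction
    crossings = np.count_nonzero(np.diff(np.signbit(s).astype(np.int8)))
    duration = len(s) / sample_rate_hz
    return crossings / (2.0 * duration)        # full oscillations per second

def is_designated_motion(samples, sample_rate_hz, trigger_hz=3.0):
    return shake_frequency(samples, sample_rate_hz) > trigger_hz
```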
In practical application, when the user wants to see the translation of more content around the currently shot text, the user can adjust the position of the mobile terminal to move it away from the currently shot text, thereby enlarging the pickup area of the video capture device and capturing more of the surrounding content. However, since the originally shot region is still contained in the currently shot region, the currently acquired video frame contains most of the feature points, the first preset condition cannot be met, and the operation of redetermining the text to be translated cannot be triggered. For this situation, the user can perform the required action, such as shaking the mobile terminal; when the detected shake frequency of the mobile terminal exceeds the preset value, the operation of redetermining the text to be translated is triggered, so as to obtain the new text to be translated and its corresponding translation.
Of course, for the above situation, the user can also trigger the operation of redetermining the text to be translated by inputting a specified operation; in response to the specified operation input by the user, the mobile terminal redetermines the text to be translated.
In practical application, after the operation of redetermining the text to be translated is triggered, data such as the previously determined text to be translated, translation, mapping relationships and feature points can be deleted. Alternatively, the data such as the texts to be translated, translations, mapping relationships and feature points determined in the most recent several rounds can be retained. For example, when the user switches from shooting scene 1 to shooting scene 2, the data such as the text to be translated, feature points and mapping relationships corresponding to shooting scene 2 are redetermined; when the user switches back to shooting scene 1, the stored data such as the text to be translated, translation, mapping relationships and feature points corresponding to shooting scene 1 can be used to translate and display the text to be translated in shooting scene 1, without redetermining the text to be translated, which avoids repeated translation and improves processing efficiency.
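A sketch of such retention using a small least-recently-used cache of per-scene data; the SceneData fields mirror the items listed above, and the capacity of three scenes is an illustrative choice.

```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class SceneData:                      # items the text above says to retain
    source_text: str
    translation: str
    mapping: object                   # mapping relationship (e.g. homography)
    keypoints: object                 # feature points of the reference frame

class SceneCache:
    """Keeps the most recent scenes so switching back skips re-translation."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, scene_id):
        data = self._store.get(scene_id)
        if data is not None:
            self._store.move_to_end(scene_id)   # mark as recently used
        return data

    def put(self, scene_id, data: SceneData):
        self._store[scene_id] = data
        self._store.move_to_end(scene_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict the oldest scene
```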
In specific implementation, when it is determined in any of the above manners that the text to be translated needs to be redetermined, the terminal device reacquires a reference video frame from the currently collected video to be translated, sends a translation request carrying the reference video frame to the server, and obtains the text to be translated determined from the reference video frame returned by the server; the specific implementation can refer to steps S202 and S203. Alternatively, the mobile terminal can reacquire a reference video frame from the video to be translated and invoke a local OCR engine to determine the text to be translated from the reference video frame.
Based on any of the above embodiments, the text to be translated can be determined from the reference video frame by the following method: identifying at least one text paragraph from the reference video frame; determining the language corresponding to each text paragraph; and determining at least one text to be translated based on the text paragraphs that are not in the target language.
For example, when the target language is English, only the text paragraphs that are not in English need to be translated; therefore, the texts to be translated are determined based on the text paragraphs that are not in English. The method of determining the texts to be translated based on text paragraphs will be described in detail later.
In specific implementation, the language corresponding to each text paragraph can be determined by the following method: performing N-gram feature extraction on the text paragraph to obtain several text segments; and inputting the several text segments corresponding to the text paragraph into a pre-trained language identification model to obtain the language corresponding to the text paragraph.
Here, performing N-gram feature extraction on a text paragraph means sliding a window of size N over the characters of the text paragraph, obtaining several text segments each containing N characters. In English one character is one letter, while in Chinese one character is one Chinese character. Assuming N is 4, for the word "China" the two segments "Chin" and "hina" are obtained; for the Chinese sentence "我是中国人" ("I am Chinese"), the two segments "我是中国" and "是中国人" are obtained.
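A minimal sketch of this sliding-window extraction; with n = 4 it reproduces the examples above.

```python
def char_ngrams(text: str, n: int = 4):
    """Slide a window of size n over the characters of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# char_ngrams("China")      -> ['Chin', 'hina']
# char_ngrams("我是中国人")  -> ['我是中国', '是中国人']
```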
In specific implementation, the language identification model can use a Bayesian model. Specifically, the Bayesian model computes the probability that the several text segments correspond to each candidate language, and the candidate language with the highest probability is selected as the language corresponding to the text paragraph. The language identification model is obtained by machine-learning training on existing language data using the Bayesian model, which yields a comparatively stable model; the trained language identification model can automatically identify the language corresponding to a text. Here, language data refers to text labeled with its language in advance.
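For illustration, such a Bayesian classifier can be sketched with scikit-learn's multinomial naive Bayes over character 4-grams; the tiny labeled corpus below stands in for the pre-labeled language data and is purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Character 4-gram counts feeding a multinomial naive Bayes classifier.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(4, 4)),
    MultinomialNB(),
)

# Illustrative language data: texts labeled with their language in advance.
texts = ["复方阿嗪米特肠溶片", "用于胆汁分泌不足", "enteric-coated tablets",
         "insufficient bile secretion"]
labels = ["zh", "zh", "en", "en"]
model.fit(texts, labels)

# The candidate language with the highest probability is selected.
print(model.predict(["胆汁分泌不足引起的症状"]))   # -> ['zh']
```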
In specific implementation, the language identification model can also be obtained by training a neural network, i.e., the neural network is trained with a sample set containing a large number of texts labeled with their languages in advance. N-gram feature extraction is performed on a sample to obtain several text segments, and these text segments are input into the neural network to obtain a predicted value corresponding to the sample, the predicted value characterizing the probability that the sample corresponds to each language. The loss value between the label and the predicted value of the sample is computed by a loss function, the gradient of each weight parameter in the neural network is computed through back-propagation of the loss, and each weight parameter in the neural network is updated based on the gradients. The above training steps are executed in a loop until a satisfactory neural network, i.e., the final language identification model, is obtained; the trained language identification model can automatically identify the language corresponding to a text. The specific training method of neural networks is prior art and will not be repeated here.
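A compact sketch of that training loop, written here in PyTorch with a bag of hashed character 4-grams as input features; the hashing dimension, network shape and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

DIM, N_LANGS = 4096, 2

def featurize(text, n=4, dim=DIM):
    """Bag of hashed character n-grams as a fixed-size float vector."""
    v = torch.zeros(dim)
    for i in range(len(text) - n + 1):
        v[hash(text[i:i + n]) % dim] += 1.0
    return v

model = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, N_LANGS))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()          # compares predicted values with labels

def train(samples, labels, epochs=10):
    x = torch.stack([featurize(s) for s in samples])
    y = torch.tensor(labels)
    for _ in range(epochs):              # loop until satisfactory
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                  # back-propagate gradients
        optimizer.step()                 # update each weight parameter
```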
Based on any of the above embodiments, the text to be translated is determined from the reference video frame by the following method: if one text paragraph is identified from the reference video frame, that text paragraph is determined as the text to be translated; if multiple text paragraphs are identified from the reference video frame, the multiple text paragraphs are merged according to a preset strategy, and the merged text paragraphs are determined as the texts to be translated.
In specific implementation, text paragraphs can be merged in the following manner: if the text parameters corresponding to multiple adjacent text paragraphs in the reference video frame meet a second preset condition, the adjacent text paragraphs are merged into one text paragraph, and the merged text paragraph is determined as a text to be translated.
In the embodiment of the present application, the text parameters include, but are not limited to, at least one of the following: the position information of the text paragraph in the reference video frame, the language corresponding to the text paragraph, the font corresponding to the text paragraph, the text size corresponding to the text paragraph, and the text color corresponding to the text paragraph. The position information of a text paragraph in the reference video frame can be a series of coordinate values of the text paragraph in the reference video frame obtained by the OCR engine; based on this series of coordinate values, the region corresponding to the text to be translated in the reference video frame is determined. The second preset condition can be determined according to the selected text parameters.
When the text parameter is the position information of the text paragraph in the reference video frame, the distance between adjacent text paragraphs is determined according to their corresponding position information, and when the distance is less than a distance threshold, the adjacent text paragraphs are merged into one text paragraph. The distance threshold can be predetermined by those skilled in the art in combination with test data.
When the text parameter is the language corresponding to the text paragraph, adjacent text paragraphs can be merged into one text paragraph if their corresponding languages are the same.
When the text parameter is the font corresponding to the text paragraph, adjacent text paragraphs can be merged into one text paragraph if their corresponding fonts are the same.
When the text parameter is the text size corresponding to the text paragraph, adjacent text paragraphs can be merged into one text paragraph if their corresponding text sizes are the same.
When the text parameter is the text color corresponding to the text paragraph, adjacent text paragraphs can be merged into one text paragraph if their corresponding text colors are the same.
In specific implementation, whether to merge adjacent text paragraphs can also be decided by combining several of the above text parameters, so as to improve the accuracy of the merging.
For example, when the text parameters include the position information of the text paragraphs in the reference video frame, the distance between two adjacent text paragraphs is calculated according to their corresponding position information; if the distance is not less than the distance threshold, the two text paragraphs are not merged. If the distance is less than the distance threshold, it is judged whether the languages corresponding to the two text paragraphs are the same; if the languages differ, the two text paragraphs are not merged. If the languages are the same, it is judged whether the fonts, text colors and text sizes corresponding to the two text paragraphs are the same; if not, the two text paragraphs are not merged, and if they are the same, the two adjacent text paragraphs are merged into one text paragraph.
In practical application, due to shooting angle, lighting and other factors, the text color and text size recognized for text of the same actual color and size may deviate. Therefore, when the difference between the text colors corresponding to two text paragraphs is less than a first error value, the text colors of the two text paragraphs are considered the same; when the difference between the text sizes corresponding to two text paragraphs is less than a second error value, the text sizes of the two text paragraphs are considered the same. The first error value and the second error value can be determined by those skilled in the art according to test results, and the embodiment of the present application does not limit them.
In specific implementation, adjacent text paragraphs can be merged one by one by the above method until the text parameters of no two text paragraphs meet the second preset condition; at this point, each text paragraph obtained by the final merging is determined as a text to be translated, i.e., each finally merged text paragraph corresponds to one text to be translated.
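A sketch of this merge-to-fixpoint procedure; the Paragraph fields mirror the text parameters above, the distance is simplified to the vertical gap between stacked paragraphs, and the threshold and error values are illustrative numbers.

```python
from dataclasses import dataclass

@dataclass
class Paragraph:
    text: str
    box: tuple        # (x, y, w, h) from the OCR engine
    language: str
    font: str
    size: float
    color: tuple      # (r, g, b)

def gap(a, b):
    """Vertical gap between two boxes (simplified to stacked paragraphs)."""
    return max(b.box[1] - (a.box[1] + a.box[3]), 0)

def should_merge(a, b, dist_thresh=12, color_err=30, size_err=4):
    """Second preset condition: close together and same language, font,
    text size (within the second error value) and text color (within the
    first error value)."""
    return (gap(a, b) < dist_thresh
            and a.language == b.language
            and a.font == b.font
            and abs(a.size - b.size) < size_err
            and max(abs(x - y) for x, y in zip(a.color, b.color)) < color_err)

def merge_paragraphs(paras):
    """Merge adjacent paragraphs until no pair meets the second condition."""
    merged = sorted(paras, key=lambda p: p.box[1])   # top-to-bottom order
    changed = True
    while changed:
        changed = False
        for i in range(len(merged) - 1):
            a, b = merged[i], merged[i + 1]
            if should_merge(a, b):
                x = min(a.box[0], b.box[0]); y = a.box[1]
                w = max(a.box[0] + a.box[2], b.box[0] + b.box[2]) - x
                h = b.box[1] + b.box[3] - y
                merged[i:i + 2] = [Paragraph(a.text + " " + b.text,
                                             (x, y, w, h), a.language,
                                             a.font, a.size, a.color)]
                changed = True
                break
    return merged     # each result is one text to be translated
```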
With reference to Fig. 8A, multiple text paragraphs are determined from the current video frame; each dotted-line box in Fig. 8A corresponds to one text paragraph. Text paragraph 802-1 differs in color from text paragraph 802-2, so they are not merged; text paragraph 802-2 differs in language from text paragraph 802-3, so they are not merged; the distance between text paragraph 802-3 and text paragraph 802-4 exceeds the distance threshold, so they are not merged; the distance between text paragraph 802-4 and text paragraph 802-5 is less than the distance threshold, and the two paragraphs have the same language, font, text size and text color, so text paragraph 802-4 and text paragraph 802-5 are merged into one text paragraph; the distance between text paragraph 802-5 and text paragraph 802-6 exceeds the distance threshold, so they are not merged. Five merged text paragraphs are finally obtained from the reference video frame shown in Fig. 8A: the drug brand name, "compound azintamide enteric-coated tablets", "Compound Azimtamide Enteric-coated Tablets", "used for symptoms caused by insufficient bile secretion or lack of digestive enzymes" and "20 tablets per sheet". Since the preset target language is English, four texts to be translated are finally determined: the brand name, "compound azintamide enteric-coated tablets", "used for symptoms caused by insufficient bile secretion or lack of digestive enzymes" and "20 tablets per sheet". These four texts to be translated are translated into English, yielding the translations "Saudi compound azinimide", "Compound Azimtamide Enteric-coated Tablets" and "Used for symptoms caused by insufficient bile secretion or lack of digestive enzymes", and the textures corresponding to the translations are then determined. The target areas corresponding to the four texts to be translated are determined in the video frame shown in Fig. 8B, the translation region in each texture is determined based on the target area, and the corresponding target areas are overlaid with the translation regions, yielding the video frame fused with translations shown in Fig. 8B.
As shown in Fig. 8A, if these two text paragraphs were not merged into one paragraph, the two separate translations "Used for insufficient bile secretion" and "Or symptoms caused by deficiency of digestive enzymes" would be obtained, making the translation result inaccurate and impairing the user's reading.
In practical application, because of typesetting, the same sentence or even the same word in a picture may not lie on the same line; in this case, the same sentence or word may be recognized as different texts to be translated, causing semantic misunderstanding during translation and thereby reducing the accuracy of the translation or even causing the translation to fail. For this reason, the video processing method of the embodiment of the present application merges the multiple text paragraphs identified from the video to be translated before translation: text paragraphs meeting the preset strategy are merged into one text paragraph and treated as one text to be translated, and the determined text to be translated is then translated. In this way the context is fully taken into account, improving the accuracy of the translation.
As shown in Fig. 9, based on the same inventive concept as the above video processing method, the embodiment of the present application also provides a video processing apparatus 90, which includes an acquisition module 901, a target area determining module 902, a translation region determining module 903 and a fusion module 904.
The acquisition module 901 is configured to acquire the current video frame of the video to be translated collected in real time.
The target area determining module 902 is configured to determine the target area of the text to be translated contained in the current video frame.
The translation region determining module 903 is configured to determine the translation region corresponding to the target area in the translation of the text to be translated.
The fusion module 904 is configured to display the current video frame after overlaying the target area with the translation region.
Optionally, the video processing apparatus 90 of the embodiment of the present application further includes a mapping module, configured to: extract feature points from the reference video frame from which the text to be translated is determined, the reference video frame being any video frame preceding the current video frame in the video to be translated; and determine the mapping relationship between the region corresponding to the text to be translated in the reference video frame and the position information of the feature points in the reference video frame.
Correspondingly, the target area determining module 902 is specifically configured to: determine the position information of the feature points in the current video frame; and determine the target area of the text to be translated in the current video frame according to the mapping relationship corresponding to the text to be translated and the position information of the feature points in the current video frame.
Optionally, the fusion module 904 is specifically configured to: adjust the size and shape of the translation region according to the target area; and overlay the target area with the adjusted translation region.
Optionally, the video processing apparatus 90 of the embodiment of the present application further includes a texture processing module, configured to: extract the background texture of the region corresponding to the text to be translated in the reference video frame; and determine the extracted background texture as the background texture corresponding to the translation.
Optionally, the video processing apparatus 90 of the embodiment of the present application further includes a text identification module, configured to redetermine the text to be translated in response to a specified operation input by the user, or upon detecting that the mobile terminal acquiring the video to be translated is in a designated motion state, or upon detecting that the picture change in the video to be translated meets the first preset condition.
Optionally, the text identification module is specifically configured to: reacquire a reference video frame from the video to be translated, send a translation request carrying the reference video frame to the server, and obtain the text to be translated determined from the reference video frame returned by the server.
Optionally, the text identification module is specifically configured to: reacquire a reference video frame from the video to be translated, and determine the text to be translated from the reference video frame.
Optionally, the text identification module is specifically configured to: if multiple text paragraphs are identified from the reference video frame, merge the multiple text paragraphs according to the preset strategy, and determine the merged text paragraphs as the texts to be translated.
Optionally, the text identification module is specifically configured to: if the text parameters corresponding to multiple adjacent text paragraphs in the reference video frame meet the second preset condition, merge the adjacent text paragraphs into one text paragraph, and determine the merged text paragraph as the text to be translated; wherein the text parameters include at least one of the following: the position information of the text paragraph in the reference video frame, the language corresponding to the text paragraph, the font corresponding to the text paragraph, the text size corresponding to the text paragraph, and the text color corresponding to the text paragraph.
Optionally, the text identification module is specifically configured to: identify at least one text paragraph from the reference video frame; determine the language corresponding to the text paragraph; and determine at least one text to be translated based on the text paragraphs that are not in the target language, the target language being the language of the translation.
Optionally, the video processing apparatus 90 of the embodiment of the present application further includes a language identification module, configured to: perform N-gram feature extraction on the text paragraph to obtain several text segments; and input the several text segments corresponding to the text paragraph into the pre-trained language identification model to obtain the language corresponding to the text paragraph.
The video processing apparatus provided by the embodiment of the present application adopts the same inventive concept as the above video processing method and can achieve the same beneficial effects, which will not be repeated here.
Based on the same inventive concept as the above video processing method, the embodiment of the present application also provides an electronic device, which may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a personal digital assistant (PDA), a server, a VR device, an AR device, etc. As shown in Fig. 10, the electronic device 100 may include a processor 1001 and a memory 1002.
The processor 1001 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 1002, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The memory may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, optical disc, etc. The memory may be any medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1002 in the embodiment of the present application may also be a circuit or any other device capable of realizing a storage function, for storing program instructions and/or data.
The embodiment of the present application provides a computer-readable storage medium for storing the computer program instructions used by the above electronic device, which contains a program for executing the above video processing method.
The above computer storage medium may be any available medium or data storage device accessible to a computer, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disc (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.) and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state drive (SSD), etc.).
The above embodiments only describe the technical solutions of the present application in detail; they are merely intended to help understand the method of the embodiments of the present application and shall not be construed as limiting the embodiments of the present application. Any changes or substitutions readily conceivable by those skilled in the art shall fall within the protection scope of the embodiments of the present application.

Claims (13)

1. A video processing method, comprising:
acquiring a current video frame of a video to be translated collected in real time;
determining a target area of a text to be translated contained in the current video frame;
determining a translation region corresponding to the target area in a translation of the text to be translated; and
displaying the current video frame after overlaying the target area with the translation region.
2. The method according to claim 1, further comprising:
extracting feature points from a reference video frame from which the text to be translated is determined, the reference video frame being any video frame preceding the current video frame in the video to be translated; and
determining a mapping relationship between a region corresponding to the text to be translated in the reference video frame and position information of the feature points in the reference video frame;
wherein the determining the target area of the text to be translated contained in the current video frame specifically comprises:
determining position information of the feature points in the current video frame; and
determining the target area of the text to be translated in the current video frame according to the mapping relationship corresponding to the text to be translated and the position information of the feature points in the current video frame.
3. The method according to claim 2, wherein the overlaying the target area with the translation region specifically comprises:
adjusting a size and a shape of the translation region according to the target area; and
overlaying the target area with the adjusted translation region.
4. The method according to claim 3, further comprising:
extracting a background texture of the region corresponding to the text to be translated in the reference video frame; and
determining the extracted background texture as a background texture corresponding to the translation.
5. The method according to any one of claims 1 to 4, further comprising:
redetermining the text to be translated in response to a specified operation input by a user, or upon detecting that a mobile terminal acquiring the video to be translated is in a designated motion state, or upon detecting that a picture change in the video to be translated meets a first preset condition.
6. The method according to claim 5, wherein the redetermining the text to be translated specifically comprises:
reacquiring a reference video frame from the video to be translated, sending a translation request carrying the reference video frame to a server, and obtaining the text to be translated determined from the reference video frame returned by the server;
or,
reacquiring a reference video frame from the video to be translated, and determining the text to be translated from the reference video frame.
7. The method according to claim 6, wherein the text to be translated is determined from the reference video frame by the following method:
if multiple text paragraphs are identified from the reference video frame, merging the multiple text paragraphs according to a preset strategy, and determining the merged text paragraphs as texts to be translated.
8. The method according to claim 7, wherein the merging the multiple text paragraphs according to the preset strategy and determining the merged text paragraphs as texts to be translated specifically comprises:
if text parameters corresponding to multiple adjacent text paragraphs in the reference video frame meet a second preset condition, merging the adjacent text paragraphs into one text paragraph, and determining the merged text paragraph as a text to be translated;
wherein the text parameters comprise at least one of the following: position information of the text paragraph in the reference video frame, a language corresponding to the text paragraph, a font corresponding to the text paragraph, a text size corresponding to the text paragraph, and a text color corresponding to the text paragraph.
9. The method according to claim 6, wherein the text to be translated is determined from the reference video frame by the following method:
identifying at least one text paragraph from the reference video frame;
determining a language corresponding to the text paragraph; and
determining at least one text to be translated based on the text paragraphs that are not in a target language, the target language being the language of the translation.
10. The method according to claim 9, wherein the determining the language corresponding to the text paragraph specifically comprises:
performing N-gram feature extraction on the text paragraph to obtain several text segments; and
inputting the several text segments corresponding to the text paragraph into a pre-trained language identification model to obtain the language corresponding to the text paragraph.
11. A video processing apparatus, comprising:
an acquisition module, configured to acquire a current video frame of a video to be translated collected in real time;
a target area determining module, configured to determine a target area of a text to be translated contained in the current video frame;
a translation region determining module, configured to determine a translation region corresponding to the target area in a translation of the text to be translated; and
a fusion module, configured to display the current video frame after overlaying the target area with the translation region.
12. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 10.
13. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.
