CN118116022A - Image processing method, intelligent terminal, device, medium and program product - Google Patents

Image processing method, intelligent terminal, device, medium and program product

Info

Publication number
CN118116022A
Authority
CN
China
Prior art keywords
gesture
resolution
images
region
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410236725.4A
Other languages
Chinese (zh)
Inventor
王淼军
郝冬宁
陈芳
寸毛毛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Xingji Meizu Group Co ltd
Original Assignee
Hubei Xingji Meizu Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Xingji Meizu Group Co ltd filed Critical Hubei Xingji Meizu Group Co ltd
Priority to CN202410236725.4A priority Critical patent/CN118116022A/en
Publication of CN118116022A publication Critical patent/CN118116022A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G06V30/413 - Classification of content, e.g. text, photographs or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/94 - Hardware or software architectures specially adapted for image or video understanding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/146 - Aligning or centring of the image pick-up or image-field
    • G06V30/147 - Determination of region of interest

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Provided are an image processing method of an intelligent terminal, an image processing method of a computing device, an electronic device, a non-transitory storage medium, and a computer program product. The image processing method of the intelligent terminal comprises the following steps: obtaining, by an intelligent terminal, a plurality of first images of a first resolution; transmitting the plurality of first images to a computing device connected to the intelligent terminal; receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; triggering the intelligent terminal to shoot a second image of a second resolution, and intercepting a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and sending, by the intelligent terminal, the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.

Description

Image processing method, intelligent terminal, device, medium and program product
Technical Field
The present application relates to the field of intelligent terminals, and more particularly, to an image processing method of an intelligent terminal, an image processing method of a computing device, an electronic device, a non-transitory storage medium, and a computer program product.
Background
Conventional mobile smart devices are typically centered on the mobile phone. With the development of the Internet of Everything, mobile smart devices are gradually transitioning to wearable smart devices. Smart glasses, as one type of wearable device, provide many smart functions, such as immersive visual rendering (Virtual Reality (VR), Augmented Reality (AR), Mixed Reality (MR), etc.), smart travel, smart navigation, and photographing translation. Compared with a mobile phone, smart glasses can conveniently acquire images according to the face orientation of the wearer. For example, for photographing translation, it is more convenient to use smart glasses than a mobile phone.
Disclosure of Invention
According to one aspect of the present application, there is provided an image processing method of an intelligent terminal, including: obtaining, by an intelligent terminal, a plurality of first images of a first resolution; transmitting the plurality of first images to a computing device connected to the intelligent terminal; receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; triggering the intelligent terminal to shoot a second image of a second resolution, and intercepting a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and sending, by the intelligent terminal, the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.
According to another aspect of the present application, there is provided an intelligent terminal, including a photographing device and a transceiver device, wherein the photographing device obtains a plurality of first images of a first resolution, and the transceiver device transmits the plurality of first images to a computing device connected to the intelligent terminal; the transceiver device receives a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; the photographing device is triggered to shoot a second image of a second resolution and intercepts a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and the transceiver device sends the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.
According to another aspect of the present application, there is provided an image processing method including: receiving a plurality of first images of a first resolution sent from an intelligent terminal; recognizing a first gesture in the plurality of first images and determining a start point and an end point of a region of the plurality of first images pointed to by the first gesture based on the recognized first gesture; triggering the intelligent terminal to shoot a second image of a second resolution and sending the start point and the end point of the region pointed to by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest of the second resolution according to the start point and the end point of the region pointed to by the first gesture, wherein the region of interest is jointly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution; and receiving the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
According to another aspect of the present application, there is provided a computing device comprising: a receiving device configured to receive a plurality of first images of a first resolution transmitted from an intelligent terminal; a recognition device configured to recognize a first gesture in the plurality of first images and determine a start point and an end point of a region of the plurality of first images pointed to by the first gesture according to the recognized first gesture; and a transmitting device configured to trigger the intelligent terminal to shoot a second image of a second resolution and to transmit the start point and the end point of the region pointed to by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest of the second resolution according to the start point and the end point of the region pointed to by the first gesture, wherein the region of interest is jointly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution; and the computing device receives the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
According to another aspect of the present application, there is provided an electronic apparatus including: a memory for storing instructions; a processor for reading the instructions in the memory and performing a method according to an embodiment of the application.
According to another aspect of the application, there is provided a non-transitory storage medium having instructions stored thereon, wherein the instructions, when read by a processor, cause the processor to perform a method according to an embodiment of the application.
According to another aspect of the application, there is provided a computer program product comprising computer instructions, wherein the instructions, when read by a processor, cause the processor to perform a method according to an embodiment of the application.
In this way, by transmitting the low-resolution first images and using them to identify the pointed region, only the part of the high-resolution second image related to the pointed region is translated instead of the whole image. This reduces the amount of transmitted data, reduces the amount of data used to identify the pointed region, translates only the pointed region in a targeted manner, and ensures the accuracy of the translation of the pointed region.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings required for the embodiments or the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that a person of ordinary skill in the art may obtain other drawings from these drawings without inventive effort.
Fig. 1 illustrates a scene graph to which various embodiments in accordance with the application are applied.
Fig. 2 shows a flowchart of an image processing method of an intelligent terminal according to an embodiment of the present application.
Fig. 3 shows an exemplary diagram of a first gesture with a finger of a human right hand as a pointing object according to an embodiment of the present application.
FIG. 4A illustrates a flowchart of a process by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture, in accordance with an embodiment of the present application.
FIG. 4B illustrates a flowchart of another process by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture, in accordance with an embodiment of the present application.
Fig. 5A shows a schematic diagram of the upper left and lower right points of the rectangular box with the start and end points of the region pointed to by the first gesture as the region of interest, respectively, according to an embodiment of the present application.
Fig. 5B shows a schematic diagram of the starting point and the ending point of the region pointed to by the first gesture as the lower left and lower right points of the rectangular box of the region of interest (the rectangular box of the region of interest has a predetermined height), respectively, according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating a process of rendering a broadcast of a translation by an intelligent terminal according to an embodiment of the present application.
Fig. 7 is a schematic diagram illustrating rendering and broadcasting of translations by an intelligent terminal according to an embodiment of the present application.
FIG. 8 illustrates a flowchart of a method of image processing for a computing device according to an embodiment of the application.
Fig. 9 shows a block diagram of a smart terminal according to an embodiment of the present application.
FIG. 10 illustrates a block diagram of a computing device according to an embodiment of the application.
FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the application.
FIG. 12 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the application.
Detailed Description
Reference will now be made in detail to the present embodiments of the application, examples of which are illustrated in the accompanying drawings. While the application will be described in conjunction with the specific embodiments, it will be understood that it is not intended to limit the application to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the application as defined by the appended claims. It should be noted that the method steps described herein may be implemented by any functional block or arrangement of functions, and any functional block or arrangement of functions may be implemented as a physical entity or a logical entity, or a combination of both.
As a wearable device, smart glasses are designed to be fashionable and lightweight. The current touch-based physical interaction of smart glasses is inconvenient, and the computing power of the glasses themselves is limited, so inference computation of Artificial Intelligence (AI) models and the like cannot be executed directly on the glasses. In addition, considering the battery life of the smart glasses, many time-consuming and computation-heavy processes are generally completed by means of interconnection with external devices, such as interconnection of the smart glasses with an external mobile phone.
In the existing photographing-translation process based on interconnection of smart glasses and a mobile phone, the smart glasses acquire preview images captured by the glasses and display them in real time. When photographing translation is needed, the wearer aims at the object to be translated and taps a touch pad on the side of the smart glasses to trigger photographing; the smart glasses acquire the photographed image and transmit it to the mobile phone, and after the mobile phone performs image translation (including Optical Character Recognition (OCR) and text translation), the translation result is returned to the smart glasses. After the smart glasses receive the translation result, a Text-to-Speech (TTS) function is called to convert the text into speech, which is then broadcast by voice.
However, the existing whole photographing translation process has the following problems:
(1) Image scaling affects the translation effect. Existing photographing translation often transmits the whole image to the mobile phone over the interconnection for image translation, so the amount of transmitted data is large. To reduce the time consumed on the transmission link, the high-resolution image is sometimes scaled down, reducing the amount of transmitted data by lowering the resolution. However, reducing the resolution lowers the image definition, which affects the text-extraction result and thus the translation result.
(2) Redundant data is transmitted and translation backfill deforms the layout. In the prior art, the whole image is transmitted to the mobile phone for OCR text recognition and translation of the characters in the image, but in practice a user may only need to translate part of the text content in the image. Moreover, existing image translation involves post-processing in which the translation is backfilled into the image; after translation, the character lengths of the source language and the target language may not match, so backfilling the source-language text with the target-language translation deforms the layout.
(3) Translation speed is affected and the amount of returned data is large. Existing image translation performs OCR text recognition and translation on the whole image, so translation is slow and many translated text results are returned, making the amount of data fed back over the interconnection large.
(4) The user operation is complicated and time-consuming. Generally, after the glasses obtain the translation result, a cloud TTS service is called for text broadcasting; the existing approach displays the TTS texts as a list, and the user slides up and down on the touch pad to select what to broadcast. When the whole image contains a lot of text, the user has to find the text to be translated first and then broadcast it.
Aiming at the above shortcomings of the existing photographing-translation technology based on interconnection of smart glasses and a mobile phone, the present disclosure provides an image processing method of an intelligent terminal, comprising: obtaining, by an intelligent terminal, a plurality of first images of a first resolution; transmitting the plurality of first images to a computing device connected to the intelligent terminal; receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; triggering the intelligent terminal to shoot a second image of a second resolution, and intercepting a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and sending, by the intelligent terminal, the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.
In this way, by transmitting the low-resolution first images and using them to identify the pointed region, only the part of the high-resolution second image related to the pointed region is translated instead of the whole image. This reduces the amount of transmitted data, reduces the amount of data used to identify the pointed region, translates only the pointed region in a targeted manner, and ensures the accuracy of the translation of the pointed region.
Fig. 1 illustrates a scene graph to which various embodiments in accordance with the application are applied.
As shown in fig. 1, the smart terminal 110 is, for example, smart glasses, on which a camera 111 for capturing an image is arranged. The computing device 120 is, for example, a mobile terminal such as a cell phone, tablet computer, or the like. In some embodiments, the computing device 120 may also be a cloud-side computer (e.g., a cloud-side server).
In applying various embodiments of the present application, the camera 111 of the intelligent terminal 110 is capable of acquiring successive preview images (e.g., third images as subsequently mentioned herein) of the medium 130 (e.g., paper, a page, an electronic screen, etc.). The successive preview images can be displayed in real time on the intelligent terminal (e.g., via VR, AR, etc., or on a glasses lens, etc.) at a preview resolution (e.g., a third resolution as later referred to herein, e.g., 1080 x 720). At the same time, the continuous preview images may be scaled down into continuous thumbnails (e.g., the first images referred to later herein, having the first resolution referred to later herein, e.g., 320 x 240) that can be transmitted to the computing device 120, so that the computing device 120 can recognize that a particular gesture (e.g., a translation gesture) of the pointing object (e.g., hand) 140 is present in the thumbnails, recognize the region of interest 131 to be translated that is pointed to by the pointing object, and inform the intelligent terminal 110 of information about the region of interest 131 to be translated.
The camera 111 of the intelligent terminal 110 is also capable of taking an image of the object to be translated (e.g., the region of interest 131 to be translated on the medium 130) at a second resolution mentioned later herein, e.g., 1920 x 1080 (this photographed image being, e.g., the second image mentioned later herein), and of cropping the part corresponding to the region of interest 131 in Fig. 1 from it for transmission to the computing device 120.
After performing the translation of that part of the image, including Optical Character Recognition (OCR) and text translation, the computing device 120 returns the translation result to the intelligent terminal 110.
After the translation result is received by the intelligent terminal 110, a Text-to-Speech (TTS) function of a cloud, a local module, or another device (e.g., the computing device 120) can be invoked to convert the translation result into speech; the translation result is then rendered and displayed on the intelligent terminal 110 with synchronous voice broadcasting.
Specific details of various embodiments of the disclosure are described in further detail below in conjunction with the figures and embodiments.
Fig. 2 shows a flowchart of an image processing method 200 of the intelligent terminal according to an embodiment of the present application.
Here, the smart terminal may include a wearable device with a camera or a video camera, such as smart glasses. Herein, smart glasses are described as an example. Related application programs can be installed on the intelligent terminal so as to perform functions of setting, displaying, photographing, transmitting, voice broadcasting and the like.
Here, the computing device may include a mobile terminal (or cloud server) with processing capabilities, such as a mobile phone, a tablet computer, etc., on which related applications may be installed for performing functions of image recognition, image text conversion, text translation, etc. In this context, a mobile phone is described as an example.
After a user wears an intelligent terminal such as smart glasses, a camera on the smart glasses may acquire images of the scene in front of the glasses. If the user is reading a foreign-language article, such as English, on a medium such as a book or tablet computer, the camera on the smart glasses may acquire a continuous sequence of images of the medium including the English article; and if the user points to a specific English sentence or paragraph on the medium with a finger or another pointing object such as a pointer pen, the camera may acquire images of the medium that include both the English article and the user's finger or pointer pen.
The preview attribute of the camera of the smart glasses and the photographing attribute of the camera of the smart glasses may be set on the smart terminal in advance, for example. Here, "preview" means that real-time images acquired by a camera are continuously acquired and continuously displayed as preview images, and can be regarded as displaying moving images. And "take a picture" refers to triggering the capture of (at least one) still image.
The preview attributes mainly include resolution of the preview image, sampling frame rate of the preview image, and the like. The photographing attribute mainly includes resolution of a photographed image, and the like.
Assuming the user sees an English word, sentence, or paragraph to be translated, the most convenient action for the user is to sweep across the English portion to be translated with a finger or a pointing object such as a pointer pen (e.g., from left to right). Thus, herein, it is desirable to be able to trigger OCR recognition and translation of that English portion by recognizing such a translation gesture of the pointing object in real time.
Herein, it is desirable to perform real-time recognition of a particular gesture, such as a translation gesture, at the computing device based on the preview image. If the resolution is set too low, the near-eye display on the smart glasses is not clear; but if the resolution is set too high, a large amount of data has to be transmitted to the computing device over the interconnection, slowing down transmission so that the frame rate required for translation-gesture recognition cannot be achieved. Balancing these considerations, the resolution of the preview image (e.g., the third resolution herein) is preferably set to, e.g., 1080 x 720 to meet the near-eye display definition.
The preview image (e.g., the third image herein) may be scaled before transmission to the computing device to obtain a low-resolution (e.g., the first resolution herein) thumbnail (e.g., the first image herein) for transmission, to reduce the amount of transmitted data and ensure a fast image-transmission frame rate. The resolution of the thumbnail only needs to meet the requirement of translation-gesture recognition, so in practice it can be set to 320 x 240; the transmission frame rate can then be 25 to 30 fps, and key-point detection and translation-gesture recognition can be completed normally.
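As an illustration of this scaling step, the following is a minimal sketch, assuming OpenCV-style image arrays and the example resolutions mentioned above (1080 x 720 preview, 320 x 240 thumbnail); the function name and constants are illustrative assumptions, not part of the present application.

```python
import cv2

THUMBNAIL_SIZE = (320, 240)  # first resolution (width, height); example value from the text

def make_thumbnail(preview_frame):
    """Downscale a preview frame (third image) into a thumbnail (first image)
    before sending it over the interconnection, reducing the transmitted data."""
    # INTER_AREA is the usual choice when shrinking an image.
    return cv2.resize(preview_frame, THUMBNAIL_SIZE, interpolation=cv2.INTER_AREA)
```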
Since OCR and translation etc. need to be performed based on a photographed image (e.g., the second image herein), a resolution as large as possible may be set to ensure sharpness for more accurate OCR; in practical use, the resolution of the photographed image may be set to 1920 x 1080 (e.g., the second resolution herein).
It can be seen that the second resolution of the photographed image is greater than the third resolution of the preview image, which is greater than the first resolution of the thumbnail of the preview image.
Thus, in one embodiment, before step 210, a plurality of third images for preview (e.g., preview images) are acquired in real time by the intelligent terminal at a third resolution; the plurality of third images are scaled down into a plurality of first images (e.g., thumbnails to be sent to the computing device for particular-gesture recognition), where the third resolution is greater than the first resolution and less than the second resolution.
The interconnection channel between the intelligent terminal and the computing device may be established when the intelligent terminal and the computing device are powered on, before step 210, before step 220, or when needed. The interconnection may include short-range wireless communication such as Bluetooth.
As shown in fig. 2, the image processing method 200 includes at least steps 210-250.
At step 210, a plurality of first images of a first resolution are obtained by a smart terminal.
As above, the plurality of first images may be scaled from a (dynamically continuous) plurality of third images as preview images, where scaling refers to reducing the resolution.
Of course, the preview image may also be directly taken as the first image without a scaling step, which may depend on the resolution of the preview image and the requirements of the transmission rate (e.g., the resolution of the preview image is already small, or the transmission bandwidth between the intelligent terminal and the computing device is large enough to quickly transmit a large number of preview images, etc.).
At step 220, a plurality of first images are sent to a computing device connected to the intelligent terminal.
Here, the computing resources and computing power of a computing device such as a mobile phone are generally higher than those of a smart terminal such as smart glasses, so the computing device may be utilized to perform the recognition calculations for gesture recognition. The computing device may identify a first gesture in the plurality of first images and determine the start point and end point of the region of the plurality of first images pointed to by the first gesture. The first gesture may be set as a translation action that embodies the user's intent to translate a portion of text. Various image-recognition algorithms, including various machine-learning algorithms, may be employed here for gesture recognition.
Herein, it is assumed that when a user wants to translate a portion of text, the user typically extends at least one finger (e.g., typically the index finger) or holds a pointer pen at the start position of the portion of text to be translated for a period of time, then slides the index finger or pointer pen along the text, and finally keeps the index finger or pointer pen at the end position of the portion of text for an additional period of time. The start position and the end position are used to indicate the portion of the whole text that the user is interested in translating.
Based on the above assumptions, a variety of recognition schemes can be devised that recognize a user's translation intent for a portion of text.
In one embodiment, where a human finger is used as the pointing object, the first gesture may include at least one finger extending and hovering over a fixed location of the plurality of first images beyond a first time threshold.
Fig. 3 shows an exemplary diagram of a first gesture with a finger of a human right hand as a pointing object according to an embodiment of the present application.
As shown in fig. 3, this is the right hand of a person with the index finger of the right hand extended, which is the more common pointing habit of a human hand. Of course, the present disclosure is not limited to the right hand of a person, and if a person is left handed, identification may also be made in a similar manner as set forth herein.
In order to accurately recognize such a gesture, 21 positions on the hand may be defined as 21 key points (not all shown in the drawing). For simplicity, only key points 8, 5, and 17 of the hand are labeled in Fig. 3: key point 8 is located at the tip (fingernail) of the right index finger, key point 5 at the metacarpophalangeal joint of the index finger, and key point 17 at the metacarpophalangeal joint of the little finger.
The "index finger extended" state (which may be referred to as a translation gesture) may be considered identified if two conditions are satisfied in the image: (1) key point 8 is to the upper left of the remaining 20 key points (i.e., all key points other than key point 8 among the 21 hand key points), that is, the abscissa and ordinate of key point 8 in the image are both the minimum among the 21 key points; (2) key point 5 is to the left of key point 17, that is, the abscissa of key point 5 in the image is smaller than that of key point 17.
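A minimal sketch of the two conditions above, assuming the 21 hand key points are available as (x, y) pixel coordinates indexed 0 to 20 with the image origin at the top left; the key-point detector itself is outside the scope of this sketch, and the function name is illustrative.

```python
def is_translation_gesture(keypoints):
    """keypoints: sequence of 21 (x, y) tuples in image coordinates (origin top-left).
    Condition (1): key point 8 (index fingertip) has the minimum x and minimum y
    among all 21 key points, i.e. it lies to the upper left of the others.
    Condition (2): key point 5 (index MCP joint) lies to the left of key point 17
    (little-finger MCP joint)."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    cond1 = keypoints[8][0] == min(xs) and keypoints[8][1] == min(ys)
    cond2 = keypoints[5][0] < keypoints[17][0]
    return cond1 and cond2
```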
Of course, the above manner of identifying the "index finger extended" state uses key points, but the disclosure is not limited thereto; other manners of identifying the "index finger extended" state are also possible in practice, for example using certain machine-learning algorithms. The gesture may also involve the extension of other fingers, or of at least one finger, etc.
After the "index finger extended" condition is identified, it is further determined that the index finger has remained in a fixed position on the plurality of first images beyond a first time threshold. The first time threshold is set to prevent erroneous recognition in the case where the user does not have a translation intention, but simply points somewhere with his hand, and does not stay long enough.
In the case of a pointing pen as a pointing object, the first gesture may include the tip of the pointing pen resting on a fixed location of the plurality of first images for more than a predetermined time threshold. In this case, the tip of the pointing pen may be identified and it is determined that the tip stays in a fixed position on the plurality of first images beyond a predetermined time threshold.
Here, the predetermined time threshold may be set to t_s, with a value range of, for example, [500, 2000] ms. Of course, this is merely an example, and the time threshold may take another value.
To determine whether the gesture dwells at a fixed location in the first images, it is possible to detect whether the coordinates of the index fingertip (e.g., key point 8) or the pen tip stay within a small area, because the fingertip or pen tip is not necessarily completely stationary. A circular region with a radius of 20 to 50 pixels may generally be centered on the coordinates of key point 8 where the "translation gesture" was first detected. If the coordinates of key point 8 of the translation gesture remain within this circular region throughout the period t_s, it can be considered that the translation gesture (extended index finger or pen tip) has stayed at a fixed position in the image for more than time t_s, and it is determined that the first gesture is recognized.
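The dwell check described above could be sketched as follows, assuming per-frame timestamps in milliseconds and one fingertip (or pen-tip) coordinate per frame; the radius and time threshold are example values taken from the ranges given in the text.

```python
import math

DWELL_RADIUS_PX = 30   # example value within the 20-50 pixel range above
DWELL_TIME_MS = 800    # example value of t_s within the [500, 2000] ms range above

class DwellDetector:
    """Reports True once the pointing tip has stayed inside a small circular region
    (centred where the translation gesture was first seen) for DWELL_TIME_MS."""

    def __init__(self):
        self.center = None
        self.start_ms = None

    def update(self, tip_xy, now_ms):
        if self.center is None or math.dist(tip_xy, self.center) > DWELL_RADIUS_PX:
            # First observation, or the tip left the region: restart the timer here.
            self.center = tip_xy
            self.start_ms = now_ms
            return False
        return now_ms - self.start_ms >= DWELL_TIME_MS
```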
Of course, the first gesture of the translation operation may be set to another gesture and recognized by another method, which is not illustrated here. The above-described manner of identifying the extension of the index finger is not limiting, and the extension of the index finger may be identified in other manners. In addition, the pointing object is not limited to an index finger or other fingers and a pointing pen, as other pointing objects are also suitable for use with the various embodiments of the present disclosure.
In order to avoid misrecognition and resource waste, the recognition of the first gesture may be started after the intelligent terminal is set to the image translation mode.
In addition, when the computing device does not recognize the first gesture indicative of a translation action, other gesture recognition processes may be performed or recognition may be stopped directly.
Next, as described previously, because the user may want to translate only a portion of text, rather than a full page of text, it is necessary to determine which portion of text the user is interested in.
It may be assumed that the area the user sweeps over with the pointing object is the translation area of interest to the user. Thus, after the computing device recognizes the first gesture in the plurality of first images, the start point and end point of the area of the plurality of first images pointed to by the first gesture (also referred to as the pointed area) may also be determined, in order to determine the translation area of interest to the user.
FIG. 4A illustrates a flowchart of a process 400 by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture, according to an embodiment of the present application.
As shown in Fig. 4A, in one embodiment, obtaining the recognition determination that the computing device recognized the first gesture in the plurality of first images and the information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture includes performing the following steps by the computing device.
Here, if the computing device determines that the intelligent terminal has set the image translation mode, an initialization may be performed, for example, setting the state of the first gesture to the first state. The first state may be a set "not in focus" state to indicate that a new gesture recognition process is currently in progress, avoiding conflicts with previous gesture recognition processes.
At step 410, a first gesture is identified in a plurality of first images.
Since the first gesture may be made by the user at the start point of the pointed area or at the end point of the pointed area, it is necessary to determine whether the first gesture at this time is at the start point or the end point. Specifically, in step 420, if the first gesture was not previously recognized (i.e., this is the first recognition) or the state of the first gesture is the first state (i.e., a new gesture recognition after initialization), the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first start point, and the state of the first gesture is set to the second state.
Here, the position pointed by the first posture of the first state may be a center position of the circular region.
Here, the second state may be an "acquire focus" state.
In step 430, if the first gesture is recognized again in the plurality of first images and the state of the first gesture is the second state, the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first end point.
Here, the position pointed by the first posture of the second state may be a center position of the circular region.
In this way, the start point and the end point of the pointed area can be determined by determining whether the twice recognized first gesture is in the first state or the second state.
At step 440, the state of the first gesture is set to the first state, indicating that this round of the recognition process has been completed and a new round of recognition can begin.
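Steps 410 to 440 amount to a simple two-state machine; a minimal sketch under the naming assumptions above ("not in focus" as the first state, "acquire focus" as the second state) is given below. The class and method names are illustrative.

```python
NOT_IN_FOCUS = "not in focus"    # first state
ACQUIRE_FOCUS = "acquire focus"  # second state

class PointedRegionTracker:
    """Records the first start point on one recognised gesture and the first end
    point on the next, then resets to the first state (steps 410-440)."""

    def __init__(self):
        self.state = NOT_IN_FOCUS
        self.start_point = None

    def on_gesture(self, pointed_xy):
        if self.state == NOT_IN_FOCUS:
            # Step 420: record the first start point and enter the second state.
            self.start_point = pointed_xy
            self.state = ACQUIRE_FOCUS
            return None
        # Steps 430-440: record the first end point and return to the first state.
        start, end = self.start_point, pointed_xy
        self.state = NOT_IN_FOCUS
        self.start_point = None
        return start, end
```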
Next, since the resolutions of the first, second, and third images differ, their image coordinate systems also differ. The coordinates of the start point and end point of the pointed area recognized in the first image therefore differ from the coordinates of the corresponding positions in the corresponding third image (preview image) and in the second image of the second resolution (the image photographed at high resolution), so coordinate conversion is necessary.
For example, denote the image resolution of the third image (preview image) as w_p x h_p and the coordinates of a point in its coordinate system as (x_p, y_p); denote the image resolution of the first image (the thumbnail obtained by scaling the third image) as w_r x h_r and its coordinates as (x_r, y_r); and denote the image resolution of the second image (the image photographed at high resolution) as w_t x h_t and its coordinates as (x_t, y_t). For the same camera, the coordinates of corresponding points in the three coordinate systems satisfy:
x_p / w_p = x_r / w_r = x_t / w_t, y_p / h_p = y_r / h_r = y_t / h_t (Equation 1)
Specifically, in step 450, the coordinates of the first start point and the first end point are converted into the coordinates of the second start point and the second end point in the second image, respectively, according to the coordinate mapping relationship (e.g., Equation 1 above) between the plurality of first images of the first resolution and the second image of the second resolution.
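A sketch of the conversion in step 450, assuming the simple proportional relation of Equation 1 between the thumbnail (first resolution) and the photographed image (second resolution); the function name and example coordinates are illustrative.

```python
def map_point(point, src_size, dst_size):
    """Map (x, y) from an image of resolution src_size = (w, h) to the
    corresponding point in an image of resolution dst_size, per Equation 1."""
    x, y = point
    (w_src, h_src), (w_dst, h_dst) = src_size, dst_size
    return x * w_dst / w_src, y * h_dst / h_src

# Example: map a first start point from a 320 x 240 thumbnail
# to the corresponding second start point in a 1920 x 1080 photograph.
second_start = map_point((100, 60), (320, 240), (1920, 1080))  # -> (600.0, 270.0)
```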
In this way, OCR and translation can subsequently be performed on the portion of the second-resolution (i.e., high-resolution) image that the user is actually interested in translating (i.e., the region of interest), based on the coordinates of the second start point and the second end point in the second image.
In step 460, the coordinates of the second start point and the second end point are sent to the intelligent terminal as the start point and the end point of the area pointed to by the first gesture.
Of course, in another embodiment, the coordinates of the first start point and the first end point in the first images (thumbnails) may instead be transmitted to the intelligent terminal, and the intelligent terminal converts them into the coordinates of the second start point and the second end point in the second image according to the coordinate mapping relationship (e.g., Equation 1 above) between the plurality of first images of the first resolution and the second image of the second resolution. To cover both embodiments (transmitting the coordinates of the first start point and first end point, or transmitting the coordinates of the second start point and second end point, to the intelligent terminal), these coordinates are collectively referred to as information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture. The former embodiment is described below in conjunction with Fig. 4B.
FIG. 4B illustrates a flowchart of another process 400' by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture, according to an embodiment of the present application.
At step 410', a first gesture is identified in the plurality of first images.
In step 420', if the first gesture was not previously recognized or the state of the first gesture is the first state, the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first start point, and the state of the first gesture is set to the second state.
At step 430', if the first gesture is recognized again in the plurality of first images and the state of the first gesture is the second state, the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first endpoint.
At step 440', the state of the first gesture is set to a first state.
At step 450', the coordinates of the first start point and the first end point are sent to the intelligent terminal.
Next, returning to fig. 2, at step 230, the intelligent terminal receives a recognition determination that the computing device recognized the first gesture in the plurality of first images and information related to the start and end points of the region of the plurality of first images pointed to by the first gesture.
The information related to the start point and end point of the region pointed to by the first gesture may include either the untransformed coordinates of the first start point and first end point, or the coordinates of the second start point and second end point already converted into the second image. Note that if this information includes the untransformed coordinates of the first start point and first end point, the intelligent terminal converts them, according to the coordinate mapping relationship between the plurality of first images of the first resolution and the second image of the second resolution, into the coordinates of the second start point and the second end point in the second image as the start point and end point of the region pointed to by the first gesture.
Next, at step 240, the intelligent terminal is triggered to shoot a second image at a second resolution and to intercept the region of interest in the second image based on the information related to the start point and end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution.
Specifically, after receiving a recognition determination that the computing device recognizes the first gesture in the plurality of first images and information related to a start point and an end point of an area pointed to by the first gesture in the plurality of first images, by the intelligent terminal: determining a region of interest (Region Of Interest, ROI) from a start point and an end point of the region pointed to by the first gesture; the region of interest is intercepted at the second resolution and sent to the computing device for recognition and translation by the computing device based on the region of interest.
In one embodiment, determining the region of interest from the start point and the end point of the region pointed to by the first gesture comprises: taking a starting point and an end point of a region pointed by the first gesture as an upper left point and a lower right point of a rectangular frame of the region of interest respectively; or the starting point and the end point of the area pointed by the first gesture are respectively used as the left lower point and the right lower point of the rectangular frame of the area of interest, and the rectangular frame of the area of interest has a preset height.
Fig. 5A shows a schematic diagram of the upper left and lower right points of the rectangular box with the start and end points of the region pointed to by the first gesture as the region of interest, respectively, according to an embodiment of the present application.
As shown in Fig. 5A, it is assumed that the user is accustomed, or has been taught, to slide at least one finger from a start point at the upper-left corner of the region of interest to an end point at its lower-right corner, so that the start point and end point are used as the upper-left and lower-right points of the rectangular box of the region of interest, respectively.
Fig. 5B shows a schematic diagram of the starting point and the ending point of the region pointed to by the first gesture as the lower left and lower right points of the rectangular box of the region of interest (the rectangular box of the region of interest has a predetermined height), respectively, according to an embodiment of the present application.
As shown in Fig. 5B, it is assumed that the user is accustomed, or has been taught, to slide at least one finger from a start point at the lower-left corner of the region of interest to an end point at its lower-right corner, so that the start point and end point are used as the lower-left and lower-right points of the rectangular box of the region of interest, respectively, with the rectangular box having a predetermined height. The predetermined height may cover one or several text lines.
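A sketch of the two ways of forming the region-of-interest rectangle from the start and end points (Figs. 5A and 5B); the predetermined height used in the second way is an assumed example value.

```python
def roi_from_diagonal(start, end):
    """Fig. 5A: start and end are the upper-left and lower-right corners."""
    (x1, y1), (x2, y2) = start, end
    return x1, y1, x2, y2  # left, top, right, bottom

def roi_from_baseline(start, end, height_px=80):
    """Fig. 5B: start and end are the lower-left and lower-right corners; the box
    extends upward by a predetermined height (height_px is an assumed example,
    e.g. covering one or a few text lines)."""
    (x1, y1), (x2, y2) = start, end
    bottom = max(y1, y2)
    return x1, max(0, bottom - height_px), x2, bottom
```

The resulting (left, top, right, bottom) box can then be used to crop the region of interest from the second image, e.g. roi = second_image[top:bottom, left:right] for a NumPy-style array.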
Of course, the above two ways are only exemplary ways of determining the region of interest through the start point and the end point, but the disclosure is not limited thereto, and the region of interest may be determined in other ways, and the shape of the region of interest may be varied.
Referring back to Fig. 2, at step 250, the pointed region (i.e., the region of interest) in the second image is sent by the intelligent terminal to the computing device, so that the computing device can recognize and translate the text contained in the region of interest and the intelligent terminal can receive the resulting translation.
Thus, by transmitting the low-resolution first images and using them to identify the pointed region, a translation of only the part of the high-resolution second image related to the pointed region is obtained. This reduces the amount of transmitted data, reduces the amount of data used to identify the pointed region, translates only the pointed region in a targeted manner, and ensures the accuracy of the translation of the pointed region.
In the process of recognizing and translating the region of interest, the computing device performs the following steps: recognizing text from the region of interest; translating the recognized text; and sending the resulting translation to the intelligent terminal.
The intelligent terminal then renders and broadcasts the translation.
Fig. 6 illustrates a flowchart of a process 600 for rendering a broadcast of a translation by a smart terminal according to an embodiment of the present application. Fig. 7 is a schematic diagram illustrating rendering and broadcasting of translations by an intelligent terminal according to an embodiment of the present application.
In step 601, the coordinates of the first start point and the first end point in the plurality of first images are respectively converted into the coordinates of the third start point and the third end point in the image of the third resolution (two black dots as shown in fig. 7) according to the coordinate mapping relation between the plurality of first images of the first resolution and the plurality of third images of the third resolution.
Here, as described above, the third image is a preview image, and the user desires to render the translation of the region of interest in the preview image viewable on the smart glasses, so that the coordinates of the region of interest in the preview image are obtained.
In step 602, the third start point, the third end point, and a line connecting them (a diagonal of the region of interest as shown in Fig. 7) are marked in the third image, and the region jointly determined by the third start point and the third end point is rendered as the translation region, which contains the text to be translated.
In step 603, a translation result box (such as the translation result box shown in fig. 7) is superimposed on the translation region in the third image for preview. As shown in fig. 7, the translation result box may be superimposed in a popup form using a popup image area.
In step 604, the translated version is displayed in a translation results box.
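A sketch of steps 601 to 604, assuming OpenCV-style drawing on the preview (third) image with integer pixel coordinates; the pop-up box layout (its height and text placement) is an assumption for illustration only.

```python
import cv2

def render_translation(preview, third_start, third_end, translation, box_height=60):
    """Mark the third start/end points (step 602), connect them with a line, and
    overlay a simple translation result box above the translation region with the
    translated text inside it (steps 603-604)."""
    cv2.circle(preview, third_start, 5, (0, 0, 0), -1)
    cv2.circle(preview, third_end, 5, (0, 0, 0), -1)
    cv2.line(preview, third_start, third_end, (0, 0, 0), 1)
    x1 = min(third_start[0], third_end[0])
    x2 = max(third_start[0], third_end[0])
    y1 = min(third_start[1], third_end[1])
    top = max(0, y1 - box_height)
    cv2.rectangle(preview, (x1, top), (x2, y1), (255, 255, 255), -1)
    cv2.putText(preview, translation, (x1 + 5, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)
    return preview
```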
Under the condition that synchronous voice broadcasting of the translation is needed, the intelligent terminal can further perform the following steps.
In step 605, the translation is initialized to an un-broadcast state. The un-broadcast state may cause the translated text to be rendered in a distinct text format, such as gray.
In step 606, corresponding voice audio is obtained from the translated version.
In step 607, the length of the translation and the length of the voice audio in the un-broadcast state are obtained.
In step 608, the voice audio is broadcast. The translation text can be used to call a TTS service (e.g., on the mobile phone or in the cloud) to generate the voice audio corresponding to the translation.
In step 609, the total duration of the voice audio currently being broadcast is obtained (denoted as t_s).
In step 610, the broadcast length of the translation (denoted as l_r) is calculated from the length of the translation in the un-broadcast state (denoted as l_c, e.g., the number of characters), the total duration of the voice audio (t_s), and the duration of the voice audio already broadcast (denoted as t_u). The calculation formula is as follows:
l_r = l_c x t_u / t_s (Equation 2)
In step 611, the portion of the translation of broadcast length l_r (such as the broadcast translation shown in Fig. 7) is rendered in synchronization with the currently broadcast voice audio, for example in a highlighted text color or in bold. The rest of the translation is the text yet to be broadcast, as shown in Fig. 7.
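A sketch of the synchronization in steps 609 to 611, using the proportional formula above: the number of translation characters rendered as already broadcast grows with the elapsed audio time. The function name and example values are illustrative.

```python
def split_broadcast(translation, t_s_ms, t_u_ms):
    """Return (broadcast_part, pending_part) of the translation text, where the
    broadcast length l_r = l_c * t_u / t_s (l_c = number of characters,
    t_s = total audio duration, t_u = duration already broadcast)."""
    l_c = len(translation)
    l_r = min(l_c, int(l_c * t_u_ms / t_s_ms))
    return translation[:l_r], translation[l_r:]

# Example: 40% of the audio has played, so roughly 40% of the characters
# are rendered in the highlighted "broadcast" style.
done, pending = split_broadcast("example translated sentence", 5000, 2000)
```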
At step 612, if the voice audio broadcast is complete, the broadcast is ended. At this time, it may also be set that the translation result box automatically disappears after the voice broadcast is completed. Of course, the display of the translation result box may be ended in advance by other flow control.
In step 613, if the voice audio announcement is interrupted due to the recognition of the first gesture of the pointing object, the announcement is ended. At this point a new round of the first gesture recognition process of the pointing object may be started.
Therefore, the translation result box has the advantage that it focuses only on the translation region the user is interested in, so less content is displayed as the translation result; and voice broadcasting handles only the translation of the region of interest selected by the user rather than the translation of the whole image content, which avoids a broadcast-selection operation by the user. Moreover, according to the embodiment of the application, the voice broadcast is synchronized with the rendering of the translation, so that the user can follow the translation more intuitively.
Thus, according to the embodiment of the application, the amount of image data transmitted by interconnection between devices can be reduced by transmitting a low-resolution image and recognizing a specific gesture in the low-resolution image, and the amount of image data transmitted by interconnection between devices can be reduced by acquiring a translation region in a high-resolution image as a region of interest to transmit according to the start point and the end point of the specific gesture recognized in the low-resolution image. Further, OCR and translation can be performed from the photographed image of the high-resolution region of interest, and definition and translation accuracy of the translated region are ensured. Only translating the region of interest reduces the amount of OCR and translation computation and the amount of data returned by the interconnection. The translation result of the interested region is the translation content focused by the user, and the intelligent glasses terminal can directly conduct rendering and voice synthesis broadcasting after acquiring the translation text, so that the operation of selecting broadcasting by the user is reduced.
Fig. 8 illustrates a flowchart of an image processing method 800 at a computing device according to an embodiment of the application.
The computing device includes, for example, a cell phone, a tablet computer, a cloud server, and the like.
As shown in fig. 8, the image processing method 800 includes: step 810, receiving a plurality of first images of a first resolution sent from the intelligent terminal; step 820, recognizing a first gesture in the plurality of first images and determining, according to the recognized first gesture, a start point and an end point of the region pointed to by the first gesture in the plurality of first images; step 830, triggering the intelligent terminal to shoot a second image of a second resolution and sending the start point and the end point of the region pointed to by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest of the second resolution according to the start point and the end point of the region pointed to by the first gesture, wherein the region of interest is commonly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution; and step 840, receiving the region of interest from the intelligent terminal to perform character recognition and translation on the region of interest, wherein the second resolution is higher than the first resolution.
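A minimal sketch of how method 800 might run on the computing device is given below. The terminal interface, gesture recognizer, OCR engine and translator are hypothetical placeholders (the application does not name concrete components), and the two-state handling of the start and end points follows the description of the first-gesture states.

```python
def image_processing_method_800(terminal, recognizer, ocr, translator):
    state, start_pt = "first", None                       # "first" state: waiting for a start point
    for frame in terminal.receive_low_res_frames():       # step 810: first images at the first resolution
        hit = recognizer.detect_first_gesture(frame)      # step 820: recognize the first gesture
        if hit is None:
            continue
        if state == "first":
            start_pt, state = hit.point, "second"         # record the start point of the pointed region
        else:
            end_pt, state = hit.point, "first"            # record the end point of the pointed region
            terminal.trigger_high_res_capture()           # step 830: shoot the second image
            terminal.send_region_endpoints(start_pt, end_pt)
            roi = terminal.receive_region_of_interest()   # step 840: ROI at the second resolution
            return translator.translate(ocr.recognize(roi))
```

Whether the coordinate mapping to the second resolution is done on the computing device or on the intelligent terminal is an implementation choice; both variants are covered by the claims below.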
Thus, according to the embodiment of the application, the amount of image data transmitted over the interconnection between devices is reduced by transmitting low-resolution images and recognizing the specific gesture in them, and by transmitting only the translation region of the high-resolution image, captured as a region of interest according to the start point and end point of the gesture recognized in the low-resolution images. Further, OCR and translation are performed on the captured high-resolution region of interest, which preserves the definition and translation accuracy of the translated region. Translating only the region of interest reduces the OCR and translation computation and the amount of data returned over the interconnection. The translation result of the region of interest is exactly the content the user cares about, and the smart glasses terminal can render it and broadcast it through speech synthesis directly after receiving the translated text, which spares the user a broadcast selection operation.
Fig. 9 shows a block diagram of a smart terminal 900 according to an embodiment of the present application.
Here, the smart terminal 900 may include smart glasses or the like.
As shown in fig. 9, the intelligent terminal 900 includes a photographing device 910 and a transceiving device 920. The photographing device 910 obtains a plurality of first images of a first resolution, and the transceiving device 920 transmits the plurality of first images to a computing device connected to the intelligent terminal 900.
The transceiving device 920 receives the recognition determination that the computing device recognized the first gesture in the plurality of first images, together with information related to the start point and the end point of the region pointed to by the first gesture in the plurality of first images; the photographing device 910 is triggered to capture a second image at a second resolution, and the region of interest is cut out of the second image based on the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution.
The transceiving device 920 sends the region of interest to the computing device and receives the translated text that the computing device recognized and translated from the text contained in the region of interest.
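A sketch of the terminal-side cropping is given below; the proportional coordinate mapping and the NumPy-style slicing are illustrative assumptions, since the start and end points may arrive either in first-image coordinates or already mapped into the second image.

```python
import numpy as np

def crop_region_of_interest(second_image: np.ndarray,
                            start_pt, end_pt,
                            first_res, second_res) -> np.ndarray:
    """Cut the region of interest out of the high-resolution second image.

    start_pt / end_pt: (x, y) points of the pointed region in first-image coordinates
    first_res / second_res: (width, height) of the first and second resolutions
    """
    sx = second_res[0] / first_res[0]
    sy = second_res[1] / first_res[1]
    (x1, y1), (x2, y2) = start_pt, end_pt
    # Treat the two mapped points as opposite corners of the ROI rectangle.
    left, right = sorted((int(x1 * sx), int(x2 * sx)))
    top, bottom = sorted((int(y1 * sy), int(y2 * sy)))
    return second_image[top:bottom, left:right]
```

The claims also allow the two points to be used as the lower-left and lower-right corners of a rectangle with a predetermined height, in which case the vertical extent comes from that height rather than from the points themselves.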
Fig. 10 illustrates a block diagram of a computing device 1000 according to an embodiment of the application. Here, the computing device 1000 may include a mobile terminal such as a cell phone, tablet computer, or the like.
The computing device 1000 includes: a receiving device 1010 configured to receive a plurality of first images of a first resolution from the intelligent terminal; a recognition device 1020 configured to recognize a first gesture in the plurality of first images and to determine a start point and an end point of the region pointed to by the first gesture in the plurality of first images based on the recognized first gesture; and a transmitting device 1030 configured to trigger the intelligent terminal to shoot a second image of a second resolution and to transmit to the intelligent terminal the start point and the end point of the region pointed to by the first gesture, so that the intelligent terminal obtains a region of interest of the second resolution from the start point and the end point of the region pointed to by the first gesture, the region of interest being commonly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution. The region of interest is then received from the intelligent terminal for character recognition and translation, wherein the second resolution is higher than the first resolution.
Thus, by transmitting the low-resolution first images, identifying the pointed region in them, and translating only the part of the high-resolution second image related to that region, the amount of transmitted data and the amount of data processed for identifying the pointed region are both reduced, while translating only the pointed region in a targeted manner ensures the accuracy of its translation.
FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the application.
The electronic device may include a processor (H1); a storage medium (H2) coupled to the processor (H1) and having stored therein computer executable instructions for performing the steps of the methods of embodiments of the present application when executed by the processor.
The processor (H1) may include, but is not limited to, for example, one or more processors or microprocessors or the like.
The storage medium (H2) may include, for example, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, and computer storage media (e.g., a hard disk, a floppy disk, a solid-state disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disc, etc.).
In addition, the electronic device may include, but is not limited to, a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and an input/output device (H6) (e.g., keyboard, mouse, speaker, etc.), among others.
The processor (H1) may communicate with external devices (H5, H6, etc.) via a wired or wireless network (not shown) through an I/O bus (H4).
The storage medium (H2) may also store at least one computer executable instruction for performing the functions and/or steps of the methods in the embodiments described in the present technology when executed by the processor (H1).
In one embodiment, the at least one computer-executable instruction may also be compiled or otherwise formed into a software product in which one or more computer-executable instructions, when executed by a processor, perform the functions and/or steps of the methods described in the embodiments of the technology.
FIG. 12 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the application.
As shown in fig. 12, the computer-readable storage medium 1220 has instructions stored thereon, such as computer-readable instructions 1210. When executed by a processor, the computer-readable instructions 1210 may perform the various methods described above. Computer-readable storage media include, but are not limited to, volatile memory and/or nonvolatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. For example, the computer-readable storage medium 1220 may be connected to a computing device such as a computer, and the computing device may then run the computer-readable instructions 1210 stored on the computer-readable storage medium 1220 to perform the various methods described above.
Note that advantages, effects, and the like mentioned in this disclosure are merely examples and are not to be construed as necessarily essential to the various embodiments of the application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words meaning "including but not limited to," and may be used interchangeably therewith. The terms "or" and "and" as used herein refer to, and may be used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and may be used interchangeably with, the phrase "such as, but not limited to."
The step flow diagrams in this disclosure and the above method descriptions are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. The order of steps in the above embodiments may be performed in any order, as will be appreciated by those skilled in the art. Words such as "thereafter," "then," "next," and the like are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of these methods. Furthermore, any reference to an element in the singular, for example, using the articles "a," "an," or "the," is not to be construed as limiting the element to the singular.
In addition, the steps and means in the various embodiments herein are not limited to practice in a certain embodiment, and indeed, some of the steps and some of the means associated with the various embodiments herein may be combined according to the concepts of the present application to contemplate new embodiments, which are also included within the scope of the present application.
The individual operations of the above-described method may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software components and/or modules including, but not limited to, circuitry for hardware, an Application Specific Integrated Circuit (ASIC), or a processor.
The various illustrative logical blocks, modules, and circuits described herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a Field Programmable Gate Array (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The methods disclosed herein include one or more actions for implementing the described methods. The methods and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and/or use of specific actions may be modified without departing from the scope of the claims.
The functions described above may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as instructions on a tangible computer-readable medium. A storage medium may be any available tangible medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Accordingly, the present disclosure may also include a computer program product, wherein the computer program product may perform the methods, steps and operations presented herein. For example, such a computer program product may be a computer software package, computer code instructions, a computer-readable tangible medium having computer instructions tangibly stored (and/or encoded) thereon, the instructions being executable by a processor to perform operations described herein. The computer program product may comprise packaged material.
Furthermore, modules and/or other suitable means for performing the methods and techniques described herein may be downloaded and/or otherwise obtained by the user terminal and/or base station as appropriate. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Or the various methods described herein may be provided via a storage component such that the user terminal and/or base station can obtain the various methods when coupled to or providing the storage component to the device. Further, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (14)

1. An image processing method of an intelligent terminal includes
Obtaining, by an intelligent terminal, a plurality of first images of a first resolution;
transmitting the plurality of first images to a computing device connected to the intelligent terminal;
Receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture;
Triggering the intelligent terminal to shoot a second image with a second resolution, and intercepting a region of interest in the second image according to information related to a start point and an end point of the region pointed by the first gesture, wherein the second resolution is higher than the first resolution;
And the intelligent terminal sends the region of interest to the computing device so as to receive translated text which is recognized and translated by the computing device and contained in the region of interest.
2. The method of claim 1, wherein the first gesture comprises at least one finger extending and hovering over a fixed location of the plurality of first images beyond a first time threshold.
3. The method of claim 1, wherein the first gesture comprises a tip of a pointing pen resting on a fixed location of the plurality of first images for more than a predetermined time threshold.
4. The method of claim 1, wherein the receiving the recognition determination that the computing device recognized the first gesture in the plurality of first images and the information related to the start and end points of the region of the plurality of first images pointed to by the first gesture comprises:
By the computing device:
identifying a first gesture in the plurality of first images;
If the first gesture was not previously recognized or the state of the first gesture is a first state, recording the position pointed to by the first gesture in the plurality of first images at this time as a first starting point, and setting the state of the first gesture at this time to a second state;
If the first gesture is recognized again in the plurality of first images and the state of the first gesture is a second state, recording that the position pointed by the first gesture in the plurality of first images at the moment is a first end point;
setting the state of the first gesture as a first state;
Converting coordinates of the first starting point and the first end point into coordinates of a second starting point and a second end point in the second image respectively according to coordinate mapping relations between the plurality of first images with the first resolution and the second image with the second resolution;
sending coordinates of the second starting point and the second ending point to the intelligent terminal as the starting point and the ending point of the area pointed by the first gesture;
And the intelligent terminal:
determining the region of interest from the start point and the end point of the pointed region, wherein,
Taking the starting point and the end point of the area pointed by the first gesture as the upper left point and the lower right point of the rectangular frame of the region of interest respectively; or the starting point and the end point of the area pointed by the first gesture are respectively used as a left lower point and a right lower point of the rectangular frame of the area of interest, and the rectangular frame of the area of interest has a preset height;
intercepting the region of interest at the second resolution and transmitting it to the computing device for the computing device to perform recognition and translation according to the region of interest.
5. The method of claim 4, wherein the determining the region of interest from the start and end points of the region pointed to by the first gesture comprises:
The received recognition determination that the computing device recognized the first gesture in the plurality of first images and the information related to the start point and the end point of the region pointed to by the first gesture in the plurality of first images include:
By the computing device:
identifying a first gesture in the plurality of first images;
If the first gesture was not previously recognized or the state of the first gesture is a first state, recording the position pointed to by the first gesture in the plurality of first images at this time as a first starting point, and setting the state of the first gesture at this time to a second state;
If the first gesture is recognized again in the plurality of first images and the state of the first gesture is a second state, recording that the position pointed by the first gesture in the plurality of first images at the moment is a first end point;
setting the state of the first gesture as a first state;
Sending the coordinates of the first starting point and the first ending point to the intelligent terminal;
And the intelligent terminal:
Converting coordinates of the first starting point and the first end point into coordinates of a second starting point and a second end point in the second image as the starting point and the end point of the area pointed by the first gesture respectively according to the coordinate mapping relation between the plurality of first images with the first resolution and the second image with the second resolution;
determining the region of interest from the start point and the end point of the pointed region, wherein,
Taking the starting point and the end point of the area pointed by the first gesture as the upper left point and the lower right point of the rectangular frame of the region of interest respectively; or the starting point and the end point of the area pointed by the first gesture are respectively used as a left lower point and a right lower point of the rectangular frame of the area of interest, and the rectangular frame of the area of interest has a preset height;
intercepting the region of interest at the second resolution and transmitting it to the computing device for the computing device to perform recognition and translation according to the region of interest.
6. The method of claim 1, further comprising,
Acquiring, by the intelligent terminal, a plurality of third images for previewing in real time with a third resolution;
The plurality of third images are scaled down into the plurality of first images,
Wherein the third resolution is greater than the first resolution and less than the second resolution.
7. The method of claim 6, wherein the intelligent terminal further performs the steps of:
Converting coordinates of the first start point and the first end point in the plurality of first images into coordinates of a third start point and a third end point in the image of the third resolution respectively according to a coordinate mapping relation between the plurality of first images of the first resolution and the plurality of third images of the third resolution;
Marking a third starting point and a third ending point and a connecting line between the third starting point and the third ending point in the third image, and rendering a region jointly determined by the third starting point and the third ending point as a translation region;
Superimposing a translation result box on the translation region in the third image for previewing;
and displaying the translated version in the translation result box.
8. The method of claim 7, wherein the intelligent terminal further performs the steps of:
Initializing the translated text into an un-broadcast state;
acquiring corresponding voice audio according to the translated version;
acquiring the length of the translation in the non-broadcast state and the length of the voice audio;
Broadcasting the voice audio;
acquiring the duration of the voice audio which is currently broadcasted;
Calculating the broadcasted length of the translation according to the length of the translation in the non-broadcasted state, the length of the voice audio and the duration of the current broadcasted voice audio;
Rendering the translated text with the broadcasted length synchronously with the voice audio broadcasted at present;
if the voice audio broadcasting is completed, ending the broadcasting;
And if the voice audio broadcasting is interrupted due to the fact that the first gesture of the pointing object is recognized, ending the broadcasting.
9. An intelligent terminal comprises a shooting device and a receiving and transmitting device,
The shooting device obtains a plurality of first images with first resolution, and the receiving and transmitting device transmits the plurality of first images to computing equipment connected with the intelligent terminal;
The receiving and transmitting device receives the identification judgment that the computing device identifies the first gesture in the plurality of first images and the information related to the starting point and the ending point of the area pointed by the first gesture in the plurality of first images, the shooting device triggers shooting a second image with a second resolution, and intercepts the region of interest in the second image according to the information related to the starting point and the ending point of the area pointed by the first gesture, wherein the second resolution is higher than the first resolution;
The transceiver device sends the region of interest to the computing device so as to receive translated text which is recognized and translated by the computing device from text contained in the region of interest.
10. An image processing method includes
Receiving a plurality of first images with a first resolution sent from the intelligent terminal;
Identifying a first gesture in the plurality of first images and determining a start point and an end point of a region pointed to by the first gesture in the plurality of first images based on the identified first gesture; triggering the intelligent terminal to shoot a second image with a second resolution and sending a start point and an end point of the area pointed by the first gesture to the intelligent terminal so that the intelligent terminal obtains a region of interest with the second resolution according to the start point and the end point of the area pointed by the first gesture, wherein the region of interest is commonly determined by mapping the start point and the end point of the area pointed by the first gesture in the first image into the second image with the second resolution;
and receiving the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
11. A computing device, comprising
A receiving device configured to receive a plurality of first images of a first resolution transmitted from the intelligent terminal;
A recognition device configured to recognize a first gesture in the plurality of first images and determine a start point and an end point of a region pointed to by the first gesture in the plurality of first images according to the recognized first gesture;
A transmitting device configured to trigger the intelligent terminal to shoot a second image with a second resolution, and transmit a start point and an end point of the area pointed by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest with the second resolution according to the start point and the end point of the area pointed by the first gesture, wherein the region of interest is commonly determined by mapping the start point and the end point of the area pointed by the first gesture in the first image into the second image with the second resolution;
and receiving the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
12. An electronic device, comprising:
A memory for storing instructions;
A processor for reading instructions in said memory and performing the method of any of claims 1-8, 10.
13. A non-transitory storage medium having instructions stored thereon,
Wherein the instructions, when read by a processor, cause the processor to perform the method of any one of claims 1-8, 10.
14. A computer program product comprising computer instructions,
Wherein the instructions, when read by a processor, cause the processor to perform the method of any one of claims 1-8, 10.
CN202410236725.4A 2024-02-29 2024-02-29 Image processing method, intelligent terminal, device, medium and program product Pending CN118116022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410236725.4A CN118116022A (en) 2024-02-29 2024-02-29 Image processing method, intelligent terminal, device, medium and program product

Publications (1)

Publication Number Publication Date
CN118116022A true CN118116022A (en) 2024-05-31

Family

ID=91210127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410236725.4A Pending CN118116022A (en) 2024-02-29 2024-02-29 Image processing method, intelligent terminal, device, medium and program product

Country Status (1)

Country Link
CN (1) CN118116022A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination