CN118116022A - Image processing method, intelligent terminal, device, medium and program product - Google Patents

Image processing method, intelligent terminal, device, medium and program product

Info

Publication number
CN118116022A
Authority
CN
China
Prior art keywords
gesture
resolution
images
region
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410236725.4A
Other languages
Chinese (zh)
Inventor
王淼军
郝冬宁
陈芳
寸毛毛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Xingji Meizu Group Co ltd
Original Assignee
Hubei Xingji Meizu Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Xingji Meizu Group Co ltd filed Critical Hubei Xingji Meizu Group Co ltd
Priority to CN202410236725.4A priority Critical patent/CN118116022A/en
Publication of CN118116022A publication Critical patent/CN118116022A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G06V30/413 - Classification of content, e.g. text, photographs or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/94 - Hardware or software architectures specially adapted for image or video understanding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/146 - Aligning or centring of the image pick-up or image-field
    • G06V30/147 - Determination of region of interest

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Provided are an image processing method of an intelligent terminal, an image processing method of a computing device, an electronic device, a non-transitory storage medium, and a computer program product. The image processing method of the intelligent terminal comprises the following steps: obtaining, by an intelligent terminal, a plurality of first images of a first resolution; transmitting the plurality of first images to a computing device connected to the intelligent terminal; receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; triggering the intelligent terminal to shoot a second image of a second resolution, and intercepting a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and sending, by the intelligent terminal, the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.

Description

Image processing method, intelligent terminal, device, medium and program product
Technical Field
The present application relates to the field of intelligent terminals, and more particularly, to an image processing method of an intelligent terminal, an image processing method of a computing device, an electronic device, a non-transitory storage medium, and a computer program product.
Background
Conventional mobile smart devices are typically centered on the mobile phone. With the development of the Internet of Everything, mobile smart devices are gradually transitioning to wearable smart devices. Smart glasses, as one type of wearable device, provide many smart functions, such as immersive visual rendering (Virtual Reality (VR), Augmented Reality (AR), Mixed Reality (MR), etc.), smart travel, smart navigation, and photographing translation. Compared with a mobile phone, smart glasses can conveniently acquire images according to the face orientation of the wearer. For example, for photographing translation, it is more convenient to use smart glasses than a mobile phone.
Disclosure of Invention
According to one aspect of the present application, there is provided an image processing method of an intelligent terminal, including: obtaining, by an intelligent terminal, a plurality of first images of a first resolution; transmitting the plurality of first images to a computing device connected to the intelligent terminal; receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; triggering the intelligent terminal to shoot a second image of a second resolution, and intercepting a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and sending, by the intelligent terminal, the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.
According to another aspect of the present application, there is provided an intelligent terminal, including a photographing device and a transceiver device, wherein the photographing device obtains a plurality of first images of a first resolution, and the transceiver device transmits the plurality of first images to a computing device connected to the intelligent terminal; the transceiver device receives a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; the photographing device is triggered to shoot a second image of a second resolution and intercepts a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and the transceiver device sends the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.
According to another aspect of the present application, there is provided an image processing method including: receiving a plurality of first images of a first resolution sent from an intelligent terminal; recognizing a first gesture in the plurality of first images and determining a start point and an end point of a region of the plurality of first images pointed to by the first gesture based on the recognized first gesture; triggering the intelligent terminal to shoot a second image of a second resolution and sending the start point and the end point of the region pointed to by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest of the second resolution according to the start point and the end point of the region pointed to by the first gesture, wherein the region of interest is jointly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution; and receiving the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
According to another aspect of the present application, there is provided a computing device comprising: a receiving device configured to receive a plurality of first images of a first resolution transmitted from an intelligent terminal; a recognition device configured to recognize a first gesture in the plurality of first images and determine a start point and an end point of a region of the plurality of first images pointed to by the first gesture according to the recognized first gesture; and a transmitting device configured to trigger the intelligent terminal to shoot a second image of a second resolution and to transmit the start point and the end point of the region pointed to by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest of the second resolution according to the start point and the end point of the region pointed to by the first gesture, wherein the region of interest is jointly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution; and the computing device receives the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
According to another aspect of the present application, there is provided an electronic apparatus including: a memory for storing instructions; a processor for reading the instructions in the memory and performing a method according to an embodiment of the application.
According to another aspect of the application, there is provided a non-transitory storage medium having instructions stored thereon, wherein the instructions, when read by a processor, cause the processor to perform a method according to an embodiment of the application.
According to another aspect of the application, there is provided a computer program product comprising computer instructions, wherein the instructions, when read by a processor, cause the processor to perform a method according to an embodiment of the application.
In this way, by transmitting the low-resolution first images and using them to identify the pointed region, only the part of the high-resolution second image related to the pointed region is translated instead of the whole image. This reduces the amount of transmitted data, reduces the amount of data used to identify the pointed region, translates only the pointed region in a targeted manner, and ensures the accuracy of the translation of the pointed region.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings required for the embodiments or the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that a person of ordinary skill in the art may obtain other drawings from these drawings without inventive effort.
Fig. 1 illustrates a scene graph to which various embodiments in accordance with the application are applied.
Fig. 2 shows a flowchart of an image processing method of an intelligent terminal according to an embodiment of the present application.
Fig. 3 shows an exemplary diagram of a first gesture with a finger of a human right hand as a pointing object according to an embodiment of the present application.
FIG. 4A illustrates a flowchart of a process by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture, in accordance with an embodiment of the present application.
FIG. 4B illustrates a flowchart of another process by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture, in accordance with an embodiment of the present application.
Fig. 5A shows a schematic diagram of the upper left and lower right points of the rectangular box with the start and end points of the region pointed to by the first gesture as the region of interest, respectively, according to an embodiment of the present application.
Fig. 5B shows a schematic diagram of the starting point and the ending point of the region pointed to by the first gesture as the lower left and lower right points of the rectangular box of the region of interest (the rectangular box of the region of interest has a predetermined height), respectively, according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating a process of rendering a broadcast of a translation by an intelligent terminal according to an embodiment of the present application.
Fig. 7 is a schematic diagram illustrating rendering and broadcasting of translations by an intelligent terminal according to an embodiment of the present application.
FIG. 8 illustrates a flowchart of a method of image processing for a computing device according to an embodiment of the application.
Fig. 9 shows a block diagram of a smart terminal according to an embodiment of the present application.
FIG. 10 illustrates a block diagram of a computing device according to an embodiment of the application.
FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the application.
FIG. 12 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the application.
Detailed Description
Reference will now be made in detail to the present embodiments of the application, examples of which are illustrated in the accompanying drawings. While the application will be described in conjunction with the specific embodiments, it will be understood that it is not intended to limit the application to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the application as defined by the appended claims. It should be noted that the method steps described herein may be implemented by any functional block or arrangement of functions, and any functional block or arrangement of functions may be implemented as a physical entity or a logical entity, or a combination of both.
As a wearable device, smart glasses are designed to be fashionable and lightweight. The current touch-based physical interaction of smart glasses is inconvenient, and the computing power of the glasses themselves is limited, so inference computation of Artificial Intelligence (AI) models and the like cannot be executed directly on the glasses. In addition, considering the battery life of the smart glasses, many time-consuming and computation-heavy processes are generally completed by means of interconnection with external devices, such as interconnection of the smart glasses with an external mobile phone.
In the existing photographing-translation process based on interconnection of smart glasses and a mobile phone, the smart glasses acquire preview images captured by the glasses and display them in real time. When photographing translation is needed, the wearer aims at the object to be translated and taps a touch pad on the side of the smart glasses to trigger photographing; the smart glasses acquire the photographed image and transmit it to the mobile phone, and after the mobile phone performs image translation (including Optical Character Recognition (OCR) and text translation), the translation result is returned to the smart glasses. After the smart glasses receive the translation result, a Text-to-Speech (TTS) function is called to convert the text into speech, which is then broadcast by voice.
However, the existing whole photographing translation process has the following problems:
(1) Image scaling affects the translation effect. Existing photographing translation often transmits the whole image to the mobile phone over the interconnection for image translation, so the amount of transmitted data is large. To reduce the time consumed on the transmission link, the high-resolution image is sometimes scaled down, reducing the amount of transmitted data by lowering the resolution. However, reducing the resolution lowers the image definition, which affects the text-extraction result and thus the translation result.
(2) Redundant data is transmitted and translation backfill deforms the layout. In the prior art, the whole image is transmitted to the mobile phone for OCR text recognition and translation of the characters in the image, but in practice a user may only need to translate part of the text content in the image. Moreover, existing image translation involves post-processing in which the translation is backfilled into the image; after translation, the character lengths of the source language and the target language may not match, so backfilling the source-language text with the target-language translation deforms the layout.
(3) Translation speed is affected and the amount of returned data is large. Existing image translation performs OCR text recognition and translation on the whole image, so translation is slow and many translated text results are returned, making the amount of data fed back over the interconnection large.
(4) The user operation is complicated and time-consuming. Generally, after the glasses obtain the translation result, a cloud TTS service is called for text broadcasting; the existing approach displays the TTS texts as a list, and the user slides up and down on the touch pad to select what to broadcast. When the whole image contains a lot of text, the user has to find the text to be translated first and then broadcast it.
Aiming at the above shortcomings of the existing photographing-translation technology based on interconnection of smart glasses and a mobile phone, the present disclosure provides an image processing method of an intelligent terminal, comprising: obtaining, by an intelligent terminal, a plurality of first images of a first resolution; transmitting the plurality of first images to a computing device connected to the intelligent terminal; receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture; triggering the intelligent terminal to shoot a second image of a second resolution, and intercepting a region of interest in the second image according to the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution; and sending, by the intelligent terminal, the region of interest to the computing device so as to receive a translation of the text contained in the region of interest as recognized and translated by the computing device.
In this way, by transmitting the low-resolution first images and using them to identify the pointed region, only the part of the high-resolution second image related to the pointed region is translated instead of the whole image. This reduces the amount of transmitted data, reduces the amount of data used to identify the pointed region, translates only the pointed region in a targeted manner, and ensures the accuracy of the translation of the pointed region.
Fig. 1 illustrates a scene graph to which various embodiments in accordance with the application are applied.
As shown in fig. 1, the smart terminal 110 is, for example, smart glasses, on which a camera 111 for capturing an image is arranged. The computing device 120 is, for example, a mobile terminal such as a cell phone, tablet computer, or the like. In some embodiments, the computing device 120 may also be a cloud-side computer (e.g., a cloud-side server).
In applying various embodiments of the present application, the camera 111 of the intelligent terminal 110 is capable of acquiring successive preview images (e.g., third images as subsequently mentioned herein) of the medium 130 (e.g., paper, a page, an electronic screen, etc.). The successive preview images can be displayed in real time on the intelligent terminal (e.g., via VR, AR, etc., or on a glasses lens, etc.) at a preview resolution (e.g., a third resolution as later referred to herein, e.g., 1080 x 720). At the same time, the continuous preview images may be scaled down into continuous thumbnails (e.g., the first images referred to later herein, having the first resolution referred to later herein, e.g., 320 x 240) that can be transmitted to the computing device 120, so that the computing device 120 can recognize that a particular gesture (e.g., a translation gesture) of the pointing object (e.g., hand) 140 is present in the thumbnails, recognize the region of interest 131 to be translated that is pointed to by the pointing object, and inform the intelligent terminal 110 of information about the region of interest 131 to be translated.
The camera 111 of the intelligent terminal 110 is also capable of taking an image of the object to be translated (e.g., the region of interest 131 to be translated on the medium 130) at a second resolution mentioned later herein, e.g., 1920 x 1080 (this photographed image being, e.g., the second image mentioned later herein), and of cropping the part corresponding to the region of interest 131 in Fig. 1 from it for transmission to the computing device 120.
After performing the translation of that part of the image, including Optical Character Recognition (OCR) and text translation, the computing device 120 returns the translation result to the intelligent terminal 110.
After the translation result is received by the intelligent terminal 110, a Text-to-Speech (TTS) function of a cloud, a local module, or another device (e.g., the computing device 120) can be invoked to convert the translation result into speech; the translation result is then rendered and displayed on the intelligent terminal 110 with synchronous voice broadcasting.
Specific details of various embodiments of the disclosure are described in further detail below in conjunction with the figures and embodiments.
Fig. 2 shows a flowchart of an image processing method 200 of the intelligent terminal according to an embodiment of the present application.
Here, the smart terminal may include a wearable device with a camera or a video camera, such as smart glasses. Herein, smart glasses are described as an example. Related application programs can be installed on the intelligent terminal so as to perform functions of setting, displaying, photographing, transmitting, voice broadcasting and the like.
Here, the computing device may include a mobile terminal (or cloud server) with processing capabilities, such as a mobile phone, a tablet computer, etc., on which related applications may be installed for performing functions of image recognition, image text conversion, text translation, etc. In this context, a mobile phone is described as an example.
After a user wears an intelligent terminal such as smart glasses, a camera on the smart glasses may acquire images of the scene in front of the glasses. If the user is reading a foreign-language article, such as English, on a medium such as a book or tablet computer, the camera on the smart glasses may acquire a continuous sequence of images of the medium including the English article; and if the user points to a specific English sentence or paragraph on the medium with a finger or another pointing object such as a pointer pen, the camera may acquire images of the medium that include both the English article and the user's finger or pointer pen.
The preview attribute of the camera of the smart glasses and the photographing attribute of the camera of the smart glasses may be set on the smart terminal in advance, for example. Here, "preview" means that real-time images acquired by a camera are continuously acquired and continuously displayed as preview images, and can be regarded as displaying moving images. And "take a picture" refers to triggering the capture of (at least one) still image.
The preview attributes mainly include resolution of the preview image, sampling frame rate of the preview image, and the like. The photographing attribute mainly includes resolution of a photographed image, and the like.
Assuming the user sees an English word, sentence, or paragraph to be translated, the most convenient action for the user is to sweep across the English portion to be translated with a finger or a pointing object such as a pointer pen (e.g., from left to right). Thus, herein, it is desirable to be able to trigger OCR recognition and translation of that English portion by recognizing such a translation gesture of the pointing object in real time.
Herein, it is desirable to perform real-time recognition of a particular gesture, such as a translation gesture, at the computing device based on the preview image. If the resolution is set too low, the near-eye display on the smart glasses is not clear; but if the resolution is set too high, a large amount of data has to be transmitted to the computing device over the interconnection, slowing down transmission so that the frame rate required for translation-gesture recognition cannot be achieved. Balancing these considerations, the resolution of the preview image (e.g., the third resolution herein) is preferably set to, e.g., 1080 x 720 to meet the near-eye display definition.
The preview image (e.g., the third image herein) may be scaled before transmission to the computing device to obtain a low-resolution (e.g., the first resolution herein) thumbnail (e.g., the first image herein) for transmission, to reduce the amount of transmitted data and ensure a fast image-transmission frame rate. The resolution of the thumbnail only needs to meet the requirement of translation-gesture recognition, so in practice it can be set to 320 x 240; the transmission frame rate can then be 25 to 30 fps, and key-point detection and translation-gesture recognition can be completed normally.
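As an illustration of this scaling step, the following is a minimal sketch, assuming OpenCV-style image arrays and the example resolutions mentioned above (1080 x 720 preview, 320 x 240 thumbnail); the function name and constants are illustrative assumptions, not part of the present application.

```python
import cv2

THUMBNAIL_SIZE = (320, 240)  # first resolution (width, height); example value from the text

def make_thumbnail(preview_frame):
    """Downscale a preview frame (third image) into a thumbnail (first image)
    before sending it over the interconnection, reducing the transmitted data."""
    # INTER_AREA is the usual choice when shrinking an image.
    return cv2.resize(preview_frame, THUMBNAIL_SIZE, interpolation=cv2.INTER_AREA)
```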
Since OCR and translation etc. need to be performed based on a photographed image (e.g., the second image herein), a resolution as large as possible may be set to ensure sharpness for more accurate OCR; in practical use, the resolution of the photographed image may be set to 1920 x 1080 (e.g., the second resolution herein).
It can be seen that the second resolution of the photographed image is greater than the third resolution of the preview image, which is greater than the first resolution of the thumbnail of the preview image.
Thus, in one embodiment, before step 210, a plurality of third images for preview (e.g., preview images) are acquired in real time by the intelligent terminal at a third resolution; the plurality of third images are scaled down into a plurality of first images (e.g., thumbnails to be sent to the computing device for particular-gesture recognition), where the third resolution is greater than the first resolution and less than the second resolution.
The interconnection channel between the intelligent terminal and the computing device may be established when the intelligent terminal and the computing device are powered on, before step 210, before step 220, or when needed. The interconnection may include short-range wireless communication such as Bluetooth.
As shown in fig. 2, the image processing method 200 includes at least steps 210-250.
At step 210, a plurality of first images of a first resolution are obtained by a smart terminal.
As above, the plurality of first images may be scaled from a (dynamically continuous) plurality of third images as preview images, where scaling refers to reducing the resolution.
Of course, the preview image may also be directly taken as the first image without a scaling step, which may depend on the resolution of the preview image and the requirements of the transmission rate (e.g., the resolution of the preview image is already small, or the transmission bandwidth between the intelligent terminal and the computing device is large enough to quickly transmit a large number of preview images, etc.).
At step 220, a plurality of first images are sent to a computing device connected to the intelligent terminal.
Here, the computing resources and computing power of a computing device such as a mobile phone are generally higher than those of a smart terminal such as smart glasses, so the computing device may be utilized to perform the recognition calculations for gesture recognition. The computing device may identify a first gesture in the plurality of first images and determine the start point and end point of the region of the plurality of first images pointed to by the first gesture. The first gesture may be set as a translation action that embodies the user's intent to translate a portion of text. Various image-recognition algorithms, including various machine-learning algorithms, may be employed here for gesture recognition.
Herein, it is assumed that when a user wants to translate a portion of text, the user typically extends at least one finger (e.g., typically the index finger) or holds a pointer pen at the start position of the portion of text to be translated for a period of time, then slides the index finger or pointer pen along the text, and finally keeps the index finger or pointer pen at the end position of the portion of text for an additional period of time. The start position and the end position are used to indicate the portion of the whole text that the user is interested in translating.
Based on the above assumptions, a variety of recognition schemes can be devised that recognize a user's translation intent for a portion of text.
In one embodiment, where a human finger is used as the pointing object, the first gesture may include at least one finger extending and hovering over a fixed location of the plurality of first images beyond a first time threshold.
Fig. 3 shows an exemplary diagram of a first gesture with a finger of a human right hand as a pointing object according to an embodiment of the present application.
As shown in fig. 3, this is the right hand of a person with the index finger of the right hand extended, which is the more common pointing habit of a human hand. Of course, the present disclosure is not limited to the right hand of a person, and if a person is left handed, identification may also be made in a similar manner as set forth herein.
In order to accurately recognize such a gesture, 21 positions on the hand may be defined as 21 key points (not all shown in the drawing). For simplicity, only key points 8, 5, and 17 of the hand are labeled in Fig. 3: key point 8 is located at the tip (fingernail) of the right index finger, key point 5 at the metacarpophalangeal joint of the index finger, and key point 17 at the metacarpophalangeal joint of the little finger.
The "index finger extended" state (which may be referred to as a translation gesture) may be considered identified if two conditions are satisfied in the image: (1) key point 8 is to the upper left of the remaining 20 key points (i.e., all key points other than key point 8 among the 21 hand key points), that is, the abscissa and ordinate of key point 8 in the image are both the minimum among the 21 key points; (2) key point 5 is to the left of key point 17, that is, the abscissa of key point 5 in the image is smaller than that of key point 17.
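A minimal sketch of the two conditions above, assuming the 21 hand key points are available as (x, y) pixel coordinates indexed 0 to 20 with the image origin at the top left; the key-point detector itself is outside the scope of this sketch, and the function name is illustrative.

```python
def is_translation_gesture(keypoints):
    """keypoints: sequence of 21 (x, y) tuples in image coordinates (origin top-left).
    Condition (1): key point 8 (index fingertip) has the minimum x and minimum y
    among all 21 key points, i.e. it lies to the upper left of the others.
    Condition (2): key point 5 (index MCP joint) lies to the left of key point 17
    (little-finger MCP joint)."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    cond1 = keypoints[8][0] == min(xs) and keypoints[8][1] == min(ys)
    cond2 = keypoints[5][0] < keypoints[17][0]
    return cond1 and cond2
```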
Of course, the above manner of identifying the "index finger extended" state uses key points, but the disclosure is not limited thereto; other manners of identifying the "index finger extended" state are also possible in practice, for example using certain machine-learning algorithms. The gesture may also involve the extension of other fingers, or of at least one finger, etc.
After the "index finger extended" condition is identified, it is further determined that the index finger has remained in a fixed position on the plurality of first images beyond a first time threshold. The first time threshold is set to prevent erroneous recognition in the case where the user does not have a translation intention, but simply points somewhere with his hand, and does not stay long enough.
In the case of a pointing pen as a pointing object, the first gesture may include the tip of the pointing pen resting on a fixed location of the plurality of first images for more than a predetermined time threshold. In this case, the tip of the pointing pen may be identified and it is determined that the tip stays in a fixed position on the plurality of first images beyond a predetermined time threshold.
Here, the predetermined time threshold may be set to t_s, with a value range of, for example, [500, 2000] ms. Of course, this is merely an example, and the time threshold may take another value.
To determine whether the gesture dwells at a fixed location in the first images, it is possible to detect whether the coordinates of the index fingertip (e.g., key point 8) or the pen tip stay within a small area, because the fingertip or pen tip is not necessarily completely stationary. A circular region with a radius of 20 to 50 pixels may generally be centered on the coordinates of key point 8 where the "translation gesture" was first detected. If the coordinates of key point 8 of the translation gesture remain within this circular region throughout the period t_s, it can be considered that the translation gesture (extended index finger or pen tip) has stayed at a fixed position in the image for more than time t_s, and it is determined that the first gesture is recognized.
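The dwell check described above could be sketched as follows, assuming per-frame timestamps in milliseconds and one fingertip (or pen-tip) coordinate per frame; the radius and time threshold are example values taken from the ranges given in the text.

```python
import math

DWELL_RADIUS_PX = 30   # example value within the 20-50 pixel range above
DWELL_TIME_MS = 800    # example value of t_s within the [500, 2000] ms range above

class DwellDetector:
    """Reports True once the pointing tip has stayed inside a small circular region
    (centred where the translation gesture was first seen) for DWELL_TIME_MS."""

    def __init__(self):
        self.center = None
        self.start_ms = None

    def update(self, tip_xy, now_ms):
        if self.center is None or math.dist(tip_xy, self.center) > DWELL_RADIUS_PX:
            # First observation, or the tip left the region: restart the timer here.
            self.center = tip_xy
            self.start_ms = now_ms
            return False
        return now_ms - self.start_ms >= DWELL_TIME_MS
```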
Of course, the first gesture of the translation operation may be set to another gesture and recognized by another method, which is not illustrated here. The above-described manner of identifying the extension of the index finger is not limiting, and the extension of the index finger may be identified in other manners. In addition, the pointing object is not limited to an index finger or other fingers and a pointing pen, as other pointing objects are also suitable for use with the various embodiments of the present disclosure.
In order to avoid misrecognition and resource waste, the recognition of the first gesture may be started after the intelligent terminal is set to the image translation mode.
In addition, when the computing device does not recognize the first gesture indicative of a translation action, other gesture recognition processes may be performed or recognition may be stopped directly.
Next, as described previously, because the user may want to translate only a portion of text, rather than a full page of text, it is necessary to determine which portion of text the user is interested in.
It may be assumed that the area the user sweeps over with the pointing object is the translation area of interest to the user. Thus, after the computing device recognizes the first gesture in the plurality of first images, the start point and end point of the area of the plurality of first images pointed to by the first gesture (also referred to as the pointed area) may also be determined, in order to determine the translation area of interest to the user.
FIG. 4A illustrates a flowchart of a process 400 by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture, according to an embodiment of the present application.
As shown in Fig. 4A, in one embodiment, obtaining the recognition determination that the computing device recognized the first gesture in the plurality of first images and the information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture includes performing the following steps by the computing device.
Here, if the computing device determines that the intelligent terminal has set the image translation mode, an initialization may be performed, for example, setting the state of the first gesture to the first state. The first state may be a set "not in focus" state to indicate that a new gesture recognition process is currently in progress, avoiding conflicts with previous gesture recognition processes.
At step 410, a first gesture is identified in a plurality of first images.
Since the first gesture may be made by the user at the start point of the pointed area or at the end point of the pointed area, it is necessary to determine whether the first gesture at this time is at the start point or the end point. Specifically, in step 420, if the first gesture was not previously recognized (i.e., this is the first recognition) or the state of the first gesture is the first state (i.e., a new gesture recognition after initialization), the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first start point, and the state of the first gesture is set to the second state.
Here, the position pointed by the first posture of the first state may be a center position of the circular region.
Here, the second state may be an "acquire focus" state.
In step 430, if the first gesture is recognized again in the plurality of first images and the state of the first gesture is the second state, the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first end point.
Here, the position pointed by the first posture of the second state may be a center position of the circular region.
In this way, the start point and the end point of the pointed area can be determined by determining whether the twice recognized first gesture is in the first state or the second state.
At step 440, the state of the first gesture is set to the first state, indicating that this round of the recognition process has been completed and a new round of recognition can begin.
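Steps 410 to 440 amount to a simple two-state machine; a minimal sketch under the naming assumptions above ("not in focus" as the first state, "acquire focus" as the second state) is given below. The class and method names are illustrative.

```python
NOT_IN_FOCUS = "not in focus"    # first state
ACQUIRE_FOCUS = "acquire focus"  # second state

class PointedRegionTracker:
    """Records the first start point on one recognised gesture and the first end
    point on the next, then resets to the first state (steps 410-440)."""

    def __init__(self):
        self.state = NOT_IN_FOCUS
        self.start_point = None

    def on_gesture(self, pointed_xy):
        if self.state == NOT_IN_FOCUS:
            # Step 420: record the first start point and enter the second state.
            self.start_point = pointed_xy
            self.state = ACQUIRE_FOCUS
            return None
        # Steps 430-440: record the first end point and return to the first state.
        start, end = self.start_point, pointed_xy
        self.state = NOT_IN_FOCUS
        self.start_point = None
        return start, end
```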
Next, since the resolutions of the first, second, and third images differ, their image coordinate systems also differ. The coordinates of the start point and end point of the pointed area recognized in the first image therefore differ from the coordinates of the corresponding positions in the corresponding third image (preview image) and in the second image of the second resolution (the image photographed at high resolution), so coordinate conversion is necessary.
For example, denote the image resolution of the third image (preview image) as w_p x h_p and the coordinates of a point in its coordinate system as (x_p, y_p); denote the image resolution of the first image (the thumbnail obtained by scaling the third image) as w_r x h_r and its coordinates as (x_r, y_r); and denote the image resolution of the second image (the image photographed at high resolution) as w_t x h_t and its coordinates as (x_t, y_t). For the same camera, the coordinates of corresponding points in the three coordinate systems satisfy:
x_p / w_p = x_r / w_r = x_t / w_t, y_p / h_p = y_r / h_r = y_t / h_t (Equation 1)
Specifically, in step 450, the coordinates of the first start point and the first end point are converted into the coordinates of the second start point and the second end point in the second image, respectively, according to the coordinate mapping relationship (e.g., Equation 1 above) between the plurality of first images of the first resolution and the second image of the second resolution.
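A sketch of the conversion in step 450, assuming the simple proportional relation of Equation 1 between the thumbnail (first resolution) and the photographed image (second resolution); the function name and example coordinates are illustrative.

```python
def map_point(point, src_size, dst_size):
    """Map (x, y) from an image of resolution src_size = (w, h) to the
    corresponding point in an image of resolution dst_size, per Equation 1."""
    x, y = point
    (w_src, h_src), (w_dst, h_dst) = src_size, dst_size
    return x * w_dst / w_src, y * h_dst / h_src

# Example: map a first start point from a 320 x 240 thumbnail
# to the corresponding second start point in a 1920 x 1080 photograph.
second_start = map_point((100, 60), (320, 240), (1920, 1080))  # -> (600.0, 270.0)
```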
In this way, OCR and translation can subsequently be performed on the portion of the second-resolution (i.e., high-resolution) image that the user is actually interested in translating (i.e., the region of interest), based on the coordinates of the second start point and the second end point in the second image.
In step 460, the coordinates of the second start point and the second end point are sent to the intelligent terminal as the start point and the end point of the area pointed to by the first gesture.
Of course, in another embodiment, the coordinates of the first start point and the first end point in the first images (thumbnails) may instead be transmitted to the intelligent terminal, and the intelligent terminal converts them into the coordinates of the second start point and the second end point in the second image according to the coordinate mapping relationship (e.g., Equation 1 above) between the plurality of first images of the first resolution and the second image of the second resolution. To cover both embodiments (transmitting the coordinates of the first start point and first end point, or transmitting the coordinates of the second start point and second end point, to the intelligent terminal), these coordinates are collectively referred to as information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture. The former embodiment is described below in conjunction with Fig. 4B.
FIG. 4B illustrates a flowchart of another process 400' by which a computing device obtains a recognition determination of a first gesture in a plurality of first images and information related to the start point and end point of the region of the plurality of first images pointed to by the first gesture, according to an embodiment of the present application.
At step 410', a first gesture is identified in the plurality of first images.
In step 420', if the first gesture was not previously recognized or the state of the first gesture is the first state, the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first start point, and the state of the first gesture is set to the second state.
At step 430', if the first gesture is recognized again in the plurality of first images and the state of the first gesture is the second state, the position pointed to by the first gesture in the plurality of first images at this time is recorded as the first endpoint.
At step 440', the state of the first gesture is set to a first state.
At step 450', the coordinates of the first start point and the first end point are sent to the intelligent terminal.
Next, returning to fig. 2, at step 230, the intelligent terminal receives a recognition determination that the computing device recognized the first gesture in the plurality of first images and information related to the start and end points of the region of the plurality of first images pointed to by the first gesture.
The information related to the start point and end point of the region pointed to by the first gesture may include either the untransformed coordinates of the first start point and first end point, or the coordinates of the second start point and second end point already converted into the second image. Note that if this information includes the untransformed coordinates of the first start point and first end point, the intelligent terminal converts them, according to the coordinate mapping relationship between the plurality of first images of the first resolution and the second image of the second resolution, into the coordinates of the second start point and the second end point in the second image as the start point and end point of the region pointed to by the first gesture.
Next, at step 240, the intelligent terminal is triggered to shoot a second image at a second resolution and to intercept the region of interest in the second image based on the information related to the start point and end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution.
Specifically, after receiving a recognition determination that the computing device recognizes the first gesture in the plurality of first images and information related to a start point and an end point of an area pointed to by the first gesture in the plurality of first images, by the intelligent terminal: determining a region of interest (Region Of Interest, ROI) from a start point and an end point of the region pointed to by the first gesture; the region of interest is intercepted at the second resolution and sent to the computing device for recognition and translation by the computing device based on the region of interest.
In one embodiment, determining the region of interest from the start point and the end point of the region pointed to by the first gesture comprises: taking a starting point and an end point of a region pointed by the first gesture as an upper left point and a lower right point of a rectangular frame of the region of interest respectively; or the starting point and the end point of the area pointed by the first gesture are respectively used as the left lower point and the right lower point of the rectangular frame of the area of interest, and the rectangular frame of the area of interest has a preset height.
Fig. 5A shows a schematic diagram of the upper left and lower right points of the rectangular box with the start and end points of the region pointed to by the first gesture as the region of interest, respectively, according to an embodiment of the present application.
As shown in Fig. 5A, it is assumed that the user is accustomed, or has been taught, to slide at least one finger from a start point at the upper-left corner of the region of interest to an end point at its lower-right corner, so that the start point and end point are used as the upper-left and lower-right points of the rectangular box of the region of interest, respectively.
Fig. 5B shows a schematic diagram of the starting point and the ending point of the region pointed to by the first gesture as the lower left and lower right points of the rectangular box of the region of interest (the rectangular box of the region of interest has a predetermined height), respectively, according to an embodiment of the present application.
As shown in Fig. 5B, it is assumed that the user is accustomed, or has been taught, to slide at least one finger from a start point at the lower-left corner of the region of interest to an end point at its lower-right corner, so that the start point and end point are used as the lower-left and lower-right points of the rectangular box of the region of interest, respectively, with the rectangular box having a predetermined height. The predetermined height may cover one or several text lines.
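A sketch of the two ways of forming the region-of-interest rectangle from the start and end points (Figs. 5A and 5B); the predetermined height used in the second way is an assumed example value.

```python
def roi_from_diagonal(start, end):
    """Fig. 5A: start and end are the upper-left and lower-right corners."""
    (x1, y1), (x2, y2) = start, end
    return x1, y1, x2, y2  # left, top, right, bottom

def roi_from_baseline(start, end, height_px=80):
    """Fig. 5B: start and end are the lower-left and lower-right corners; the box
    extends upward by a predetermined height (height_px is an assumed example,
    e.g. covering one or a few text lines)."""
    (x1, y1), (x2, y2) = start, end
    bottom = max(y1, y2)
    return x1, max(0, bottom - height_px), x2, bottom
```

The resulting (left, top, right, bottom) box can then be used to crop the region of interest from the second image, e.g. roi = second_image[top:bottom, left:right] for a NumPy-style array.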
Of course, the above two ways are only exemplary ways of determining the region of interest through the start point and the end point, but the disclosure is not limited thereto, and the region of interest may be determined in other ways, and the shape of the region of interest may be varied.
Referring back to Fig. 2, at step 250, the pointed region (i.e., the region of interest) in the second image is sent by the intelligent terminal to the computing device, so that the computing device can recognize and translate the text contained in the region of interest and the intelligent terminal can receive the resulting translation.
Thus, by transmitting the low-resolution first images and using them to identify the pointed region, a translation of only the part of the high-resolution second image related to the pointed region is obtained. This reduces the amount of transmitted data, reduces the amount of data used to identify the pointed region, translates only the pointed region in a targeted manner, and ensures the accuracy of the translation of the pointed region.
In the process of recognizing and translating the region of interest, the computing device performs the following steps: recognizing text from the region of interest; translating the recognized text; and sending the resulting translation to the intelligent terminal.
The intelligent terminal then renders and broadcasts the translation.
Fig. 6 illustrates a flowchart of a process 600 for rendering a broadcast of a translation by a smart terminal according to an embodiment of the present application. Fig. 7 is a schematic diagram illustrating rendering and broadcasting of translations by an intelligent terminal according to an embodiment of the present application.
In step 601, the coordinates of the first start point and the first end point in the plurality of first images are respectively converted into the coordinates of the third start point and the third end point in the image of the third resolution (two black dots as shown in fig. 7) according to the coordinate mapping relation between the plurality of first images of the first resolution and the plurality of third images of the third resolution.
Here, as described above, the third image is a preview image, and the user desires to render the translation of the region of interest in the preview image viewable on the smart glasses, so that the coordinates of the region of interest in the preview image are obtained.
In step 602, the third start point, the third end point, and a line connecting them (a diagonal of the region of interest as shown in Fig. 7) are marked in the third image, and the region jointly determined by the third start point and the third end point is rendered as the translation region, which contains the text to be translated.
In step 603, a translation result box (such as the translation result box shown in fig. 7) is superimposed on the translation region in the third image for preview. As shown in fig. 7, the translation result box may be superimposed in a popup form using a popup image area.
In step 604, the translated version is displayed in a translation results box.
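A sketch of steps 601 to 604, assuming OpenCV-style drawing on the preview (third) image with integer pixel coordinates; the pop-up box layout (its height and text placement) is an assumption for illustration only.

```python
import cv2

def render_translation(preview, third_start, third_end, translation, box_height=60):
    """Mark the third start/end points (step 602), connect them with a line, and
    overlay a simple translation result box above the translation region with the
    translated text inside it (steps 603-604)."""
    cv2.circle(preview, third_start, 5, (0, 0, 0), -1)
    cv2.circle(preview, third_end, 5, (0, 0, 0), -1)
    cv2.line(preview, third_start, third_end, (0, 0, 0), 1)
    x1 = min(third_start[0], third_end[0])
    x2 = max(third_start[0], third_end[0])
    y1 = min(third_start[1], third_end[1])
    top = max(0, y1 - box_height)
    cv2.rectangle(preview, (x1, top), (x2, y1), (255, 255, 255), -1)
    cv2.putText(preview, translation, (x1 + 5, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)
    return preview
```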
Under the condition that synchronous voice broadcasting of the translation is needed, the intelligent terminal can further perform the following steps.
In step 605, the translation is initialized to an un-broadcast state. The un-broadcast state may cause the translated text to be rendered in a distinct text format, such as gray.
In step 606, corresponding voice audio is obtained from the translated version.
In step 607, the length of the translation and the length of the voice audio in the un-broadcast state are obtained.
In step 608, the voice audio is broadcast. The translation text can be used to call a TTS service (e.g., on the mobile phone or in the cloud) to generate the voice audio corresponding to the translation.
In step 609, the total duration of the voice audio currently being broadcast is obtained (denoted as t_s).
In step 610, the broadcast length of the translation (denoted as l_r) is calculated from the length of the translation in the un-broadcast state (denoted as l_c, e.g., the number of characters), the total duration of the voice audio (t_s), and the duration of the voice audio already broadcast (denoted as t_u). The calculation formula is as follows:
l_r = l_c x t_u / t_s (Equation 2)
In step 611, the portion of the translation of broadcast length l_r (such as the broadcast translation shown in Fig. 7) is rendered in synchronization with the currently broadcast voice audio, for example in a highlighted text color or in bold. The rest of the translation is the text yet to be broadcast, as shown in Fig. 7.
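A sketch of the synchronization in steps 609 to 611, using the proportional formula above: the number of translation characters rendered as already broadcast grows with the elapsed audio time. The function name and example values are illustrative.

```python
def split_broadcast(translation, t_s_ms, t_u_ms):
    """Return (broadcast_part, pending_part) of the translation text, where the
    broadcast length l_r = l_c * t_u / t_s (l_c = number of characters,
    t_s = total audio duration, t_u = duration already broadcast)."""
    l_c = len(translation)
    l_r = min(l_c, int(l_c * t_u_ms / t_s_ms))
    return translation[:l_r], translation[l_r:]

# Example: 40% of the audio has played, so roughly 40% of the characters
# are rendered in the highlighted "broadcast" style.
done, pending = split_broadcast("example translated sentence", 5000, 2000)
```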
At step 612, if the voice audio broadcast is complete, the broadcast is ended. At this time, it may also be set that the translation result box automatically disappears after the voice broadcast is completed. Of course, the display of the translation result box may be ended in advance by other flow control.
In step 613, if the voice audio announcement is interrupted due to the recognition of the first gesture of the pointing object, the announcement is ended. At this point a new round of the first gesture recognition process of the pointing object may be started.
Therefore, the translation result box has the advantage that it focuses only on the translation region the user is interested in, so less content is displayed as the translation result; and voice broadcasting handles only the translation of the region of interest selected by the user rather than the translation of the whole image content, which avoids a broadcast-selection operation by the user. Moreover, according to the embodiment of the application, the voice broadcast is synchronized with the rendering of the translation, so that the user can follow the translation more intuitively.
Thus, according to the embodiment of the application, the amount of image data transmitted by interconnection between devices can be reduced by transmitting a low-resolution image and recognizing a specific gesture in the low-resolution image, and the amount of image data transmitted by interconnection between devices can be reduced by acquiring a translation region in a high-resolution image as a region of interest to transmit according to the start point and the end point of the specific gesture recognized in the low-resolution image. Further, OCR and translation can be performed from the photographed image of the high-resolution region of interest, and definition and translation accuracy of the translated region are ensured. Only translating the region of interest reduces the amount of OCR and translation computation and the amount of data returned by the interconnection. The translation result of the interested region is the translation content focused by the user, and the intelligent glasses terminal can directly conduct rendering and voice synthesis broadcasting after acquiring the translation text, so that the operation of selecting broadcasting by the user is reduced.
Fig. 8 illustrates a flowchart of an image processing method 800 at a computing device according to an embodiment of the application.
The computing device includes, for example, a cell phone, a tablet computer, a cloud server, and the like.
As shown in fig. 8, the image processing method 800 includes: step 810, receiving a plurality of first images of a first resolution sent from the intelligent terminal; step 820, recognizing a first gesture in the plurality of first images and determining, according to the recognized first gesture, a start point and an end point of the region pointed to by the first gesture in the plurality of first images; step 830, triggering the intelligent terminal to shoot a second image of a second resolution and sending the start point and the end point of the region pointed to by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest of the second resolution according to the start point and the end point of the region pointed to by the first gesture, wherein the region of interest is commonly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution; and step 840, receiving the region of interest from the intelligent terminal to perform character recognition and translation on the region of interest, wherein the second resolution is higher than the first resolution.
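A minimal sketch of how method 800 might run on the computing device is given below. The terminal interface, gesture recognizer, OCR engine and translator are hypothetical placeholders (the application does not name concrete components), and the two-state handling of the start and end points follows the description of the first-gesture states.

```python
def image_processing_method_800(terminal, recognizer, ocr, translator):
    state, start_pt = "first", None                       # "first" state: waiting for a start point
    for frame in terminal.receive_low_res_frames():       # step 810: first images at the first resolution
        hit = recognizer.detect_first_gesture(frame)      # step 820: recognize the first gesture
        if hit is None:
            continue
        if state == "first":
            start_pt, state = hit.point, "second"         # record the start point of the pointed region
        else:
            end_pt, state = hit.point, "first"            # record the end point of the pointed region
            terminal.trigger_high_res_capture()           # step 830: shoot the second image
            terminal.send_region_endpoints(start_pt, end_pt)
            roi = terminal.receive_region_of_interest()   # step 840: ROI at the second resolution
            return translator.translate(ocr.recognize(roi))
```

Whether the coordinate mapping to the second resolution is done on the computing device or on the intelligent terminal is an implementation choice; both variants are covered by the claims below.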
Thus, according to the embodiment of the application, the amount of image data transmitted over the interconnection between devices is reduced by transmitting low-resolution images and recognizing the specific gesture in them, and by transmitting only the translation region of the high-resolution image, captured as a region of interest according to the start point and end point of the gesture recognized in the low-resolution images. Further, OCR and translation are performed on the captured high-resolution region of interest, which preserves the definition and translation accuracy of the translated region. Translating only the region of interest reduces the OCR and translation computation and the amount of data returned over the interconnection. The translation result of the region of interest is exactly the content the user cares about, and the smart glasses terminal can render it and broadcast it through speech synthesis directly after receiving the translated text, which spares the user a broadcast selection operation.
Fig. 9 shows a block diagram of a smart terminal 900 according to an embodiment of the present application.
Here, the smart terminal 900 may include smart glasses or the like.
As shown in fig. 9, the intelligent terminal 900 includes a photographing device 910 and a transceiving device 920. The photographing device 910 obtains a plurality of first images of a first resolution, and the transceiving device 920 transmits the plurality of first images to a computing device connected to the intelligent terminal 900.
The transceiving device 920 receives the recognition determination that the computing device recognized the first gesture in the plurality of first images, together with information related to the start point and the end point of the region pointed to by the first gesture in the plurality of first images; the photographing device 910 is triggered to capture a second image at a second resolution, and the region of interest is cut out of the second image based on the information related to the start point and the end point of the region pointed to by the first gesture, wherein the second resolution is higher than the first resolution.
The transceiving device 920 sends the region of interest to the computing device and receives the translated text that the computing device recognized and translated from the text contained in the region of interest.
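A sketch of the terminal-side cropping is given below; the proportional coordinate mapping and the NumPy-style slicing are illustrative assumptions, since the start and end points may arrive either in first-image coordinates or already mapped into the second image.

```python
import numpy as np

def crop_region_of_interest(second_image: np.ndarray,
                            start_pt, end_pt,
                            first_res, second_res) -> np.ndarray:
    """Cut the region of interest out of the high-resolution second image.

    start_pt / end_pt: (x, y) points of the pointed region in first-image coordinates
    first_res / second_res: (width, height) of the first and second resolutions
    """
    sx = second_res[0] / first_res[0]
    sy = second_res[1] / first_res[1]
    (x1, y1), (x2, y2) = start_pt, end_pt
    # Treat the two mapped points as opposite corners of the ROI rectangle.
    left, right = sorted((int(x1 * sx), int(x2 * sx)))
    top, bottom = sorted((int(y1 * sy), int(y2 * sy)))
    return second_image[top:bottom, left:right]
```

The claims also allow the two points to be used as the lower-left and lower-right corners of a rectangle with a predetermined height, in which case the vertical extent comes from that height rather than from the points themselves.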
Fig. 10 illustrates a block diagram of a computing device 1000 according to an embodiment of the application. Here, the computing device 1000 may include a mobile terminal such as a cell phone, tablet computer, or the like.
The computing device 1000 includes: a receiving device 1010 configured to receive a plurality of first images of a first resolution from the intelligent terminal; a recognition device 1020 configured to recognize a first gesture in the plurality of first images and to determine a start point and an end point of the region pointed to by the first gesture in the plurality of first images based on the recognized first gesture; and a transmitting device 1030 configured to trigger the intelligent terminal to shoot a second image of a second resolution and to transmit to the intelligent terminal the start point and the end point of the region pointed to by the first gesture, so that the intelligent terminal obtains a region of interest of the second resolution from the start point and the end point of the region pointed to by the first gesture, the region of interest being commonly determined by mapping the start point and the end point of the region pointed to by the first gesture in the first images into the second image of the second resolution. The region of interest is then received from the intelligent terminal for character recognition and translation, wherein the second resolution is higher than the first resolution.
Thus, by transmitting the low-resolution first images, identifying the pointed region in them, and translating only the part of the high-resolution second image related to that region, the amount of transmitted data and the amount of data processed for identifying the pointed region are both reduced, while translating only the pointed region in a targeted manner ensures the accuracy of its translation.
FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the application.
The electronic device may include a processor (H1); a storage medium (H2) coupled to the processor (H1) and having stored therein computer executable instructions for performing the steps of the methods of embodiments of the present application when executed by the processor.
The processor (H1) may include, but is not limited to, for example, one or more processors or microprocessors or the like.
The storage medium (H2) may include, for example, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, and computer storage media (e.g., a hard disk, a floppy disk, a solid-state disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disc, etc.).
In addition, the electronic device may include, but is not limited to, a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and an input/output device (H6) (e.g., keyboard, mouse, speaker, etc.), among others.
The processor (H1) may communicate with external devices (H5, H6, etc.) via a wired or wireless network (not shown) through an I/O bus (H4).
The storage medium (H2) may also store at least one computer executable instruction for performing the functions and/or steps of the methods in the embodiments described in the present technology when executed by the processor (H1).
In one embodiment, the at least one computer-executable instruction may also be compiled or otherwise formed into a software product in which one or more computer-executable instructions, when executed by a processor, perform the functions and/or steps of the methods described in the embodiments of the technology.
FIG. 12 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the application.
As shown in fig. 12, the computer-readable storage medium 1220 has instructions stored thereon, such as computer-readable instructions 1210. When executed by a processor, the computer-readable instructions 1210 may perform the various methods described above. Computer-readable storage media include, but are not limited to, volatile memory and/or nonvolatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. For example, the computer-readable storage medium 1220 may be connected to a computing device such as a computer, and the computing device may then run the computer-readable instructions 1210 stored on the computer-readable storage medium 1220 to perform the various methods described above.
Note that advantages, effects, and the like mentioned in this disclosure are merely examples and are not to be construed as necessarily essential to the various embodiments of the application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words meaning "including but not limited to," and may be used interchangeably therewith. The terms "or" and "and" as used herein refer to, and may be used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and may be used interchangeably with, the phrase "such as, but not limited to."
The step flow diagrams in this disclosure and the above method descriptions are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. The order of steps in the above embodiments may be performed in any order, as will be appreciated by those skilled in the art. Words such as "thereafter," "then," "next," and the like are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of these methods. Furthermore, any reference to an element in the singular, for example, using the articles "a," "an," or "the," is not to be construed as limiting the element to the singular.
In addition, the steps and means in the various embodiments herein are not limited to practice in a certain embodiment, and indeed, some of the steps and some of the means associated with the various embodiments herein may be combined according to the concepts of the present application to contemplate new embodiments, which are also included within the scope of the present application.
The individual operations of the above-described method may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software components and/or modules including, but not limited to, circuitry for hardware, an Application Specific Integrated Circuit (ASIC), or a processor.
The various illustrative logical blocks, modules, and circuits described herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a Field Programmable Gate Array (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The methods disclosed herein include one or more actions for implementing the described methods. The methods and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and/or use of specific actions may be modified without departing from the scope of the claims.
The functions described above may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as instructions on a tangible computer-readable medium. A storage medium may be any available tangible medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Accordingly, the present disclosure may also include a computer program product, wherein the computer program product may perform the methods, steps and operations presented herein. For example, such a computer program product may be a computer software package, computer code instructions, a computer-readable tangible medium having computer instructions tangibly stored (and/or encoded) thereon, the instructions being executable by a processor to perform operations described herein. The computer program product may comprise packaged material.
Furthermore, modules and/or other suitable means for performing the methods and techniques described herein may be downloaded and/or otherwise obtained by the user terminal and/or base station as appropriate. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Or the various methods described herein may be provided via a storage component such that the user terminal and/or base station can obtain the various methods when coupled to or providing the storage component to the device. Further, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (14)

1. An image processing method of an intelligent terminal includes
Obtaining, by an intelligent terminal, a plurality of first images of a first resolution;
transmitting the plurality of first images to a computing device connected to the intelligent terminal;
Receiving a recognition determination that the computing device recognized a first gesture in the plurality of first images and information related to a start point and an end point of a region of the plurality of first images pointed to by the first gesture;
Triggering the intelligent terminal to shoot a second image with a second resolution, and intercepting a region of interest in the second image according to information related to a start point and an end point of the region pointed by the first gesture, wherein the second resolution is higher than the first resolution;
And the intelligent terminal sends the region of interest to the computing device so as to receive translated text which is recognized and translated by the computing device and contained in the region of interest.
2. The method of claim 1, wherein the first gesture comprises at least one finger extending and hovering over a fixed location of the plurality of first images beyond a first time threshold.
3. The method of claim 1, wherein the first gesture comprises a tip of a pointing pen resting on a fixed location of the plurality of first images for more than a predetermined time threshold.
4. The method of claim 1, wherein the receiving the recognition determination that the computing device recognized the first gesture in the plurality of first images and the information related to the start and end points of the region of the plurality of first images pointed to by the first gesture comprises:
By the computing device:
identifying a first gesture in the plurality of first images;
If the first gesture was not previously recognized or the state of the first gesture is a first state, recording the position pointed to by the first gesture in the plurality of first images at this time as a first starting point, and setting the state of the first gesture at this time to a second state;
If the first gesture is recognized again in the plurality of first images and the state of the first gesture is a second state, recording that the position pointed by the first gesture in the plurality of first images at the moment is a first end point;
setting the state of the first gesture as a first state;
Converting coordinates of the first starting point and the first end point into coordinates of a second starting point and a second end point in the second image respectively according to coordinate mapping relations between the plurality of first images with the first resolution and the second image with the second resolution;
sending coordinates of the second starting point and the second ending point to the intelligent terminal as the starting point and the ending point of the area pointed by the first gesture;
And the intelligent terminal:
determining the region of interest from the start point and the end point of the pointed region, wherein,
Taking the starting point and the end point of the area pointed by the first gesture as the upper left point and the lower right point of the rectangular frame of the region of interest respectively; or the starting point and the end point of the area pointed by the first gesture are respectively used as a left lower point and a right lower point of the rectangular frame of the area of interest, and the rectangular frame of the area of interest has a preset height;
intercepting the region of interest at the second resolution and transmitting it to the computing device for the computing device to perform recognition and translation according to the region of interest.
5. The method of claim 4, wherein the determining the region of interest from the start and end points of the region pointed to by the first gesture comprises:
The received recognition determination that the computing device recognized the first gesture in the plurality of first images and the information related to the start point and the end point of the region pointed to by the first gesture in the plurality of first images include:
By the computing device:
identifying a first gesture in the plurality of first images;
If the first gesture was not previously recognized or the state of the first gesture is a first state, recording the position pointed to by the first gesture in the plurality of first images at this time as a first starting point, and setting the state of the first gesture at this time to a second state;
If the first gesture is recognized again in the plurality of first images and the state of the first gesture is a second state, recording that the position pointed by the first gesture in the plurality of first images at the moment is a first end point;
setting the state of the first gesture as a first state;
Sending the coordinates of the first starting point and the first ending point to the intelligent terminal;
And the intelligent terminal:
Converting coordinates of the first starting point and the first end point into coordinates of a second starting point and a second end point in the second image as the starting point and the end point of the area pointed by the first gesture respectively according to the coordinate mapping relation between the plurality of first images with the first resolution and the second image with the second resolution;
determining the region of interest from the start point and the end point of the pointed region, wherein,
Taking the starting point and the end point of the area pointed by the first gesture as the upper left point and the lower right point of the rectangular frame of the region of interest respectively; or the starting point and the end point of the area pointed by the first gesture are respectively used as a left lower point and a right lower point of the rectangular frame of the area of interest, and the rectangular frame of the area of interest has a preset height;
intercepting the region of interest at the second resolution and transmitting it to the computing device for the computing device to perform recognition and translation according to the region of interest.
6. The method of claim 1, further comprising,
Acquiring, by the intelligent terminal, a plurality of third images for previewing in real time with a third resolution;
The plurality of third images are scaled down into the plurality of first images,
Wherein the third resolution is greater than the first resolution and less than the second resolution.
7. The method of claim 6, wherein the intelligent terminal further performs the steps of:
Converting coordinates of the first start point and the first end point in the plurality of first images into coordinates of a third start point and a third end point in the image of the third resolution respectively according to a coordinate mapping relation between the plurality of first images of the first resolution and the plurality of third images of the third resolution;
Marking a third starting point and a third ending point and a connecting line between the third starting point and the third ending point in the third image, and rendering a region jointly determined by the third starting point and the third ending point as a translation region;
Superimposing a translation result box on the translation region in the third image for previewing;
and displaying the translated version in the translation result box.
8. The method of claim 7, wherein the intelligent terminal further performs the steps of:
Initializing the translated text into an un-broadcast state;
acquiring corresponding voice audio according to the translated version;
acquiring the length of the translation in the non-broadcast state and the length of the voice audio;
Broadcasting the voice audio;
acquiring the duration of the voice audio which is currently broadcasted;
Calculating the broadcasted length of the translation according to the length of the translation in the non-broadcasted state, the length of the voice audio and the duration of the current broadcasted voice audio;
Rendering the translated text with the broadcasted length synchronously with the voice audio broadcasted at present;
if the voice audio broadcasting is completed, ending the broadcasting;
And if the voice audio broadcasting is interrupted due to the fact that the first gesture of the pointing object is recognized, ending the broadcasting.
9. An intelligent terminal comprises a shooting device and a receiving and transmitting device,
The shooting device obtains a plurality of first images with first resolution, and the receiving and transmitting device transmits the plurality of first images to computing equipment connected with the intelligent terminal;
The receiving and transmitting device receives the identification judgment that the computing device identifies the first gesture in the plurality of first images and the information related to the starting point and the ending point of the area pointed by the first gesture in the plurality of first images, the shooting device triggers shooting a second image with a second resolution, and intercepts the region of interest in the second image according to the information related to the starting point and the ending point of the area pointed by the first gesture, wherein the second resolution is higher than the first resolution;
The transceiver device sends the region of interest to the computing device so as to receive translated text which is recognized and translated by the computing device from text contained in the region of interest.
10. An image processing method includes
Receiving a plurality of first images with a first resolution sent from the intelligent terminal;
Identifying a first gesture in the plurality of first images and determining a start point and an end point of a region pointed to by the first gesture in the plurality of first images based on the identified first gesture; triggering the intelligent terminal to shoot a second image with a second resolution and sending a start point and an end point of the area pointed by the first gesture to the intelligent terminal so that the intelligent terminal obtains a region of interest with the second resolution according to the start point and the end point of the area pointed by the first gesture, wherein the region of interest is commonly determined by mapping the start point and the end point of the area pointed by the first gesture in the first image into the second image with the second resolution;
and receiving the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
11. A computing device, comprising
A receiving device configured to receive a plurality of first images of a first resolution transmitted from the intelligent terminal;
A recognition device configured to recognize a first gesture in the plurality of first images and determine a start point and an end point of a region pointed to by the first gesture in the plurality of first images according to the recognized first gesture;
A transmitting device configured to trigger the intelligent terminal to shoot a second image with a second resolution, and transmit a start point and an end point of the area pointed by the first gesture to the intelligent terminal, so that the intelligent terminal obtains a region of interest with the second resolution according to the start point and the end point of the area pointed by the first gesture, wherein the region of interest is commonly determined by mapping the start point and the end point of the area pointed by the first gesture in the first image into the second image with the second resolution;
and receiving the region of interest from the intelligent terminal so as to perform character recognition and translation according to the region of interest, wherein the second resolution is higher than the first resolution.
12. An electronic device, comprising:
A memory for storing instructions;
A processor for reading instructions in said memory and performing the method of any of claims 1-8, 10.
13. A non-transitory storage medium having instructions stored thereon,
Wherein the instructions, when read by a processor, cause the processor to perform the method of any one of claims 1-8, 10.
14. A computer program product comprising computer instructions,
Wherein the instructions, when read by a processor, cause the processor to perform the method of any one of claims 1-8, 10.
CN202410236725.4A 2024-02-29 2024-02-29 Image processing method, intelligent terminal, device, medium and program product Pending CN118116022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410236725.4A CN118116022A (en) 2024-02-29 2024-02-29 Image processing method, intelligent terminal, device, medium and program product

Publications (1)

Publication Number Publication Date
CN118116022A true CN118116022A (en) 2024-05-31

Family

ID=91210127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410236725.4A Pending CN118116022A (en) 2024-02-29 2024-02-29 Image processing method, intelligent terminal, device, medium and program product

Country Status (1)

Country Link
CN (1) CN118116022A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination