CN114330374A - Fusion scene perception machine translation method, storage medium and electronic equipment - Google Patents

Fusion scene perception machine translation method, storage medium and electronic equipment

Info

Publication number
CN114330374A
CN114330374A (application CN202011079936.XA)
Authority
CN
China
Prior art keywords: scene, data, text, translation, translated
Prior art date
Legal status
Pending
Application number
CN202011079936.XA
Other languages
Chinese (zh)
Inventor
徐传飞
潘邵武
王成录
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011079936.XA priority Critical patent/CN114330374A/en
Priority to PCT/CN2021/119655 priority patent/WO2022073417A1/en
Publication of CN114330374A publication Critical patent/CN114330374A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/00: Pattern recognition (G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing)
    • G06F 40/126: Handling natural language data; text processing; use of codes for handling textual entities; character encoding
    • G06F 40/58: Handling natural language data; processing or translation of natural language; use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N 3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Abstract

The application relates to the technical field of neural network machine translation, and in particular to a fusion scene-aware machine translation method, a storage medium, and an electronic device. The fusion scene-aware machine translation method comprises the following steps: acquiring a text to be translated and scene perception data; determining the scene in which the electronic device is located according to the scene perception data; generating a scene tag corresponding to that scene; inputting the scene tag and the text to be translated together, as a source language sequence, into an encoder for translation to obtain encoded data fused with scene perception; and decoding the encoded data into the target language through a decoder for translation to obtain a scene-aware translation result. Scene perception data collected by the electronic device is used to generate a scene tag, and the scene tag and the text to be translated are then jointly encoded and decoded, yielding a translation that matches the scene of the text to be translated and greatly improving the translation accuracy of scenarized short texts.

Description

Fusion scene perception machine translation method, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of neural network machine translation, and in particular to a fusion scene-aware machine translation method, a storage medium, and an electronic device.
Background
Machine translation has evolved from early dictionary matching, through rule-based translation combining dictionaries with linguistic expert knowledge, to corpus-based Statistical Machine Translation. With the growth of computing power and the explosive growth of data, translation based on deep neural networks, i.e., Neural Machine Translation (NMT), has become increasingly widely used. NMT has developed in two phases: the first phase (2014-2017) was NMT based on the Recurrent Neural Network (RNN), whose core network architecture is an RNN; the second phase (2017 to the present) is NMT based on the Transformer neural network (NMT-Transformer), whose core network architecture is the Transformer model. In the field of machine translation, RNNs are gradually being replaced by Transformers.
However, current mainstream NMT-Transformer-based translation devices and products face the same problem: low translation accuracy on scenarized short texts. Such texts usually consist of only a few words or characters whose meaning is closely tied to the scene, yet they lack the key context information, so NMT-Transformer translations are often inaccurate.
Disclosure of Invention
The embodiments of the present application provide a fusion scene-aware machine translation method, a storage medium, and an electronic device. A scene tag is generated from scene perception data collected by the electronic device; the generated scene tag and the text to be translated are jointly used as a source language sequence and fused in the encoding stage of the Transformer network, which extracts the information in the source language sequence; finally, in the decoding stage of the Transformer network, that information is converted into the target language and decoded into a translation result that matches the scene of the text to be translated, greatly improving the translation accuracy of scenarized short texts.
In a first aspect, an embodiment of the present application provides a fusion scene-aware machine translation method for an electronic device with a machine translation function, comprising: acquiring a text to be translated and scene perception data, wherein the scene perception data is collected by the electronic device and is used to determine the scene in which the electronic device is located; determining the scene in which the electronic device is located according to the scene perception data; generating, based on that scene, a corresponding scene tag; inputting the scene tag and the text to be translated together as a source language sequence into an encoder for translation to obtain encoded data fused with scene perception; and decoding the scene-fused encoded data through a decoder for translation to obtain a scene-aware translation result.
For example, a mobile phone with a machine translation function can collect scene perception data, use it to determine the scene the phone is currently in, and generate a scene tag corresponding to that scene. When the phone is used for translation, it fuses the generated scene tag into the encoding and decoding of machine translation and thereby obtains a scene-aware translation result.
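The following is a minimal, runnable sketch of the flow just described, under the assumption that the individual components exist; the helper names and their stub bodies (collect_scene_data, generate_scene_tag, translate) are illustrative placeholders, not components defined by this application.

```python
# A minimal sketch of the scene-aware translation flow; all helpers are stubs.

def collect_scene_data() -> dict:
    # Stand-in for GPS / microphone / camera readings collected by the device.
    return {"location": "shopping mall", "noise": "indoor announcements"}

def generate_scene_tag(scene_data: dict) -> str:
    # Stand-in for the scene classifier: map scene state to a scene tag.
    return "Restaurant" if scene_data["location"] == "shopping mall" else "Outdoor"

def translate(source_sequence: str) -> str:
    # Stand-in for the NMT-Transformer encoder-decoder.
    return f"<translation of: {source_sequence}>"

def scene_aware_translate(text_to_translate: str) -> str:
    scene_data = collect_scene_data()                     # acquire scene perception data
    scene_tag = generate_scene_tag(scene_data)            # determine scene, generate tag
    source_sequence = f"{scene_tag} {text_to_translate}"  # fuse tag + text as source sequence
    return translate(source_sequence)                     # encode and decode

print(scene_aware_translate("battered whiting 16.0M | 19.0NM"))
```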
In a possible implementation of the first aspect, the method further includes: determining the characteristics of the scene in which the electronic device is located according to the scene perception data to obtain scene state data, the scene state data being used to represent that scene; and performing classification statistics on the scene state data to determine the scene in which the electronic device is located.
For example, a mobile phone with a machine translation function can determine the characteristics of the current scene from the collected scene perception data and record the determined characteristics as scene state data. Classifying and aggregating the scene state data yields scene categories, each containing one or more pieces of scene state data and each corresponding to a scene in which the phone may be located.
In a possible implementation of the first aspect, the method further includes: the scene perception data is collected by a detection element provided in the electronic device, the detection element including at least one of a GPS element, a camera, a microphone, and a sensor. The scene perception data includes one or more of position data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
For example, the mobile phone may continuously acquire its current position data through its GPS element, collect surrounding sound data through its microphone, collect the temperature and light intensity of the current environment through its temperature sensor and ambient light sensor, and collect its current angular motion data through a gyroscope; it may also collect image data of characteristic objects in the surroundings through a camera, or capture image data of the text to be translated. Some of the scene perception data collected by the phone may turn out to be invalid for further use, but most of it is valid for subsequently determining the scene in which the phone is located and can be used for that determination.
In a possible implementation of the first aspect, determining the characteristics of the scene in which the electronic device is located according to the scene perception data includes one or more of the following: determining the location name of the scene according to the position data; determining characteristic text or characteristic objects in the scene according to one or more of the text and target objects in the image data, and determining the environmental characteristics of the scene; determining the noise type or noise level in the scene according to one or more of the frequency, voiceprint, and amplitude in the sound data, and determining whether the scene is indoor or outdoor; determining the motion state of the electronic device in the scene according to the acceleration data and the angular motion data; and determining the temperature level and light intensity level of the scene according to the ambient temperature data and ambient light intensity data, and determining whether the scene is indoor or outdoor.
For example, the mobile phone may determine the location name of the scene from the position data, such as a particular shopping mall or airport. It may also identify characteristic objects in the scene from image data collected by the camera: if the collected images show subway seats and subway station information, the scene can be determined to be riding the subway, and if the text to be translated is the station information itself, the phone can translate that scenarized short text with the subway-riding scene fused in. The phone can further determine from the collected sound data whether it is currently indoors or outdoors, and some sounds even allow a preliminary guess at the scene: construction noise suggests the phone is outdoors, while the sound of mahjong tiles colliding suggests it is indoors, probably in a mahjong-playing scene. Acceleration data from the acceleration sensor and angular motion data from the gyroscope can be used to determine the phone's current motion state; for example, the acceleration patterns of riding a subway and riding a bus differ, so the type of vehicle can be inferred from the acceleration data. The phone can also use ambient temperature and ambient light intensity data from the ambient temperature sensor and ambient light sensor to determine whether the scene is indoor or outdoor: indoors is cooler than outdoors in summer and warmer in winter, and indoor light is weaker than outdoor light in the daytime but stronger than outdoors at night when the lights are on.
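A toy sketch of such decision rules is shown below; the thresholds and feature names are invented for illustration and are not values specified by this application.

```python
# Illustrative decision rules mapping raw sensor readings to scene state features.

def scene_state_from_sensors(noise_db: float, lux: float,
                             accel_std: float, place_name: str) -> dict:
    state = {"place": place_name}
    # Loud traffic/construction-level noise suggests outdoors; quieter suggests indoors.
    state["indoor"] = noise_db < 70.0
    # Low light during daytime also hints at an indoor scene.
    state["low_light"] = lux < 200.0
    # Acceleration variance separates "still / walking" from "riding a vehicle".
    state["in_vehicle"] = accel_std > 1.5
    return state

print(scene_state_from_sensors(noise_db=55.0, lux=150.0,
                               accel_std=0.3, place_name="City Mall"))
# -> {'place': 'City Mall', 'indoor': True, 'low_light': True, 'in_vehicle': False}
```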
In a possible implementation of the first aspect, the method further includes: determining the user's motion state according to the scene perception data, the user's motion state being used to determine the characteristics of the scene; wherein the scene perception data includes one or more of heart rate data and blood oxygen data.
For example, the mobile phone can obtain the user's heart rate data, blood oxygen data, and the like through a connected wearable device, such as a smart watch or band, and determine whether the user is exercising: when the user moves or the amount of exercise increases, the heart rate rises and the blood oxygen level also changes considerably. The user's motion data can further narrow down the scene the phone is in; for example, the user's heart rate changes markedly while working out in a gym, so when a translation service is needed there, the phone can identify the fitness scene from the heart rate change and translate the scenarized short text with that scene fused in. The user's current altitude can also be inferred from heart rate or blood oxygen data; for example, during a mountain-climbing activity, the current scene can be determined from the altitude, the position data, and so on, allowing the phone to perform scene-aware translation of scenarized short texts.
In a possible implementation of the first aspect, the method further includes: the order in which the scene tag and the text to be translated are input into the encoder is determined by the degree of correlation between the text content of the text to be translated and the scene tag; the greater the correlation between a piece of text content and the scene tag, the closer together they are placed in the input.
The scene-fused encoded data includes the scene feature information extracted by the encoder from the scene tag during encoding and the text content information extracted from the text to be translated, and the encoder extracts the scene feature information and the text content information in the order in which the scene tag and the text to be translated were input into the encoder.
For example, in a dining scene, because the scene tag is more strongly correlated with the dish names, the scene tag "Restaurant" can be input into the encoder before the dish names in the menu text to be translated. The scene feature information extracted by the encoder during encoding is then closer to the dish names and has a greater influence on their translation.
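A small sketch of this ordering idea follows. The relevance scores are invented for illustration; in practice they would come from heuristics or from the model itself.

```python
# Place the scene tag immediately before the most scene-relevant text segment,
# so that its input distance to that segment is smallest.

def insert_scene_tag(scene_tag: str, segments: list, relevance: list) -> str:
    best = max(range(len(segments)), key=lambda i: relevance[i])
    return " ".join(segments[:best] + [scene_tag] + segments[best:])

menu_line = ["battered whiting", "16.0M | 19.0NM"]   # dish name, then price
print(insert_scene_tag("Restaurant", menu_line, relevance=[0.9, 0.1]))
# -> "Restaurant battered whiting 16.0M | 19.0NM"
```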
In a possible implementation of the first aspect, the method further includes: the decoder selecting, based on the scene feature information, the target-language words corresponding to the text content information to generate a scene-aware translation result.
For example, in a dining scene, the decoder selects the words corresponding to the dish name in the target language based on the dining-scene feature information, and an accurate translation of the dish name is finally obtained.
In a possible implementation of the first aspect, the method further includes: the generation of the scene tag based on the scene state data is implemented by a classifier, and the encoder and the decoder are implemented by a neural network model. The classifier classifies the scene state data through a classification algorithm, which is any one of a gradient boosting decision tree algorithm, a support vector machine algorithm, a logistic regression algorithm, and an Iterative Dichotomiser 3 (ID3) decision tree algorithm; the neural network model includes a neural network machine translation model based on the Transformer network.
For example, a classifier in the mobile phone performs classification statistics on the scene state data, determines the scene in which the phone is located, and generates a scene tag accordingly. When the phone is used for translation, it fuses the scene tag and completes the translation of the scenarized short text through the trained NMT-Transformer translation model.
In a second aspect, the present application provides a readable medium having instructions stored thereon which, when executed on an electronic device, cause the electronic device to perform the fusion scene-aware machine translation method described above.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions for execution by one or more processors of an electronic device, and a processor, which is one of the processors of the electronic device, for performing the fused scene-aware machine translation method described above.
Drawings
Fig. 1 is a schematic view of an application scenario of the fusion scenario-aware machine translation method according to the present application;
FIG. 2 is a schematic diagram illustrating an example of an erroneous translation of a scenarized short text by a current translation device;
FIG. 3 is a schematic diagram illustrating steps of a fusion scene-aware machine translation method according to the present application;
fig. 4 is a schematic diagram of a data conversion flow in the fusion scene-aware machine translation method according to the present application;
FIG. 5 is a schematic diagram illustrating a process of embedding a scene tag in a text to be translated by an encoding process according to the present application;
FIG. 6 is a schematic diagram illustrating an interface comparison of a scenarized short text translation result according to the present application;
FIG. 7 is a schematic diagram illustrating an interface comparison of another scenarized short text translation result according to the present application;
fig. 8 is a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application;
fig. 9 is a block diagram of a software configuration of the cellular phone 100 according to the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the technical solutions of the embodiments of the present application are described in further detail below with reference to the drawings and the embodiments.
Fig. 1 is a schematic view of an application scenario of the method for integrating scene-aware machine translation according to the present application.
As shown in fig. 1, the scenario includes an electronic device 100 and an electronic device 200 that are connected via a network and exchange data, where the electronic device 100 or the electronic device 200 has a machine translation function. The user takes a photo or a short video with the electronic device 100, or directly inputs a text to be translated; a photo or short video taken by the electronic device 100, or a voice instruction given to it, must first be converted into text data to be translated through the device's image-to-text or speech recognition function before translation.
The electronic device 200 may be used to train the NMT-Transformer so that it can fuse scene tags during translation encoding and decoding, and to train the classifier so that it can generate scene tags from scene perception data. The NMT-Transformer and the classifier trained on the electronic device 200 can be ported to the electronic device 100 for use.
The electronic device 100 may translate through its built-in translation function, or complete the translation of the text to be translated by exchanging data with the electronic device 200 via locally installed translation software or an online translation webpage.
The electronic device 100 is a terminal device that interacts with the user and is installed with application software or an application system capable of executing Transformer-based Neural Machine Translation (NMT-Transformer). The electronic device 100 may also run a human-machine interaction system to recognize a user's voice instruction requesting translation, and may have a function for recognizing text in pictures or video so that it can be converted into text data for further translation.
Neural Machine Translation (NMT) is a machine translation method proposed in recent years. Compared with traditional Statistical Machine Translation (SMT), NMT trains a neural network that maps one sequence to another and can output sequences of variable length, which gives it very good performance in translation, dialogue, and text summarization. NMT is essentially an encoder-decoder system: the encoder encodes the source language sequence and extracts its information, and the decoder then converts that information into another language, the target language, completing the translation. The current mainstream machine translation method is the NMT-Transformer, whose core network architecture is the Transformer network; that is, the encoder and decoder of the NMT-Transformer are implemented by the Transformer network. The core of the Transformer network is the self-attention layer, which computes self-attention in a vector space. Self-attention can be understood as a degree of correlation: the self-attention between two strongly correlated vectors is large, while the self-attention between two weakly correlated or unrelated vectors is small or zero.
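The following NumPy sketch shows the scaled dot-product self-attention just described in its simplest form; it omits the learned query/key/value projections and multi-head structure of a real Transformer layer and is intended only to illustrate how correlated tokens receive larger attention weights.

```python
# Minimal scaled dot-product self-attention (Q = K = V = x for simplicity).
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model) token vectors; returns attention-mixed vectors."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                         # pairwise correlation scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over each row
    return weights @ x                                    # weighted mix of value vectors

tokens = np.random.randn(5, 16)      # e.g. one scene-tag token plus 4 text tokens
print(self_attention(tokens).shape)  # (5, 16)
```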
As mentioned above, the current NMT-Transformer translation method suffers from inaccurate translation of scenarized short texts because of the missing context. For example, as shown in FIG. 2, in a restaurant dining scene, a current translation device may mistranslate "battered whiting" into a phrase meaning "cod suffering from re-creation", which has little to do with the real dish name (the correct result is fried cod). In scenarized short text translation, therefore, existing machine translation methods have low accuracy and lead to a poor user experience. Scenarized short texts include, but are not limited to, dish names on a menu, shop names in a mall, and specialized terms on an entry-exit card. To solve this technical problem, the conventional approach is to add extra information from other dimensions, in the decoding or post-processing stage of machine translation, as a substitute for the missing context and thereby improve short-text translation accuracy. This approach, however, is essentially a secondary correction of the translation result: the added extra information is one-dimensional and does not match the text to be translated well, so the accuracy of short-text translation remains low. Moreover, directly adding extra information causes large noise in the information extracted during decoding, and the extra information cannot be used directly in a Transformer network, so this approach is unsuitable for the current mainstream NMT-Transformer.
To solve this technical problem, the present application provides a fusion scene-aware machine translation method: a scene tag is generated from scene perception data collected by the electronic device (for example, position data of the scene, noise data of the scene, picture data of the content to be translated, and the like); the generated scene tag and the text to be translated are fused and encoded together as a source language sequence in the encoding stage of the Transformer network, which extracts the information in the source language sequence; finally, in the decoding stage of the Transformer network, that information is converted into the target language and decoded into a translation result that matches the scene of the text to be translated. By fusing the scene tag with the text to be translated, the present application greatly improves the translation accuracy of scenarized short texts.
An electronic device 100 or 200 using the fusion scene-aware machine translation method can generate a corresponding scene tag from scene perception data collected by the electronic device 100 in real time, embed the scene tag into the text to be translated as part of the source language for translation encoding and decoding, and finally obtain a short-text translation result that matches the scene in which the electronic device 100 is located.
Compared with the traditional approach of improving short-text translation accuracy by adding extra information in place of context in the decoding stage, the scheme of the present application fuses scene perception data comprehensively, so that the generated scene tag participates in both the encoding and decoding stages of machine translation throughout, and the improvement in translation accuracy is achieved stably. Compared with the extra information added in the prior art, the scene tags fused in this scheme are more diverse and more accurate, and the loss of a particular piece of scene perception data (for example, because a sensing element on the electronic device 100 is not enabled) does not affect the determination of the scene tag. The scheme also avoids the large noise in the decoded information caused by adding extra information directly in the decoding stage, thereby improving the user experience.
It is understood that in the present application, the electronic device 100 includes, but is not limited to, a laptop computer, a tablet computer, a cell phone, a wearable device, a head-mounted display, a server, a mobile email device, a portable game player, a portable music player, a reader device, a television having one or more processors embedded or coupled therein, or other electronic devices capable of accessing a network. The electronic device 100 can collect scene perception data through its own sensors, Global Positioning System (GPS), camera, and the like, and can also be used to train the classifier so that it can generate a scene tag from the scene perception data.
It is understood that the electronic device 200 includes, but is not limited to, a cloud, a server, a laptop computer, a desktop computer, a tablet computer, and other network-accessible electronic devices having one or more processors embedded or coupled therein.
For convenience of description, the following describes the technical solution of the present application in detail by taking the electronic device 100 as a mobile phone and the electronic device 200 as a server as examples.
The specific process of the scheme of the present application is described in detail below with reference to fig. 3. As shown in fig. 3, the fusion scene-aware machine translation method of the present application includes the following steps:
301: the mobile phone 100 obtains a text to be translated and scene sensing data, and obtains scene state data based on the scene sensing data.
The text to be translated may be input directly through an input interface of the mobile phone 100, obtained by taking a photo or video with the mobile phone 100, or obtained by recognizing and converting a user's voice instruction, which is not limited here.
For example, after the mobile phone 100 takes a photo or video, it extracts the text information in the photo or in frames captured from the video through its own image recognition system and converts it into the text to be translated.
Or the mobile phone 100 obtains a voice instruction from the user; for example, the user may issue a voice instruction by waking up the voice assistant, and the mobile phone 100 recognizes the text information in the instruction through its own human-machine dialogue system and converts it into the text to be translated.
The scene perception data may be images, sounds, or other signals collected by the various detection elements of the mobile phone 100, such as its camera, microphone, infrared sensor, and depth sensor.
Fig. 4 is a schematic diagram of the data conversion flow in the fusion scene-aware machine translation method of the present application. As shown in fig. 4, the mobile phone 100 may collect scene perception data through its microphone, gyroscope, acceleration sensor, GPS, and Computer Vision (CV). It may also obtain health status data through the health sensors of a connected smart wearable device such as a band or watch (for example, heart rate data from a PPG sensor or blood oxygen data from a blood oxygen sensor), or step-count data from the band or watch, as part of the scene perception data, which is not limited here.
In addition, the same scene perception data may be acquired in more than one way; for example, position information may be acquired by GPS or derived from Wi-Fi positioning, which is not limited here.
Further, as shown in fig. 4, the mobile phone 100 may analyze scene state data from the collected scene perception data. One or more decision rules for the scene state data may be preset in the mobile phone 100.
As an example of a decision rule, the mobile phone 100 may determine whether it is indoors or outdoors according to the type or level of noise. For example, the phone may treat noise levels in the range 2-4 as the indoor range and levels in the range 4-6 as the outdoor range. When the sound picked up by the microphone is recognized as level-2 noise, the scene state data derived by the phone is indoor noise; when it is recognized as level-5 noise, the derived scene state data is outdoor noise. The mobile phone 100 may classify building noise (e.g., construction noise from road-work equipment) and traffic noise (e.g., car horns, engine sound, and tire-road friction) as outdoor noise, classify broadcast sound in public places such as malls, airports, and platforms (e.g., announcements, music playback) as indoor noise, and classify living noise (e.g., the sound of mahjong tiles, noise in other entertainment venues) as indoor noise. Noise types are identified from the different frequencies and voiceprints of the noise; in some embodiments, the mobile phone 100 may simply use the frequency range of the sound as the basis for judging the noise type, which is not limited here. In this way, the mobile phone 100 can obtain, as scene state data, whether it is in an indoor or outdoor scene by analyzing the noise type or level of the sound collected by the microphone.
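A toy version of this noise rule is sketched below; the level ranges follow the example above (levels 2-4 indoor, 4-6 outdoor), while the decibel-to-level bucketing and the noise-type lists are invented placeholders.

```python
# Illustrative noise-based indoor/outdoor rule.

def noise_level(db: float) -> int:
    # Hypothetical bucketing of measured loudness into discrete noise levels 1..6.
    return min(6, max(1, int(db // 15)))

def scene_from_noise(db: float, noise_type: str = "") -> str:
    indoor_types = {"mall announcement", "music playback", "mahjong"}
    outdoor_types = {"construction", "traffic"}
    if noise_type in indoor_types:
        return "indoor"
    if noise_type in outdoor_types:
        return "outdoor"
    return "indoor" if noise_level(db) <= 4 else "outdoor"   # fall back to level ranges

print(scene_from_noise(40.0))                        # level 2 -> indoor
print(scene_from_noise(80.0, noise_type="traffic"))  # traffic noise -> outdoor
```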
As another example of a decision rule, the mobile phone 100 may determine whether the current scene is a mall, a station, or the like from GPS position information and online map data; it may also photograph the target to be translated (text, pictures, etc., such as a menu) based on the GPS position data and CV and derive scene state data such as being in a restaurant or a restaurant inside a certain shop.
In other embodiments, as an example of the determination rule, the mobile phone 100 may further obtain scene state data such as a running and riding state, step counting data, and a motion trail based on a gyroscope of the mobile phone, an acceleration sensor, position data measured by a GPS, heart rate data measured by a wearable device connected to the mobile phone 100, and the like.
In other embodiments, as another example of a decision rule, the mobile phone 100 may infer from CV images the vehicle the user is currently riding; for example, if the CV images show subway seats, a subway station-announcement display, bus seats, or a route map posted inside a bus, the mobile phone 100 may determine that the scene is riding the subway or riding a bus.
It is understood that, when analyzing the scene state data, if some scene perception data is missing, the mobile phone 100 may call on other scene perception data in its place to determine the scene state data. For example, when GPS is not enabled and position data cannot be collected, the mobile phone 100 may derive the scene state data from the sound collected by the microphone, environmental characteristics collected by the infrared sensor, and the like.
The scene state data may be derived from one or more pieces of scene perception data. In general, if the scene is simple and easy to distinguish, the mobile phone 100 can determine the scene state data from less scene perception data; for a station or airport scene, for example, fusing only the position information acquired by GPS, or the types of sound in the station picked up by the microphone, may be enough to determine the basic scene state. If the scene is complex and hard to distinguish, the mobile phone 100 may need to fuse multiple kinds of perception data to determine the basic scene state comprehensively, which is not limited here.
It can be understood that the different manners for acquiring the text to be translated may correspond to different scene awareness data and manners for acquiring the scene awareness data, which are not limited herein.
For example, when the mobile phone 100 obtains the text to be translated by taking a photo or video, the scene perception data it collects may include scene feature image data (e.g., image data collected by CV), position data (e.g., angular motion data from the gyroscope, position data from GPS), sound data (e.g., sound collected by the microphone), and so on.
When the text to be translated is input manually, the camera of the mobile phone 100 does not need to be, and typically is not, turned on, so the phone may not collect image data through CV. The scene perception data it collects may then include position data (e.g., angular motion data from the gyroscope, position data from GPS), sound data (e.g., sound collected by the microphone), motion data (e.g., heart rate data from a smart watch or band, acceleration data from the acceleration sensor), environmental data (e.g., ambient temperature from the temperature sensor, ambient light intensity from the ambient light sensor), and so on. It is understood that in some scenarios the camera of the mobile phone 100 may still collect CV signals in the background even though no camera interface is shown on its screen.
When the text to be translated is obtained through a voice instruction, the camera of the mobile phone 100 likewise does not need to be turned on, and to avoid interfering with the sound data being collected, the phone may not collect image data through CV. The scene perception data it collects may then include position data (e.g., angular motion data from the gyroscope, position data from GPS), the user's motion state data (e.g., heart rate and blood oxygen data from a smart watch or band), environmental data (e.g., ambient temperature from the temperature sensor, ambient light intensity from the ambient light sensor), and so on. It is understood that in some scenarios the camera of the mobile phone 100 may still collect CV signals in the background even though no camera interface is shown on its screen.
It can be understood that the device that collects the scene perception data and the device that determines the basic scene state may be the same electronic device (for example, the mobile phone 100 may both collect the scene perception data and analyze the scene state data directly), or different electronic devices; for example, the scene perception data collected by the mobile phone 100 may be sent to the server 200 for further analysis into scene state data, or scene perception data may be collected by a smart wearable device such as a watch or band and sent to the mobile phone 100 for further analysis into scene state data, which is not limited here.
302: the handset 100 generates a scene tag based on the obtained scene state data.
Specifically, the mobile phone 100 classifies the scene state data obtained from the analysis and labels it with scene tags; different scene state data may correspond to the same scene tag. It can therefore be understood that the correspondence between scene state data and scene tags is many-to-one or one-to-one.
The mobile phone 100 may generate the scene tags from the scene state data with a pre-trained scene classifier. For example, the mobile phone 100 may input the scene state data into a Gradient Boosting Decision Tree (GBDT) classifier for classification training and label scene state data in the same or similar categories with the same scene tag. The scene state data may be sample data collected specifically for training the classifier, or data obtained from analysis during actual machine-translation use; it accumulates over time into a scene state database, and the corresponding scene tags accumulate over time into a scene tag database. Because the classifier algorithm occupies relatively little storage, classifier training can be performed on an electronic device such as the mobile phone 100 or completed on the server 200, which is not limited here.
The GBDT classifier uses the GBDT algorithm, one of the conventional machine learning algorithms that best fits the true distribution, and can be used for classification, regression, and feature selection. The principle of the GBDT algorithm is that, over multiple iterations, each iteration produces a weak classifier that is trained on the residual of the previous one. The weak classifiers are generally required to be simple, with low variance and high bias, because training improves the accuracy of the final classifier by continually reducing the bias. The decision tree used by the GBDT algorithm is the CART regression tree. During training, the GBDT classifier classifies the scene state data, after which the same scene tag can be manually assigned, through the mobile phone 100 or the server 200, to scene state data in the same or similar categories; the many scene tags obtained by training on a large amount of scene state sample data form a scene tag database. For example, the mobile phone 100 may derive scene state data from GPS data (a known restaurant or a certain mall), from the noise type of the sound collected by the microphone (e.g., indoor noise), and from photos of the target object and the surrounding environment taken with CV (whose analysis may indicate a menu); combining these pieces of scene state data, the current scene can be determined to be menu translation in a restaurant, so "restaurant" can be assigned as the scene tag.
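The sketch below illustrates this kind of training with scikit-learn's GradientBoostingClassifier as a stand-in for the GBDT classifier described; the features, samples, and labels are toy values invented for illustration, not data from this application.

```python
# Illustrative GBDT scene-classifier training sketch (scikit-learn).
from sklearn.ensemble import GradientBoostingClassifier

# Each row: [is_indoor, noise_level, near_restaurant_poi, menu_like_image]
X = [[1, 2, 1, 1],   # quiet indoor, restaurant POI, menu photographed
     [1, 3, 1, 1],
     [0, 5, 0, 0],   # noisy outdoor, no dining cues
     [1, 2, 0, 0],   # indoor but no dining cues (e.g. airport lounge)
     [0, 4, 0, 0]]
y = ["restaurant", "restaurant", "street", "airport", "street"]

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
clf.fit(X, y)

scene_tag = clf.predict([[1, 2, 1, 1]])[0]
print(scene_tag)   # expected: "restaurant"
```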
It is to be understood that the mobile phone 100 or the server 200 may also train other classification algorithm models to generate scene tags corresponding to the scene state data. Other classification algorithms include, but are not limited to, the Support Vector Machine (SVM) algorithm, the Logistic Regression (LR) algorithm, the Iterative Dichotomiser 3 (ID3) decision tree algorithm, and the like, which are not limited here.
303: the mobile phone 100 encodes the scene tag and the text to be translated together as a source language sequence through an encoder in the NMT-Transformer, and extracts the scene in the source language sequence and the text information to be translated.
Specifically, the scene tag and the text to be translated are input into the encoder of the NMT-Transformer (the encoding layers of the Transformer network) for encoding. The encoding layers of the Transformer network are implemented by multiple layers of self-attention networks, and the attention vectors output by each self-attention layer serve as the input of the next layer.
When the scene tag and the text to be translated are input into the encoder for encoding, two aspects of embedding the scene tag into the text to be translated must be considered: first, the content of the scene tag should be strongly correlated with the type of text in the text to be translated; second, the position at which the scene tag is embedded should be closer to the text content with which it is more strongly correlated.
Fig. 5 shows the process of embedding a scene tag into a text to be translated during encoding. As shown in Fig. 5, in a menu translation scenario, X5, X6, X7, X8 represent the words that constitute the scene tag, and X9, X10, X11, X12 represent the words that constitute the text to be translated. The text to be translated is the text data converted from the photographed menu, and the scene tag is "Restaurant". The dish names in the menu have high relevance to the scene tag, while the dish prices have low or no relevance to it. Therefore, when the scene tag and the text to be translated are input into the Transformer network, the scene tag should be placed before the dish-name text so that it is closer to the dish names; for example, if each line of the menu reads "dish name + price + dish introduction", the scene tag "Restaurant" can be input before each line of menu text. The closer a piece of text is to the scene tag, the more scene information the Transformer network extracts from the tag while encoding that text, i.e., the greater the attention of the scene tag to that text, represented in Fig. 5 by the strong-correlation curve; conversely, the farther a piece of text is from the scene tag, the less scene information is extracted while that text is encoded, i.e., the smaller the attention of the scene tag to that text, represented by the weak-correlation curve.
For example, for the text "battered whiting 16.0M | 19.0NM" on a menu, with "Restaurant" input before it, the source language sequence fed into the Transformer network is: "Restaurant battered whiting 16.0M | 19.0NM". "Restaurant" is close to "battered whiting", so its attention to "battered whiting" is large and its influence on the translation result is large; under the influence of "Restaurant", "battered whiting" is translated as fried cod instead of being mistranslated as before. "Restaurant" is farther from "16.0M | 19.0NM", so its attention to that part is small, its influence on that part of the translation is small, and that part of the translation result remains unchanged.
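As a sketch of this fused encoding, the snippet below prepends the scene tag to the menu line and runs the combined sequence through PyTorch's TransformerEncoder, used here only as a stand-in for the NMT-Transformer encoder; the toy vocabulary and dimensions are invented for illustration.

```python
# Fused encoding sketch: scene tag + text as one source sequence.
import torch
import torch.nn as nn

vocab = {"<restaurant>": 0, "battered": 1, "whiting": 2, "16.0M": 3, "|": 4, "19.0NM": 5}
d_model = 32

embed = nn.Embedding(len(vocab), d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# The scene tag is prepended so it participates in self-attention with every
# token of the line to be translated.
source = ["<restaurant>", "battered", "whiting", "16.0M", "|", "19.0NM"]
ids = torch.tensor([[vocab[w] for w in source]])   # shape (1, seq_len)
memory = encoder(embed(ids))                       # shape (1, seq_len, d_model)
print(memory.shape)                                # torch.Size([1, 6, 32])
```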
304: the mobile phone 100 decodes the scene information extracted from the scene tag and the text to be translated in the encoding stage and the text information to be translated into a translation expressed in the target language according to words by a decoder in the NMT-Transformer, and outputs a translation result.
Specifically, the decoder (the decoding layers of the Transformer network) selects the target language based on the scene information extracted by the encoder during encoding and decodes it into the translated text. The decoding layers of the Transformer network are likewise implemented by multiple layers of self-attention networks.
It can be understood that, in the encoding stage of the NMT-Transformer, the Transformer network extracts the scene information in the embedded scene tag at the same time as it extracts the to-be-translated text information, so that in the decoding stage the Transformer network can select the target-language words directly based on that scene information and thereby obtain a translation that better fits the scene.
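The following self-contained sketch uses PyTorch's TransformerDecoder as a stand-in for the decoder described above and performs word-by-word greedy decoding over the fused encoder output; the target vocabulary and the random memory tensor are invented, and an untrained model will of course produce arbitrary tokens.

```python
# Greedy word-by-word decoding sketch against a fused encoder output.
import torch
import torch.nn as nn

tgt_vocab = ["<bos>", "<eos>", "炸", "鳕鱼", "16.0M", "|", "19.0NM"]
d_model = 32

embed = nn.Embedding(len(tgt_vocab), d_model)
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
project = nn.Linear(d_model, len(tgt_vocab))

memory = torch.randn(1, 6, d_model)   # stands in for the fused scene + text encoding

tokens = [0]                          # start from <bos>
for _ in range(10):                   # greedy decoding, one word at a time
    tgt = embed(torch.tensor([tokens]))
    out = decoder(tgt, memory)                        # attends to the fused memory
    next_id = int(project(out[:, -1]).argmax(dim=-1))
    tokens.append(next_id)
    if next_id == 1:                  # stop at <eos>
        break
print([tgt_vocab[i] for i in tokens[1:]])
```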
Fig. 6 is a schematic interface comparison of a scenarized short-text translation result according to the present application. As shown in fig. 6(a), on the translation result interface of a conventional translation apparatus or device, the dish name "Fisherman's Basket" is decoded literally as a fisherman's basket, which is obviously a wrong translation result; as shown in fig. 6(b), on the translation result interface of a device using the fusion scene-aware machine translation method of the present application, based on the scene information extracted from the scene tag ("Restaurant"), the dish "Fisherman's Basket" is decoded as a seafood platter, which is the correct translation result.
Another implementation scenario is described below in conjunction with fig. 7.
Fig. 7 is a schematic interface comparison of another scenarized short-text translation result according to the present application. As shown in fig. 7, it displays the translation results for an entry card filled in during an entry-exit scene. The text to be translated shown in fig. 7(a) is the source-language text of the entry card (a customs declaration card); fig. 7(b) shows the translation result of a conventional translation device; fig. 7(c) shows the translation result of a translation device using the fusion scene-aware machine translation method of the present application.
With reference to fig. 3 and its description, translating with the fusion scene-aware machine translation method of the present application to obtain the result shown in fig. 7(c) includes the following steps:
s1: the method comprises the steps of obtaining a text to be translated and scene perception data, and obtaining scene state data based on the scene perception data.
Specifically, on the one hand, the camera of the mobile phone 100 is opened to photograph the entry card, and the mobile phone 100 extracts the text to be translated from the photographed card through its own image recognition system.
On the other hand, the mobile phone 100 collects position data by GPS, sound data by microphone, environmental feature image data by CV, and the like as scene perception data, and derives the corresponding scene state data from it: for example, judging from the GPS position data whether the current or nearby labeled geographic position is an airport, a customs checkpoint, or the like; judging from the sound collected by the microphone whether the environment is indoor or outdoor; and judging from the CV-collected environmental images whether there are check-in windows, registration forms, and the like. If CV use is restricted in some entry-exit settings, the image data can be replaced by other scene perception data so that the scene state data can still be obtained. The amount and type of scene perception data collected by the mobile phone 100 are not limited here. For details, refer to step 301 and its description above, which are not repeated here.
S2: the scene label is generated as entry/exit based on the scene state data obtained in S1. The trained classifier in the mobile phone 100 can directly and quickly generate the entry and exit scene labels according to the obtained scene state data. Specifically, refer to step 302 and the related description above, which are not repeated herein.
S3: the mobile phone 100 encodes the generated entry and exit scene tag and the text to be translated together as a source language sequence through an encoder in the NMT-Transformer, and extracts entry and exit scene information and text information to be translated in the source language sequence. For the specific encoding process, reference is made to step 303 and the related description, which are not described herein again.
S4: the mobile phone 100 decodes entry and exit scene information and text information to be translated extracted from the entry and exit scene tag and the text to be translated in the encoding stage by words through a decoder in the NMT-Transformer to obtain a translation expressed in the target language, and outputs a translation result, as shown in fig. 7 (c). For the specific decoding process, reference is made to step 304 and the related description, which are not repeated herein.
In the translation result shown in FIG. 7(c), the specialized wording on the entry card is handled accurately; for example, "Please print in capital letters" is rendered as an instruction to fill in using capital letters, with "print" correctly translated as "fill in".
In contrast, in the conventional translation result shown in fig. 7(b), "print" in the same sentence is rendered literally as printing, which is obviously wrong. The translation result after fusing the scene tag (the entry-exit scene) is therefore more accurate, and the user experience is better.
As described above, in practical applications, an application embodying the fusion scene-aware machine translation method of the present application may be built into the mobile phone 100 to translate scenarized short texts accurately; the mobile phone 100 may also install application software that interacts with the server 200, sending the text to be translated to the server 200, which completes the translation based on the fusion scene-aware machine translation method and feeds the result back to the mobile phone 100; or the mobile phone 100 may access a web-based translation engine through its browser, send the text to be translated to the server 200 through interaction with it, and receive the translation result back from the server 200 after it completes the translation based on the fusion scene-aware machine translation method, which is not limited here.
An exemplary structure of the electronic device 100 is given below in connection with an embodiment of the present application.
Fig. 8 shows a schematic structural diagram of the mobile phone 100 according to an embodiment of the present application.
The mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identification Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the mobile phone 100. In other embodiments of the present application, the handset 100 may include more or fewer components than shown, or some components may be combined, some components may be separated, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Among other things, processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
In an embodiment of the present application, the mobile phone 100 may train the scene classifier and the encoder and decoder of the NMT-Transformer through the processor 110. In an actual scenarized short-text translation process, the processor 110 processes the scene perception data and the text to be translated acquired by the mobile phone 100 and executes the fusion scene-aware machine translation method described in steps 301 to 304 above.
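As a rough, non-authoritative sketch of this fused encoding step, the Python snippet below prepends a scene label token to the text to be translated and feeds the joint source sequence to a generic Transformer encoder-decoder. The toy vocabulary, the tag name "<scene:immigration>", and the model sizes are illustrative assumptions; they do not reproduce the patent's actual NMT-Transformer or its training.

```python
# Minimal sketch (assumptions throughout): a scene label is prepended to the
# source tokens so that the encoder fuses scene information with the text.
import torch
import torch.nn as nn

# Toy vocabulary; "<scene:immigration>" plays the role of the generated scene label.
vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<scene:immigration>": 3,
         "please": 4, "print": 5, "in": 6, "capital": 7, "letters": 8}
d_model = 32

embed = nn.Embedding(len(vocab), d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

# Source language sequence = scene label + text to be translated.
src_tokens = ["<scene:immigration>", "please", "print", "in", "capital", "letters"]
src = torch.tensor([[vocab[t] for t in src_tokens]])   # shape (1, S)
tgt = torch.tensor([[vocab["<bos>"]]])                 # decoder start token

out = model(embed(src), embed(tgt))                    # encoding fused with the scene label, then decoding
print(out.shape)                                       # torch.Size([1, 1, 32])
```

In a full system the decoder output would be projected onto the target-language vocabulary and decoded autoregressively; the only point illustrated here is that the scene label enters the encoder together with the text as a single source language sequence.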
In some embodiments, processor 110 may include one or more interfaces.
It should be understood that the connection relationship between the modules according to the embodiment of the present invention is only an exemplary illustration, and does not limit the structure of the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the cell phone 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.
The wireless communication function of the mobile phone 100 can be realized by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like. The mobile phone 100 communicates with the server 200 and transmits data through the wireless communication function.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the handset 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution including wireless communication of 2G/3G/4G/5G, etc. applied to the handset 100. The wireless communication module 160 may provide solutions for wireless communication applied to the mobile phone 100, including Wireless Local Area Networks (WLANs), such as wireless fidelity (Wi-Fi) networks, Bluetooth (BT), Global Navigation Satellite Systems (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like.
In some embodiments, the antenna 1 of the handset 100 is coupled to the mobile communication module 150 and the antenna 2 is coupled to the wireless communication module 160 so that the handset 100 can communicate with networks and other devices through wireless communication techniques. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).
The mobile phone 100 implements a display function through the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information. The mobile phone 100 displays the collected images or characters of the scenarized short text to be translated through the display screen 194, and the translation result of the text to be translated is also displayed on the display screen 194 and fed back to the user.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The cell phone 100 may include 1 or N display screens 194, with N being a positive integer greater than 1.
The SIM card interface 195 is used to connect a SIM card.
The mobile phone 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like. CV signal acquisition by the mobile phone 100 can also be realized through the shooting function, that is, an image of the scene where the mobile phone is located or an image of the text to be translated is acquired through the shooting function.
The camera 193 is used to capture still images or video. An object generates an optical image through the lens, and the optical image is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing, and the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the mobile phone 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
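Purely as an illustration of how such a captured image could yield characteristic characters for scene perception, the snippet below runs an off-the-shelf OCR step over a photo. The library choice (pytesseract) and the file name are assumptions; the patent does not prescribe any particular OCR implementation.

```python
# Illustrative only: extract candidate characteristic words from a captured image.
# "menu_photo.jpg" is a hypothetical file captured via camera 193.
from PIL import Image
import pytesseract

image = Image.open("menu_photo.jpg")
text = pytesseract.image_to_string(image, lang="eng")
keywords = [w.lower() for w in text.split() if w.isalpha()]
print(keywords[:10])   # characteristic words that may hint at the scene (e.g., dish names)
```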
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the mobile phone 100. The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like. The data storage area may store data (e.g., audio data, a phonebook, etc.) created during use of the mobile phone 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like. The processor 110 executes various functional applications and data processing of the mobile phone 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
In an embodiment of the present application, the processor 110 executes the fused scene aware machine translation method of the present application by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
The mobile phone 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The microphone 170C, also referred to as a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can input a sound signal to the microphone 170C by speaking with the mouth close to the microphone 170C. The mobile phone 100 may be provided with at least one microphone 170C. In other embodiments, the mobile phone 100 may be provided with two microphones 170C to implement a noise reduction function in addition to collecting sound signals. In other embodiments, the mobile phone 100 may further be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement a directional recording function, and the like. In the implementation of the present application, sound signals may be collected by the microphone 170C, and the noise level or type of noise, etc. may be determined from the collected sound signals to further derive scene state data, such as whether the scene is indoors or outdoors.
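A minimal sketch of such a sound-based scene cue is given below; the decibel threshold and the two labels are assumptions made for illustration, not values taken from this application.

```python
# Illustrative sketch: derive a coarse scene-state cue from microphone samples.
import numpy as np

def noise_state(samples: np.ndarray, eps: float = 1e-12) -> str:
    """Return a coarse indoor/outdoor cue from raw audio samples in [-1, 1]."""
    rms = np.sqrt(np.mean(np.square(samples)) + eps)
    level_db = 20 * np.log10(rms + eps)        # level in dBFS (relative to full scale)
    return "outdoor_noisy" if level_db > -20 else "indoor_quiet"

print(noise_state(np.random.uniform(-0.5, 0.5, 16000)))   # ~1 s of simulated audio
```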
The headphone interface 170D is used to connect a wired headphone.
The gyroscope sensor 180B may be used to determine the motion attitude of the mobile phone 100. In some embodiments, the angular velocities of the mobile phone 100 about three axes (i.e., the x, y, and z axes) may be determined through the gyroscope sensor 180B. The gyroscope sensor 180B may be used for image-stabilized shooting. Illustratively, when the shutter is pressed, the gyroscope sensor 180B detects the shake angle of the mobile phone 100, calculates the distance that the lens module needs to compensate according to the shake angle, and allows the lens to counteract the shake of the mobile phone 100 through reverse movement, thereby achieving anti-shake. The gyroscope sensor 180B may also be used in navigation and somatosensory gaming scenarios.
The acceleration sensor 180E can detect the magnitude of the acceleration of the mobile phone 100 in various directions (typically along three axes). The magnitude and direction of gravity can be detected when the mobile phone 100 is stationary. It can also be used to identify the attitude of the mobile phone and is applied to landscape/portrait switching, pedometers, and other applications. In the implementation of the present application, some scene state data, such as the running or riding state of the user, may be obtained based on the shake state data measured by the gyroscope sensor 180B and the acceleration data measured by the acceleration sensor 180E.
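The following sketch shows one way such a motion state could be derived from accelerometer and gyroscope samples; the thresholds and state names are illustrative assumptions rather than values specified in this application.

```python
# Illustrative sketch: classify a coarse motion state from inertial samples.
import numpy as np

def motion_state(accel: np.ndarray, gyro: np.ndarray) -> str:
    """accel, gyro: arrays of shape (n, 3) in m/s^2 and rad/s respectively."""
    accel_var = np.var(np.linalg.norm(accel, axis=1))   # variance of acceleration magnitude
    gyro_mean = np.mean(np.abs(gyro))                   # mean absolute angular rate
    if accel_var < 0.05 and gyro_mean < 0.05:
        return "stationary"
    if accel_var < 2.0:
        return "walking"
    return "running_or_riding"

accel = np.random.normal([0.0, 0.0, 9.8], 0.01, (100, 3))   # a nearly still device
gyro = np.random.normal(0.0, 0.01, (100, 3))
print(motion_state(accel, gyro))                             # expected: "stationary"
```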
The distance sensor 180F is used to measure distance. The mobile phone 100 may measure distance by infrared or laser. In some embodiments, in a shooting scenario, the mobile phone 100 may use the distance sensor 180F to measure distance for fast focusing.
The ambient light sensor 180L is used to sense the ambient light brightness. The mobile phone 100 may adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the mobile phone 100 is in a pocket, to prevent accidental touches. In the implementation of the present application, scene state data may be derived from the ambient light brightness sensed by the ambient light sensor 180L, for example, to determine whether the current scene is indoor or outdoor.
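Taking the above sensor-derived scene state data together (for example light level, noise level, motion state, and characteristic words), the classification-statistics step can be illustrated with a small gradient boosting tree classifier, one of the classifier options mentioned later in the claims. The feature encoding, the tiny training set, and the scene labels below are assumptions made purely for illustration.

```python
# Illustrative sketch: classify scene state features into a scene label.
from sklearn.ensemble import GradientBoostingClassifier

# Assumed features: [ambient_light_lux, noise_level_db, motion_code, has_travel_keyword]
X_train = [
    [12000, -15, 0, 1],   # bright, noisy, stationary, travel-related text nearby
    [200,   -35, 0, 0],   # dim, quiet, stationary
    [15000, -18, 2, 0],   # bright, noisy, riding
    [150,   -40, 1, 0],   # dim, quiet, walking
]
y_train = ["entry_exit", "restaurant", "street", "museum"]

clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[11000, -16, 0, 1]]))   # likely ['entry_exit'] with these toy data
```

The predicted label would then be turned into the scene label that is prepended to the text to be translated, as sketched earlier.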
The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The mobile phone 100 may receive key input and generate key signal input related to user settings and function control of the mobile phone 100.
The software system of the mobile phone 100 may adopt a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present invention uses an Android system with a layered architecture as an example to exemplarily illustrate a software structure of the mobile phone 100.
Fig. 9 is a block diagram of the software configuration of the cellular phone 100 according to the embodiment of the present invention.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 9, the application package may include applications such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 9, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, capture screenshots, and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the handset 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages that disappear automatically after a short stay without user interaction, for example a notification of download completion or a message alert. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as a notification of an application running in the background, or present notifications on the screen in the form of a dialog window. For example, it may prompt a text message in the status bar, emit a prompt tone, vibrate the mobile phone 100, flash the indicator light, and so on.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules, for example: a surface manager, media libraries, a three-dimensional graphics processing library (e.g., OpenGL ES), a 2D graphics engine (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files, among others. The media library may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, and the like.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, composition, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
The following describes an exemplary workflow of the software and hardware of the mobile phone 100 in conjunction with a menu translation scenario.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into a raw input event (including an operation of opening the translation software, turning on the camera 193, or the like). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation as a touch click operation and the control corresponding to the click operation as the control of the camera application icon as an example, the camera application calls an interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video of the menu to be translated through the camera 193.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one example embodiment or technology disclosed herein. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment.
The present disclosure also relates to an operating device for performing the method. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, each of which may be coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform one or more method steps. The structure for a variety of these systems is discussed in the description that follows. In addition, any particular programming language sufficient to implement the techniques and embodiments disclosed herein may be used. Various programming languages may be used to implement the present disclosure as discussed herein.
Moreover, the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, the present disclosure is intended to be illustrative, but not limiting, of the scope of the concepts discussed herein.

Claims (13)

1. A fusion scene-aware machine translation method for an electronic device with machine translation functionality, the method comprising:
acquiring a text to be translated and scene perception data, wherein the scene perception data is acquired by the electronic equipment and is used for determining the scene of the electronic equipment;
determining the scene of the electronic equipment according to the scene perception data;
generating a scene label corresponding to a scene where the electronic equipment is located based on the scene;
the scene label and the text to be translated are jointly used as a source language sequence to be input into an encoder for translation for encoding, and encoded data fused with scene perception are obtained;
and decoding and converting the encoded data fused with scene perception into a target language through a decoder for translation, to obtain a translation result fused with scene perception.
2. The method of claim 1, wherein determining the scene in which the electronic device is located according to the scene awareness data comprises:
determining the characteristics of the scene where the electronic equipment is located according to the scene perception data to obtain scene state data, wherein the scene state data is used for representing the scene where the electronic equipment is located; and
carrying out classification statistics on the scene state data, and determining the scene where the electronic equipment is located.
3. The method of claim 2, wherein the scene awareness data is collected by a detection element disposed in the electronic device, the detection element comprising at least one of a GPS element, a camera, a microphone, and a sensor.
4. The method of claim 3, wherein the scene awareness data includes one or more of location data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
5. The method of claim 4, wherein determining characteristics of a scene in which the electronic device is located based on the scene awareness data comprises one or more of:
determining the position name of the scene according to the position data;
determining characteristic characters or characteristic objects in the scene according to one or more of characters and target objects in the image data, and determining environmental characteristics of the scene;
determining a noise type or a noise level in the scene according to one or more of the frequency, voiceprint, and amplitude in the sound data, and determining whether the scene is indoor or outdoor;
determining the motion state of the electronic equipment in the scene according to the acceleration data and the angular motion data;
and determining the temperature level and the light intensity level of the scene according to the ambient temperature data and the ambient light intensity data, and determining whether the scene is indoor or outdoor.
6. The method of claim 5, wherein determining the scene in which the electronic device is located according to the scene awareness data further comprises:
determining a user motion state according to the scene perception data, wherein the user motion state is used for determining the characteristics of the scene;
wherein the context awareness data comprises one or more of heart rate data, blood oxygen data.
7. The method of claim 6, wherein the order in which the scene label and the text to be translated are input into the encoder is determined based on a correlation between text content in the text to be translated and the scene label;
the larger the correlation between the text content in the text to be translated and the scene label, the closer the input distance between the text content in the text to be translated and the scene label.
8. The method according to claim 7, wherein the encoded data fused with scene perception comprises scene feature information in the scene label and text content information in the text to be translated, both extracted by the encoder during encoding, and
the encoder extracts the scene feature information and the text content information according to the order in which the scene label and the text to be translated are input into the encoder.
9. The method of claim 8, wherein the decoder selects a word in the target language corresponding to the text content information based on the scene feature information, and generates the fused scene-aware translation result.
10. The method of claim 9, wherein the generating of the scene label based on the scene state data is implemented by a classifier, and wherein the encoder and the decoder are implemented by a neural network model.
11. The method according to claim 10, wherein the classifier implements classification calculation of the scene state data through a classification algorithm, the classification algorithm comprising any one of a gradient boosting tree classification algorithm, a support vector machine algorithm, a logistic regression algorithm, and an ID3 (Iterative Dichotomiser 3) decision tree algorithm;
the neural network model comprises a neural network machine translation model based on a Transformer network.
12. A readable medium having stored thereon instructions which, when executed on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 11.
13. An electronic device, comprising:
a memory for storing instructions for execution by one or more processors of the electronic device, an
A processor, being one of the processors of an electronic device, for performing the method of any one of claims 1 to 11.
CN202011079936.XA 2020-10-10 2020-10-10 Fusion scene perception machine translation method, storage medium and electronic equipment Pending CN114330374A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011079936.XA CN114330374A (en) 2020-10-10 2020-10-10 Fusion scene perception machine translation method, storage medium and electronic equipment
PCT/CN2021/119655 WO2022073417A1 (en) 2020-10-10 2021-09-22 Fusion scene perception machine translation method, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011079936.XA CN114330374A (en) 2020-10-10 2020-10-10 Fusion scene perception machine translation method, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114330374A true CN114330374A (en) 2022-04-12

Family

ID=81032960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011079936.XA Pending CN114330374A (en) 2020-10-10 2020-10-10 Fusion scene perception machine translation method, storage medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN114330374A (en)
WO (1) WO2022073417A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821257A (en) * 2022-04-26 2022-07-29 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11645478B2 (en) * 2020-11-04 2023-05-09 Adobe Inc. Multi-lingual tagging for digital images
CN115312029B (en) * 2022-10-12 2023-01-31 之江实验室 Voice translation method and system based on voice depth characterization mapping

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391839A (en) * 2014-11-13 2015-03-04 百度在线网络技术(北京)有限公司 Method and device for machine translation
KR102357322B1 (en) * 2016-05-06 2022-02-08 이베이 인크. Using meta-information in neural machine translation
US11138392B2 (en) * 2018-07-26 2021-10-05 Google Llc Machine translation using neural network models
CN110263353B (en) * 2019-06-25 2023-10-13 北京金山数字娱乐科技有限公司 Machine translation method and device
CN111709431B (en) * 2020-06-15 2023-02-10 厦门大学 Instant translation method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821257A (en) * 2022-04-26 2022-07-29 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation
CN114821257B (en) * 2022-04-26 2024-04-05 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation

Also Published As

Publication number Publication date
WO2022073417A1 (en) 2022-04-14

Similar Documents

Publication Publication Date Title
CN110111787B (en) Semantic parsing method and server
WO2022073417A1 (en) Fusion scene perception machine translation method, storage medium, and electronic device
WO2021244457A1 (en) Video generation method and related apparatus
CN116564304A (en) Voice interaction method and device
CN110798506B (en) Method, device and equipment for executing command
CN111738122A (en) Image processing method and related device
CN113705823A (en) Model training method based on federal learning and electronic equipment
CN113542580B (en) Method and device for removing light spots of glasses and electronic equipment
CN110750992A (en) Named entity recognition method, device, electronic equipment and medium
CN111881315A (en) Image information input method, electronic device, and computer-readable storage medium
CN114255745A (en) Man-machine interaction method, electronic equipment and system
CN113806473A (en) Intention recognition method and electronic equipment
CN114242037A (en) Virtual character generation method and device
CN113837984A (en) Playback abnormality detection method, electronic device, and computer-readable storage medium
CN113066048A (en) Segmentation map confidence determination method and device
CN112256868A (en) Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
CN111768765B (en) Language model generation method and electronic equipment
CN111249728B (en) Image processing method, device and storage medium
CN114691839A (en) Intention slot position identification method
CN113468929A (en) Motion state identification method and device, electronic equipment and storage medium
CN114444000A (en) Page layout file generation method and device, electronic equipment and readable storage medium
CN115437601B (en) Image ordering method, electronic device, program product and medium
CN115641867B (en) Voice processing method and terminal equipment
CN112416984A (en) Data processing method and device
CN116861066A (en) Application recommendation method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination