WO2022073417A1 - Fusion scene perception machine translation method, storage medium, and electronic device - Google Patents

Fusion scene perception machine translation method, storage medium, and electronic device

Info

Publication number
WO2022073417A1
WO2022073417A1 PCT/CN2021/119655 CN2021119655W WO2022073417A1 WO 2022073417 A1 WO2022073417 A1 WO 2022073417A1 CN 2021119655 W CN2021119655 W CN 2021119655W WO 2022073417 A1 WO2022073417 A1 WO 2022073417A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
data
text
electronic device
translation
Prior art date
Application number
PCT/CN2021/119655
Other languages
French (fr)
Chinese (zh)
Inventor
徐传飞
潘邵武
王成录
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022073417A1 publication Critical patent/WO2022073417A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods

Definitions

  • The invention relates to the technical field of neural network machine translation, and in particular to a fusion scene-aware machine translation method, a storage medium, and an electronic device.
  • Neural Machine Translation (NMT) has developed in two stages: the first stage (2014-2017) was NMT based on the Recurrent Neural Network (RNN), whose core network architecture is an RNN; the second stage (2017-present) is NMT based on the Transformer neural network (hereinafter referred to as NMT-Transformer), whose core network architecture is a Transformer model.
  • Current mainstream translation devices or products based on NMT-Transformer face the same problem: translation accuracy is low when dealing with scene-based short text. A scene-based short text is usually composed of only a few words or characters and its meaning is closely tied to the scene in which it appears, so the contextual information that NMT-Transformer relies on is lacking, resulting in inaccurate translation.
  • The embodiments of the present application provide a fusion scene-aware machine translation method, a storage medium, and an electronic device. A scene label is generated based on scene perception data collected by the electronic device, and the generated scene label and the text to be translated are used together as a source language sequence. In the encoding stage of the Transformer network, the source language sequence is fusion-encoded and the information in it is extracted; in the decoding stage, that information is converted into the target language, and decoding yields a translation result that conforms to the scene of the text to be translated. This greatly improves the translation accuracy of scene-based short texts.
  • An embodiment of the present application provides a fusion scene-aware machine translation method for an electronic device with a machine translation function, including: acquiring text to be translated and scene perception data, where the scene perception data is collected by the electronic device and used to determine the scene where the electronic device is located; determining the scene where the electronic device is located according to the scene perception data; generating a scene label corresponding to that scene; inputting the scene label and the text to be translated together as a source language sequence into an encoder to obtain scene-perception-fused encoded data; and passing the scene-perception-fused encoded data through a decoder, which performs decoding and target-language conversion to obtain a scene-perception-fused translation result.
  • A mobile phone with a machine translation function can collect scene perception data, use the collected data to determine the scene where the phone is currently located, and then generate a scene label corresponding to that scene.
  • The mobile phone can then fuse the generated scene label into the encoding and decoding of machine translation to obtain a scene-perception-fused translation result, as the sketch below illustrates.
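  • The following is a minimal, illustrative sketch (in Python) of the flow just described; the class and function names (ScenePerceptionData, translate_with_scene, classify_scene, encode, decode) are assumptions for illustration, not names used in this application.
    from dataclasses import dataclass, field

    @dataclass
    class ScenePerceptionData:
        location: str = ""                              # e.g. place name resolved from GPS
        noise_level: float = 0.0                        # relative noise level from the microphone
        image_tags: list = field(default_factory=list)  # objects recognized by the camera

    def translate_with_scene(text, perception, classify_scene, encode, decode):
        # classify_scene / encode / decode stand in for the classifier and the
        # NMT-Transformer encoder and decoder described in this application
        scene_label = classify_scene(perception)        # e.g. "restaurant"
        source_sequence = f"{scene_label} {text}"       # scene label + text form one source sequence
        return decode(encode(source_sequence))          # fusion encoding, then target-language decoding

    # usage with trivial stand-ins
    data = ScenePerceptionData(location="shopping mall", noise_level=3.2, image_tags=["menu"])
    print(translate_with_scene("BATTERED WHITING 16.0", data,
                               classify_scene=lambda d: "Restaurant",
                               encode=lambda s: s,
                               decode=lambda e: "[target-language text for: " + e + "]"))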
  • The above-mentioned method further includes: determining the characteristics of the scene in which the electronic device is located according to the scene perception data, so as to obtain scene state data, where the scene state data is used to represent the scene in which the electronic device is located; and classifying and statistically analyzing the scene state data to determine the scene in which the electronic device is located.
  • A mobile phone with a machine translation function can determine the characteristics of the current scene based on the collected scene perception data and record the determined characteristics as scene state information.
  • The scene state data obtained above is then classified and statistically analyzed to obtain scene types, each containing one or more items of scene state information, and each scene type corresponds to a scene in which the mobile phone may be located.
  • The method further includes: the scene perception data is acquired by a detection element provided in the electronic device, and the detection element includes at least one of a GPS element, a camera, a microphone, and a sensor.
  • the scene perception data includes one or more of position data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
  • The mobile phone can continuously collect its current location data through its own GPS element, surrounding sound data through its microphone, and the temperature and light intensity data of the current environment through its temperature sensor and ambient light sensor.
  • The mobile phone can also collect its current angular motion data through the gyroscope.
  • In addition, the mobile phone can collect feature images of the surrounding environment and images of the text to be translated through the camera. Some of the scene perception data collected in this way may be invalid, but most of it is valid for determining the scene where the mobile phone is located and can be used for that purpose.
  • The characteristics of the scene in which the electronic device is located are determined according to the scene perception data in one or more of the following ways: determining the location name of the scene according to the position data; determining characteristic words or characteristic objects in the scene, and thereby the environmental characteristics of the scene, according to one or more of the words and objects in the image data; determining the noise type or noise level in the scene, and whether the scene is indoors or outdoors, according to one or more of the frequency, voiceprint, and amplitude of the sound data; determining the motion state of the electronic device in the scene according to the acceleration data and the angular motion data; and determining the temperature level and light intensity level of the scene, and whether the scene is indoors or outdoors, according to the ambient temperature data and the ambient light intensity data.
  • the mobile phone can determine the location name of the scene in which it is located, such as a shopping mall or an airport, through location data.
  • The mobile phone can also determine features of the scene through image data collected by the camera. For example, if the collected images show subway seats and subway station information, it can be determined that the user is riding the subway; if the text to be translated comes from the station information, the mobile phone can fuse the subway-riding scene into the scene-based short text translation.
  • The mobile phone can also determine whether it is indoors or outdoors through the collected sound data, and some sounds can preliminarily identify the scene itself; for example, the sound of mahjong tiles indicates that the user is indoors and possibly playing mahjong.
  • The acceleration data collected by the acceleration sensor and the angular motion data collected by the gyroscope can be used to determine the mobile phone's current motion state. For example, acceleration patterns differ between subway and bus scenarios, so the vehicle the user is riding can be determined from the acceleration data.
  • The mobile phone can also collect ambient temperature data or ambient light intensity data through the ambient temperature sensor and ambient light sensor to determine whether the scene is indoors or outdoors. Generally, the indoor temperature is lower than the outdoor temperature in summer and higher than the outdoor temperature in winter; during the day the indoor light intensity is lower than the outdoor light intensity, while at night the indoor light intensity is higher than the outdoor light intensity. A sketch of such per-sensor rules is given below.
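  • A hedged sketch, in Python, of the kind of per-sensor rules described above; the thresholds, field names, and categories are illustrative assumptions rather than values specified by this application.
    def scene_features(perception):
        features = {}
        if "place_name" in perception:                      # resolved from GPS / online map data
            features["place"] = perception["place_name"]    # e.g. "shopping mall", "airport"
        if "subway seat" in perception.get("image_tags", []):
            features["transport"] = "subway"                # characteristic object seen by the camera
        if "noise_level" in perception:                     # from the microphone
            features["indoor"] = 2 <= perception["noise_level"] < 4      # assumed indoor range
        if "acceleration" in perception:                    # accelerometer (plus gyroscope in practice)
            features["moving"] = abs(perception["acceleration"]) > 0.5   # assumed threshold
        if "ambient_lux" in perception:                     # ambient light sensor
            features["daylight_outdoor"] = perception["ambient_lux"] > 10000  # assumed daylight level
        return features

    print(scene_features({"place_name": "shopping mall", "noise_level": 3.1, "ambient_lux": 300}))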
  • The method further includes: determining a user motion state according to the scene perception data, where the user motion state is used to determine the characteristics of the scene; the scene perception data includes one or more of heart rate data and blood oxygen data.
  • the mobile phone can obtain the user's heart rate data, blood oxygen data, etc. by connecting to the wearable device.
  • the user's heart rate data or blood oxygen data can be collected through a smart watch or a wristband to determine whether the user is in a state of exercise.
  • When the user is exercising or the amount of exercise increases, the heart rate rises and the blood oxygen level also changes significantly, so the scene where the mobile phone is located can be further determined.
  • For example, the heart rate data changes greatly when the user is exercising in a gym, so the gym scene can be determined from that change in the heart rate data.
  • The user's current altitude can also be determined from heart rate data or blood oxygen data, and combined with location data to determine the user's current scene, so that the mobile phone can translate scene-based short texts with fused scene perception (a toy illustration of such a heart-rate rule is given below).
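  • A toy illustration of such a heart-rate rule; the 1.5x threshold and the resting heart rate are assumptions for illustration only.
    def motion_state(heart_rate_bpm, resting_hr=70):
        # a large rise above the resting heart rate is taken as a sign of exercise
        return "exercising" if heart_rate_bpm > 1.5 * resting_hr else "at rest"

    print(motion_state(130))   # exercising
    print(motion_state(75))    # at rest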
  • The above-mentioned method further includes: the order in which the scene label and the text to be translated are input into the encoder is determined based on the correlation between the text content in the text to be translated and the scene label; the greater the correlation between a piece of text content and the scene label, the closer that text content and the scene label are placed in the input.
  • The scene-perception-fused encoded data includes the scene feature information in the scene label and the text content information in the text to be translated, both extracted by the encoder during encoding; the encoder extracts the scene feature information and the text content information based on the order in which the scene label and the text to be translated are input to it.
  • the scene label "restaurant” can be placed before the dish name in the menu text to be translated and input into the encoder.
  • In this way, the extracted scene feature information is closer to the dish name, so the scene feature information has a greater impact on the translation of the dish name; an illustrative sketch of this ordering rule follows.
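  • The sketch below places the scene label immediately before the token span it is most correlated with. The correlation function here is a toy keyword heuristic; in a real system the score would come from the model or a learned metric, so this is an assumption for illustration.
    def build_source_sequence(scene_label, tokens, correlation):
        # place the scene label right before the most correlated token
        best = max(range(len(tokens)), key=lambda i: correlation(scene_label, tokens[i]))
        return tokens[:best] + [scene_label] + tokens[best:]

    menu_line = ["BATTERED", "WHITING", "16.0"]

    def toy_correlation(label, token):
        # dish-name tokens are alphabetic, prices are not; a stand-in for a real score
        return 1.0 if token.isalpha() else 0.0

    print(build_source_sequence("Restaurant", menu_line, toy_correlation))
    # ['Restaurant', 'BATTERED', 'WHITING', '16.0']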
  • The above-mentioned method further includes: the decoder selects words corresponding to the text content information in the target language based on the scene feature information, and generates the scene-perception-fused translation result.
  • the decoder selects words corresponding to the dish name in the target language based on the feature information of the dining scene to form a translation, and finally can obtain an accurate dish name translation result.
  • The above-mentioned method further includes: generating the scene label based on the scene state data is implemented by a classifier, and the encoder and the decoder are implemented by a neural network model.
  • The classifier performs the classification calculation on the scene state data through a classification algorithm, and the classification algorithm includes any one of a gradient boosting tree classification algorithm, a support vector machine algorithm, a logistic regression algorithm, and an Iterative Dichotomiser 3 (ID3) decision tree algorithm.
  • The neural network model includes a neural machine translation model based on the Transformer network.
  • the scene state data is classified and counted by the classifier in the mobile phone to determine the scene in which the mobile phone is located, and then the scene label is generated.
  • the mobile phone integrates scene tags to complete the translation of scene-based short texts through the trained NMT-Transformer translation model.
  • An embodiment of the present application provides a readable medium on which instructions are stored, and the instructions, when executed on an electronic device, cause the electronic device to execute the above-mentioned fusion scene-aware machine translation method.
  • Embodiments of the present application provide an electronic device, including: a memory for storing instructions executed by one or more processors of the electronic device, and a processor, which is one of the processors of the electronic device, configured to perform the fusion scene-aware machine translation method described above.
  • FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application
  • FIG. 2 is a schematic diagram of an example of an incorrect translation by a current translation device when translating a scene-based short text
  • FIG. 3 is a schematic diagram of the steps of the fusion scene-aware machine translation method of the present application.
  • FIG. 4 is a schematic diagram of a data conversion process in the fusion scene-aware machine translation method of the application
  • FIG. 5 is a schematic diagram of a process of embedding a scene tag in a text to be translated in an encoding process of the present application
  • FIG. 6 is a schematic diagram of interface comparison of a scene-based short text translation result according to the present application.
  • FIG. 7 is a schematic diagram of interface comparison of another scene-based short text translation result of the application.
  • FIG. 8 is a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application.
  • FIG. 9 is a block diagram of a software structure of a mobile phone 100 according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application.
  • the scene includes electronic device 100 and electronic device 200 , wherein electronic device 100 and electronic device 200 are connected through a network and perform data information exchange, and electronic device 100 or electronic device 200 has a machine translation function.
  • The user uses the electronic device 100 to take pictures, shoot a short video, or directly input the text to be translated for translation.
  • For pictures, videos, or speech, the image-to-text conversion function or the speech recognition function converts the input into the text data to be translated, which is then translated.
  • the electronic device 200 can be used to train an NMT-Transformer to enable it to fuse scene labels for translation encoding and translation decoding, and the electronic device 200 can also be used to train a classifier to enable it to generate scene labels based on scene-aware data.
  • The NMT-Transformer and the classifier trained by the electronic device 200 can be ported to the electronic device 100 for use.
  • the electronic device 100 can perform translation through its own translation function, or perform data interaction with the electronic device 200 by opening a locally installed translation software or opening an online translated webpage to complete the translation of the text to be translated.
  • The electronic device 100 is a terminal device that interacts with a user, on which application software or an application system capable of executing Neural Machine Translation (NMT) based on the Transformer network (NMT-Transformer) is installed.
  • a man-machine dialogue system may also be installed on the electronic device 100 to recognize a user's voice command requesting to perform a translation function, and the electronic device 100 may have a function of recognizing text on pictures or videos to recognize the pictures or videos as text data for further translation.
  • NMT trains a neural network that can map from one sequence to another, and its output can be a variable-length sequence; it also performs well in dialogue and text summarization.
  • NMT is essentially an encoder-decoder system: the encoder encodes the source language sequence and extracts the information in the source language, and the decoder then converts this information into another language, thereby completing the translation.
  • the current mainstream machine translation method is based on NMT-Transformer, and its core network architecture is a Transformer network, which means that the encoder and decoder functions in NMT-Transformer are implemented through the Transformer network.
  • The core of the Transformer network is the self-attention layer, which computes self-attention over the vector space. Self-attention can be understood as a degree of correlation: the stronger the relation between two vectors, the larger their self-attention, while the self-attention between two unrelated vectors is small or zero. A standard formulation is sketched below.
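  • The sketch below shows the standard scaled dot-product self-attention computation (general Transformer background, not specific to this application), to make the "degree of correlation between vectors" concrete.
    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise correlation scores between tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
        return weights @ V                                # each output mixes the most related tokens

    d = 4
    X = np.random.randn(3, d)                             # 3 tokens of dimension d
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)            # (3, 4)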
  • The current NMT-Transformer translation method has the problem of inaccurate translation due to the lack of context when translating scene-based short texts.
  • For example, a current translation device may incorrectly translate "battered whiting" as "battered cod", which bears little relation to the real dish name (the correct translation result should be: fried cod). Therefore, in the scene-based short text translation scenario, existing machine translation methods have low translation accuracy, resulting in poor user experience.
  • Scene-based short texts include, but are not limited to, dish names in menus, store names in shopping malls, special terms on immigration cards, and the like.
  • The current traditional solution is to add additional information of other dimensions in the decoding stage or post-processing stage of machine translation as a substitute for the missing context, so as to improve the translation accuracy of short texts. In essence, however, this is a secondary correction of the translation result: because the added information is relatively simple and does not match the text to be translated well, the translation accuracy of short texts remains low. Moreover, adding information in this way introduces a large amount of noise when decoding and extracting information, and the added information cannot be used directly in the Transformer network, so this approach is not suitable for the current mainstream NMT-Transformer.
  • the present application provides a fusion scene-aware machine translation method.
  • In this method, a scene label is generated based on the scene perception data collected by the electronic device (for example, the noise data of the scene, the image data of the text to be translated in the scene, and so on), and the generated scene label and the text to be translated are then used together as the source language sequence, which is fusion-encoded by the NMT-Transformer encoder to extract the information it contains.
  • In the decoding stage, the information in the source language sequence is converted into the target language, and decoding yields a translation result that conforms to the scene of the text to be translated.
  • the present application greatly improves the translation accuracy of scene-based short texts by incorporating scene tags into the text to be translated.
  • The electronic device 100 or the electronic device 200 to which the fusion scene-aware machine translation method of the present application is applied can generate a corresponding scene label in real time based on the scene perception data collected by the electronic device 100, embed the scene label as part of the source language sequence together with the text to be translated for translation encoding and translation decoding, and finally obtain a short-text translation result that conforms to the scene where the electronic device 100 is located.
  • The solution of the present application can fully integrate scene perception data, so that the generated scene labels participate in the encoding stage of machine translation throughout the process.
  • The scene labels fused by this solution are more diversified and more accurate, and the judgment of the scene label is not affected by the absence of any single kind of scene perception data.
  • When a certain kind of scene perception data is missing, other scene perception data can be used in its place in time to generate an accurate scene label. Therefore, the scene characteristic information contained in the scene labels fused by this solution is of higher quality, and the embedding of scene characteristic information is more flexible.
  • This also avoids the problem of large noise in the extracted information, and the user experience is improved accordingly.
  • The electronic device 100 includes, but is not limited to, laptop computers, tablet computers, mobile phones, wearable devices, head-mounted displays, servers, mobile email devices, portable game consoles, portable music players, reader devices, televisions with one or more processors embedded or coupled therein, or other electronic devices capable of accessing a network.
  • the electronic device 100 can collect scene perception data through its own sensors, a global positioning system (Global Positioning System, GPS), a camera, etc., and the electronic device 100 can also be used to train a classifier so that it can generate scene labels based on the scene perception data.
  • The electronic device 200 includes, but is not limited to, cloud platforms, servers, laptop computers, desktop computers, tablet computers, and other electronic devices capable of accessing a network, with one or more processors embedded or coupled therein.
  • the technical solutions of the present application are described in detail below by taking the electronic device 100 as a mobile phone and the electronic device 200 as a server as an example.
  • the fusion scene-aware machine translation method of the present application includes the following steps:
  • Step 301: The mobile phone 100 obtains the text to be translated and scene perception data, and obtains scene state data based on the scene perception data.
  • The text to be translated can be acquired by directly inputting it through the input interface of the mobile phone 100, or by taking pictures or videos with the mobile phone 100.
  • The text to be translated can also be text data obtained by recognizing and converting the user's voice command, which is not limited here.
  • the mobile phone 100 extracts the text information in the photograph or the image captured from the video through its own image recognition system, and converts it into text to be translated.
  • The mobile phone 100 obtains the voice command issued by the user; for example, the user can send the voice command to the mobile phone 100 by waking up the voice assistant, and the mobile phone 100 recognizes the text information in the user's voice command through its own man-machine dialogue system and converts it into the text to be translated.
  • The scene perception data can be acquired as data such as images and sounds collected by various detection elements of the mobile phone 100, such as a camera, a microphone, an infrared sensor, or a depth sensor.
  • FIG. 4 is a schematic diagram of a data transformation process in the fusion scene-aware machine translation method of the present application.
  • The mobile phone 100 can acquire scene perception data through the microphone, the gyroscope, the acceleration sensor, GPS, and computer vision (Computer Vision, CV).
  • Health status data collected by detection sensors (for example, heart rate data collected by a PPG sensor, blood oxygen data collected by a blood oxygen sensor, and so on) or pedometer data collected through a wristband or watch can also serve as one kind of scene perception data, which is not limited here.
  • The location information may be acquired through the above-mentioned GPS, or through a Wi-Fi signal, which is not limited here.
  • the mobile phone 100 can analyze and obtain scene state data based on the collected scene perception data.
  • one or more judgment rules for scene state data may be preset in the mobile phone 100 .
  • the mobile phone 100 may determine whether the mobile phone 100 is indoors or outdoors according to the noise type or noise level. For example, the mobile phone 100 may set the noise with a noise level in the range of 2 to 4 as the range of indoor noise, and set the range of 4 to 6 as the range of outdoor noise.
  • If the collected noise level falls within the indoor range, the corresponding scene state data obtained by the analysis of the mobile phone 100 is indoor noise.
  • If the collected noise level falls within the outdoor range, the corresponding scene state data obtained by the analysis is outdoor noise.
  • The mobile phone 100 can set construction noise (such as noise generated by road construction equipment) and traffic noise (such as car horns, engine sounds, tire friction sounds, and so on) as outdoor noise, and set the sounds played in public venues such as shopping malls, airports, and station platforms (such as announcement playback, music playback, and so on) as indoor noise; some living noises (such as the sound of playing mahjong and noise from other entertainment venues) can also be set as indoor noise.
  • The above noise types are mainly identified based on the different frequencies and voiceprints of different sounds.
  • The mobile phone 100 may also simply use the frequency range of the sound as the basis for judging the noise type, which is not limited herein. Therefore, by analyzing the noise type or noise level of the sound collected by the microphone, the mobile phone 100 can obtain scene state data indicating whether it is in an indoor scene or an outdoor scene; an illustrative rule set is sketched below.
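  • An illustrative rule set following the example ranges above (indoor roughly level 2 to 4, outdoor roughly level 4 to 6) and the noise-type assignments; real thresholds would be calibrated on the device, so these values are assumptions.
    OUTDOOR_TYPES = {"construction", "traffic"}
    INDOOR_TYPES = {"announcement", "music", "mahjong"}

    def classify_environment(noise_level=None, noise_type=None):
        if noise_type in OUTDOOR_TYPES:
            return "outdoor"
        if noise_type in INDOOR_TYPES:
            return "indoor"
        if noise_level is not None:
            if 2 <= noise_level < 4:
                return "indoor"
            if 4 <= noise_level <= 6:
                return "outdoor"
        return "unknown"

    print(classify_environment(noise_level=3.1))          # indoor
    print(classify_environment(noise_type="traffic"))     # outdoor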
  • The mobile phone 100 can determine whether the location of the current scene is a shopping mall or a station based on the GPS location data and online map data; the mobile phone 100 can also photograph the target to be translated (text or picture, for example a menu) through the CV together with the GPS location data.
  • The scene state data obtained from such analysis may be, for example, a restaurant, or a restaurant in a shopping mall.
  • The mobile phone 100 can also analyze scene state data such as running or riding status, pedometer data, and motion trajectories based on its own gyroscope and acceleration sensor, position data measured by GPS, and heart rate data measured by a wearable device (such as a watch) connected to the mobile phone 100.
  • The mobile phone 100 may also analyze which vehicle the user is currently taking based on the images collected by the CV.
  • For example, if the images collected by the CV show subway seats and the subway station announcement screen, or bus seats and the route information icons posted in a bus, the mobile phone 100 can determine that the user is in a subway-riding or bus-riding scene.
  • the mobile phone 100 can call other scene perception data to replace the missing scene perception data to determine the scene state data. For example, when the GPS of the mobile phone 100 is not turned on, the mobile phone 100 cannot collect location data, and the mobile phone 100 can obtain scene state data by analyzing the sound collected by the microphone and the environmental characteristics collected by the infrared sensor.
  • The scene state data can be obtained by analyzing one or more kinds of scene perception data. For a simple, easily distinguished scene, the mobile phone 100 can determine its scene state data from less scene perception data; for example, for a station or airport scene, the mobile phone 100 may only need the location information collected by GPS, or the type of sound collected by the microphone in the station, to determine the basic scene state. For a scene that is relatively complicated and difficult to distinguish, the mobile phone 100 may need to integrate multiple kinds of perception data to comprehensively judge the basic scene state, which is not limited here.
  • The scene perception data obtained by the mobile phone 100 may include scene feature image data (for example, image data collected by the CV), location data (for example, angular motion data collected by the mobile phone's gyroscope and position data collected by the mobile phone's GPS), sound data (for example, sound data collected through the microphone), and so on.
  • When the text to be translated is manually input, the camera of the mobile phone 100 does not need to be turned on and is therefore not turned on, so the mobile phone 100 may not collect image data through the CV. In this case, the scene perception data obtained by the mobile phone 100 may include location data (for example, angular motion data measured by the gyroscope and position data collected through GPS), sound data (for example, sound data collected through the microphone), motion data (for example, heart rate data collected through a smart watch or smart bracelet and acceleration data collected through the acceleration sensor), environmental data (for example, ambient temperature data collected through a temperature sensor and ambient light intensity data collected through an ambient light sensor), and so on. It can be understood that, in some scenarios, although there is no shooting interface on the screen of the mobile phone 100, the camera of the mobile phone 100 can still work in the background to collect CV signals.
  • In other scenarios, the scene perception data obtained by the mobile phone 100 may include location data (for example, angular motion data measured by the gyroscope and position data collected through GPS), the user's movement state data (for example, heart rate data and blood oxygen data collected through a smart watch or smart bracelet), environmental data (for example, ambient temperature data collected through a temperature sensor and ambient light intensity data collected through an ambient light sensor), and so on.
  • The device that acquires the scene perception data and the device that determines the basic scene state may be the same electronic device (for example, the mobile phone 100 may both collect the scene perception data and directly analyze the scene state data), or may be different electronic devices.
  • For example, the scene perception data collected by the mobile phone 100 can be sent to the server 200 for further analysis to obtain the scene state data; or the scene perception data can be collected through smart wearable devices such as watches and wristbands and sent to the mobile phone 100 for further analysis to obtain the scene state data, which is not limited here.
  • Step 302: The mobile phone 100 generates a scene label based on the obtained scene state data.
  • the mobile phone 100 classifies and labels the scene state data obtained by the above analysis, and different scene state data may correspond to the same scene label. Therefore, it can be understood that the correspondence between the scene state data and the scene label is a many-to-one or one-to-one correspondence.
  • the generation of the scene label by the mobile phone 100 based on the scene state data may be accomplished through a pre-trained scene classifier.
  • the mobile phone 100 can perform classification training by inputting the scene state data into a Gradient Boosting Decision Tree (GBDT) classifier, and label the scene state data classified in the same or similar categories with the same scene label.
  • The scene state data can be sample scene state data specially collected for training the classifier, or scene state data analyzed during actual machine translation use; the scene state data can be accumulated over time to form a scene state database, and the corresponding scene labels can likewise be accumulated over time to form a scene tag library. Since the storage space occupied by the classifier algorithm is relatively small, the classifier can be trained on an electronic device such as the mobile phone 100, or the training can be completed on the server 200, which is not limited here.
  • the above GBDT classifier is a classifier that applies the GBDT algorithm.
  • The GBDT algorithm is one of the best algorithms for fitting the real distribution among traditional machine learning algorithms; it can be used for both classification and regression, and can also be used to filter features.
  • the principle of the GBDT algorithm is to generate a weak classifier through multiple rounds of iterations, and each classifier is trained on the basis of the residual of the previous round of classifiers.
  • The weak classifiers are generally required to be simple enough, with low variance and high bias, because the training process continuously improves the accuracy of the final classifier by reducing the bias. The standard boosting update is summarized below for reference.
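  • The following is the standard gradient-boosting update (general background, not taken from this application): each round m fits a weak learner h_m to the residuals of the previous model and adds it with a learning rate \nu.
    % gradient boosting update; in the squared-error case the negative gradient equals the residual
    F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad r_{i,m} = y_i - F_{m-1}(x_i)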
  • the decision tree used by the GBDT algorithm is the CART regression tree.
  • The GBDT classifier can classify the scene state data, and the scene state data falling into the same or similar categories can then be manually labeled with the same scene label through the mobile phone 100 or the server 200; after training with a large number of scene state samples, the classifier can output the corresponding scene label.
  • A number of such scene labels form the scene tag database.
  • For example, the scene state data that the mobile phone 100 obtains from GPS data may be a well-known restaurant or a shopping mall; the scene state data obtained from the noise type of the sound collected by the microphone may be indoor noise; and analysis of the photo of the target item taken by the CV and the collected photos of the surrounding environment may yield the scene state data that the target is a menu.
  • Combining these, it can be determined that the current scene is a menu-translation scene while dining in a restaurant, so "restaurant" can be marked as the scene label. A minimal training sketch for such a classifier follows.
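  • A minimal sketch of training such a scene-label classifier on scene-state features with gradient boosting; the feature encoding, sample values, and labels are illustrative assumptions, not data from this application.
    from sklearn.ensemble import GradientBoostingClassifier

    # each row encodes scene state data, e.g. [indoor(0/1), near_restaurant(0/1), menu_in_image(0/1)]
    X = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
    y = ["restaurant", "restaurant", "street", "airport"]

    clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
    clf.fit(X, y)
    print(clf.predict([[1, 1, 1]]))   # -> ['restaurant']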
  • the mobile phone 100 or the server 200 can also train other classification algorithm models so that they can generate scene labels corresponding to the scene state data.
  • Other classification algorithms include, but are not limited to, the support vector machine (SVM) algorithm, the logistic regression (LR) algorithm, the Iterative Dichotomiser 3 (ID3) decision tree algorithm, and so on, which are not limited here.
  • Step 303: The mobile phone 100 encodes the above scene label and the text to be translated together as a source language sequence through the encoder in the NMT-Transformer, and extracts the scene information and to-be-translated text information in the source language sequence.
  • The scene label and the text to be translated are input into the encoder of the NMT-Transformer (the encoding layer in the Transformer network) for encoding. The encoding layer in the Transformer network is implemented by a multi-layer self-attention network, in which the attention vectors output by each self-attention layer serve as the input of the next self-attention layer, as in the compact sketch below.
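  • A compact stand-in for such an encoder stack, using PyTorch's built-in Transformer encoder layers; the layer sizes and token counts are illustrative assumptions.
    import torch
    import torch.nn as nn

    d_model, n_layers = 64, 4
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=n_layers,   # each layer's attention output becomes the next layer's input
    )

    # source sequence = scene-label tokens + to-be-translated tokens, already embedded
    src = torch.randn(1, 7, d_model)   # batch of 1, 7 tokens (e.g. "Restaurant" + one menu line)
    memory = encoder(src)              # fused scene + text representation passed to the decoder
    print(memory.shape)                # torch.Size([1, 7, 64])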
  • FIG. 5 shows a schematic diagram of the process of embedding scene tags in the text to be translated in an encoding process.
  • X5, X6, X7, and X8 represent the words that constitute the scene label, and X9, X10, X11, and X12 represent the words that constitute the text to be translated.
  • For example, the text to be translated is the text data obtained by converting a photographed menu, and the scene label is "restaurant".
  • the dish name in the menu has a higher correlation with the scene label, while the price of the dish has a low or no correlation with the scene label.
  • the input of the scene label should be placed before the input of the dish name text, so that the scene label and the dish name text are closer.
  • Each line of text on the menu is of the form dish name + price + description, so the scene label "restaurant" can be inserted before each line of menu text as it is entered into the Transformer network.
  • Thus the source language sequence input to the Transformer network is: Restaurant BATTERED WHITING 16.0M.
  • Because "Restaurant" is close to "BATTERED WHITING", it attends to it more strongly and has a greater impact on its translation result.
  • As a result, BATTERED WHITING is translated as fried cod instead of the earlier wrong translation "battered cod", while the distance between "Restaurant" and "16.0M" is larger, so the scene label has little influence on the translation of the price.
  • Step 304: The mobile phone 100 uses the decoder in the NMT-Transformer to decode, word by word, the scene information and to-be-translated text information extracted from the scene label and the text to be translated in the encoding phase into the translated text expressed in the target language, and outputs the translation result.
  • the decoder (the decoding layer in the Transformer network) selects the target language for decoding based on the scene information extracted by the encoder during the encoding process to obtain the translation.
  • the decoding layer in the Transformer network is also implemented by a multi-layer self-attention network.
  • Since the Transformer network has already extracted the scene information in the embedded scene label while extracting the to-be-translated text information during the encoding stage of the NMT-Transformer, in the decoding stage the Transformer network can directly select target-language words based on that scene information for decoding, so as to obtain a translation that better matches the scene. A continuation of the encoder sketch above, showing the decoder side, is given below.
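  • Continuing the PyTorch sketch above, the decoder attends over the fused scene and text encoding ("memory") while generating target-language tokens; the shapes and sizes remain illustrative assumptions.
    import torch
    import torch.nn as nn

    d_model = 64
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=4,
    )
    memory = torch.randn(1, 7, d_model)   # encoder output (scene label + text to be translated)
    tgt = torch.randn(1, 3, d_model)      # embeddings of the target-language tokens generated so far
    out = decoder(tgt, memory)            # each decoding step can attend to the scene-label positions
    print(out.shape)                      # torch.Size([1, 3, 64])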
  • FIG. 6 is a schematic diagram of interface comparison of a scene-based short text translation result according to the present application.
  • Figure 6(a) shows the translation result interface of a traditional translation device.
  • There, the final decoded translation of the dish name Fisherman's Basket is the literal "fisherman's basket", which is obviously a wrong translation result.
  • Figure 6(b) shows the translation result interface of a translation device using the fusion scene-aware machine translation method of the present application: based on the scene information extracted from the scene label (restaurant), the translation obtained by decoding the dish Fisherman's Basket is "seafood platter", which is the correct translation result.
  • FIG. 7 is a schematic diagram of interface comparison of another scene-based short text translation result according to the present application. As shown in Figure 7, it displays the translation result of an entry card filled in during an entry-exit scenario. The text to be translated shown in Figure 7(a) is the source language of the entry card (customs declaration card); Figure 7(b) shows the translation result of a traditional translation device; and Figure 7(c) shows the translation result of a translation device that applies the fusion scene-aware machine translation method of the present application.
  • S1: Acquire the text to be translated and scene perception data, and obtain scene state data based on the scene perception data.
  • By turning on the camera of the mobile phone 100 to photograph the page of the entry-exit card, the mobile phone 100 extracts the text to be translated on the photographed card through its own image recognition system.
  • The mobile phone 100 collects position data through GPS, sound data through the microphone, and environmental feature image data through the CV as scene perception data, and obtains the corresponding scene state data from the collected data. For example, the location data collected by GPS determines whether the location or a nearby geographic marker is an airport or a customs office; the sound data collected by the microphone determines whether the environment is indoor or outdoor; and the environmental feature images collected by the CV determine whether there is a registration window, registration form, or the like in the environment. If the use of the CV is restricted in some entry-exit scenarios, image data need not be collected through the CV as scene perception data, and the scene state data can be obtained from other scene perception data.
  • the quantity and type of scene perception data collected by the mobile phone 100 are not limited herein. Specific reference is made to the foregoing step 301 and related descriptions, which will not be repeated here.
  • the mobile phone 100 uses the encoder in the NMT-Transformer to encode the above-generated entry-exit scene label and the text to be translated together as a source language sequence, and extracts entry-exit scene information and to-be-translated text information in the source language sequence.
  • The mobile phone 100 uses the decoder in the NMT-Transformer to decode, word by word, the entry-exit scene information and the to-be-translated text information extracted from the entry-exit scene label and the text to be translated in the encoding phase into the translated text expressed in the target language, and outputs the translation result, as shown in Figure 7(c).
  • the specific decoding process refer to the above-mentioned step 304 and related descriptions, which will not be repeated here.
  • The mobile phone 100 can have built in an application that applies the fusion scene-aware machine translation method of the present application to achieve accurate translation of scene-based short texts; alternatively, application software can be installed on the mobile phone 100 that interacts with the server 200, sending the text to be translated to the server 200, which completes the translation based on the fusion scene-aware machine translation method and feeds the translation result back to the mobile phone 100, which is not limited here.
  • FIG. 8 shows a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application.
  • The mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identity module (SIM) card interface 195, and so on.
  • The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the mobile phone 100 .
  • The mobile phone 100 may include more or fewer components than shown, some components may be combined or separated, or the components may be arranged differently.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • The processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on. Different processing units may be independent devices or may be integrated into one or more processors.
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • The memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thereby improves system efficiency.
  • The mobile phone 100 can train the scene classifier and the encoder and decoder of the NMT-Transformer through the processor 110, and during actual scene-based short text translation, the processor 110 processes the acquired scene perception data and the text to be translated and executes the fusion scene-aware machine translation method described in steps 301 to 304 above.
  • the processor 110 may include one or more interfaces.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the mobile phone 100 .
  • the mobile phone 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the mobile phone 100 . While the charging management module 140 charges the battery 142 , it can also supply power to the electronic device through the power management module 141 .
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140 to supply power to the processor 110 , the internal memory 121 , the display screen 194 , the camera 193 , and the wireless communication module 160 .
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • the mobile phone 100 implements communication and data transmission with the server 200 through the above-mentioned wireless communication function.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the mobile phone 100 .
  • The wireless communication module 160 can provide wireless communication solutions applied on the mobile phone 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and other wireless communication solutions.
  • the antenna 1 of the mobile phone 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the mobile phone 100 can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, and the like.
  • The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
  • the mobile phone 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • The image or text collected by the mobile phone 100 for the scene-based short text to be translated is displayed on the display screen 194, and the translation result produced by the mobile phone 100 is also displayed on the display screen 194 as feedback to the user.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the mobile phone 100 may include one or N display screens 194 , where N is a positive integer greater than one.
  • the SIM card interface 195 is used to connect a SIM card.
  • the mobile phone 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the collection of the CV signal by the mobile phone 100 can also be realized by the above-mentioned shooting function, that is, the image of the scene or the image of the text to be translated is collected by the above-mentioned shooting function.
  • Camera 193 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the mobile phone 100 may include one or N cameras 193 , where N is a positive integer greater than one.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 100 .
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area can store data (such as audio data, phone book, etc.) created during the use of the mobile phone 100 and the like.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the mobile phone 100 by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
  • the processor 110 executes the fusion scene-aware machine translation method of the present application by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
  • the mobile phone 100 can implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, and an application processor.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals.
  • the user can make a sound with the mouth close to the microphone 170C, thereby inputting the sound signal into the microphone 170C.
  • the mobile phone 100 may be provided with at least one microphone 170C.
  • the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals.
  • the mobile phone 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the microphone 170C can collect sound signals and determine the noise level or noise type of the collected sound signals, so as to further analyze the scene state data, such as whether it is indoors or outdoors.
  • the earphone jack 170D is used to connect wired earphones.
  • the gyroscope sensor 180B can be used to determine the motion attitude of the mobile phone 100 .
  • the angular velocity of cell phone 100 about three axes may be determined by gyro sensor 180B.
  • the gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyroscope sensor 180B detects the shaking angle of the mobile phone 100, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to offset the shaking of the mobile phone 100 through reverse motion to realize anti-shake.
  • the gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the mobile phone 100 in various directions (generally along three axes). When the mobile phone 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to recognize the posture of the mobile phone 100, and can be used in applications such as switching between landscape and portrait modes, pedometers, etc. During the implementation of the present application, certain scene state data, such as the user's walking, running, and riding states, can be obtained by analyzing the shaking state data measured by the gyro sensor 180B and the acceleration data measured by the acceleration sensor 180E.
  • the cell phone 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the mobile phone 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the mobile phone 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the mobile phone 100 is in the pocket, so as to prevent accidental touch.
  • the scene state data may be analyzed based on the ambient light brightness sensed by the ambient light sensor 180L, for example, to determine whether the current scene is indoor or outdoor.
  • the keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
  • the cell phone 100 can receive key input and generate key signal input related to user settings and function control of the cell phone 100 .
  • the software system of the mobile phone 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present invention take an Android system with a layered architecture as an example to illustrate the software structure of the mobile phone 100.
  • FIG. 9 is a block diagram of a software structure of a mobile phone 100 according to an embodiment of the present invention.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication function of the mobile phone 100 .
  • for example, the management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the mobile phone 100 vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • surface manager surface manager
  • media library Media Libraries
  • 3D graphics processing library eg: OpenGL ES
  • 2D graphics engine eg: SGL
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to realize 3D graphics drawing, image rendering, compositing and layer processing, etc.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • the workflow of the software and hardware of the mobile phone 100 is exemplarily described below with reference to the menu translation scenario.
  • the kernel layer processes touch operations into original input events (including opening translation software, or opening the camera 193 and other operations).
  • Raw input events are stored at the kernel layer.
  • the application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking an example in which the touch operation is a touch click operation and the control corresponding to the click operation is the control of the camera application icon, the camera application calls the interface of the application framework layer to start the camera application, and then starts the camera driver by calling the kernel layer.
  • the camera 193 captures a still image or video of the menu to be translated.
  • the present disclosure also relates to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored on a computer readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memory (ROM), random access memory (RAM), EPROM, EEPROM, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, and each may be coupled to a computer system bus.
  • the computers referred to in the specification may include a single processor or may be architectures employing multiple processors for increased computing power.

Abstract

A fusion scene perception machine translation method, a storage medium, and an electronic device, which relate to the technical field of neural network machine translation. The fusion scene perception machine translation method comprises: obtaining text to be translated and scene perception data; determining, according to the scene perception data, the scene in which an electronic device is located; generating a scene tag corresponding to the scene in which the electronic device is located; jointly using the scene tag and said text as a source language sequence and inputting same into an encoder that is used for translation, performing encoding, and obtaining encoded data of fusion scene perception; and by means of a decoder used for translation, decoding the encoded data and converting same into a target language, and obtaining a translation result for fusion scene perception. By means of the electronic device, scene perception data is collected and a scene tag is generated, then fusion, translation, encoding, and decoding are performed on the scene tag and text to be translated, and a translation result that is in line with the scene of said text is obtained, thus greatly improving the translation accuracy of scene short text.

Description

Fusion scene-aware machine translation method, storage medium and electronic device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 10, 2020, with application number 202011079936.X and the title "Fusion scene-aware machine translation method, storage medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of neural network machine translation, and in particular, to a fusion scene-aware machine translation method, a storage medium, and an electronic device.
Background
From the early dictionary matching, to rule-based translation combining dictionaries with the knowledge of linguistics experts, to corpus-based statistical machine translation, and with the improvement of computer computing power and the explosive growth of data, translation based on deep neural networks, that is, Neural Machine Translation (NMT), has become more and more widely used. The development of NMT consists of two stages: the first stage (2014-2017) is NMT based on the Recurrent Neural Network (RNN), whose core network architecture is an RNN; the second stage (2017-present) is NMT based on the Transformer neural network (hereinafter referred to as NMT-Transformer), whose core network architecture is a Transformer model. In the field of machine translation, the RNN has gradually been replaced by the Transformer.
However, the current mainstream translation devices or products based on NMT-Transformer face the same problem: the translation accuracy is low when dealing with scene-based short texts. This is mainly because a scene-based short text is usually composed of only a few words or characters whose meaning is closely related to the scene in which it appears, yet the scene-based short text lacks key contextual information, which leads to inaccurate NMT-Transformer translations.
Summary of the Invention
The embodiments of the present application provide a fusion scene-aware machine translation method, a storage medium, and an electronic device. A scene label is generated based on the scene-awareness data collected by the electronic device, and the generated scene label and the text to be translated are used together as a source language sequence, which is fusion-encoded in the encoding stage of the Transformer network so that the information in the source language sequence is extracted. Finally, in the decoding stage of the Transformer network, the information in the source language sequence is converted into the target language and decoded to obtain a translation result that conforms to the scene of the text to be translated, which greatly improves the translation accuracy of scene-based short texts.
In a first aspect, an embodiment of the present application provides a fusion scene-aware machine translation method for an electronic device with a machine translation function, including: acquiring text to be translated and scene-awareness data, where the scene-awareness data is collected by the electronic device and is used to determine the scene where the electronic device is located; determining, according to the scene-awareness data, the scene where the electronic device is located; generating, based on the scene where the electronic device is located, a scene label corresponding to the scene; inputting the scene label and the text to be translated together as a source language sequence into an encoder used for translation for encoding, to obtain fusion scene-aware encoded data; and decoding the fusion scene-aware encoded data and converting it into the target language through a decoder used for translation, to obtain a fusion scene-aware translation result.
For example, a mobile phone with a machine translation function can collect scene-awareness data, and the collected scene-awareness data is used to determine the scene where the mobile phone is currently located, so that a scene label corresponding to the current scene can be generated. When the mobile phone is used for translation, it can fuse the generated scene label into the encoding and decoding of machine translation, thereby obtaining a fusion scene-aware translation result.
In a possible implementation of the above first aspect, the method further includes: determining, according to the scene-awareness data, the characteristics of the scene where the electronic device is located, so as to obtain scene state data, where the scene state data is used to represent the scene where the electronic device is located; and performing classification statistics on the scene state data to determine the scene where the electronic device is located.
For example, a mobile phone with a machine translation function can determine the characteristics of the current scene based on the collected scene-awareness data, and record the determined scene characteristics as scene state information. By performing classification statistics on the scene state data obtained above, scene types each containing one or more pieces of scene state information can be obtained, and each scene type corresponds to a scene where the mobile phone is located.
In a possible implementation of the above first aspect, the method further includes: the scene-awareness data is collected by a detection element provided in the electronic device, and the detection element includes at least one of a GPS element, a camera, a microphone, and a sensor. The scene-awareness data includes one or more of position data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
For example, the mobile phone can continuously collect its current position data through its own GPS element, collect surrounding sound data through its own microphone, collect the temperature data and light intensity data of the current scene environment through its own temperature sensor and ambient light sensor, and collect the current angular motion data of the mobile phone through the gyroscope. The mobile phone can also collect image data of characteristic objects in the surrounding environment and capture image data containing the text to be translated through the camera. Some of the scene-awareness data collected by the mobile phone may be invalid for further use, but most of the scene-awareness data is valid when subsequently determining the scene where the mobile phone is located and can be used to determine that scene.
In a possible implementation of the above first aspect, determining the characteristics of the scene where the electronic device is located according to the scene-awareness data includes one or more of the following situations: determining the location name of the scene according to the position data; determining characteristic text or characteristic objects in the scene according to one or more of the text and the target object in the image data, and determining the environmental characteristics of the scene; determining the noise type or noise level in the scene according to one or more of the frequency, voiceprint, and amplitude in the sound data, and determining whether the scene is indoor or outdoor; determining the motion state of the electronic device in the scene according to the acceleration data and the angular motion data; and determining the temperature level and light intensity level of the scene according to the ambient temperature data and the ambient light intensity data, and determining whether the scene is indoor or outdoor.
For example, the mobile phone can determine the location name of the scene, such as a certain shopping mall or airport, through the position data. The mobile phone can also determine the characteristic objects in the scene through the image data collected by the camera; for example, if the collected image data shows subway seats and subway station information, it can be determined that the scene is a subway-riding scene, and if the text to be translated is subway station information, the mobile phone can fuse the subway-riding scene to translate the scene-based short text. The mobile phone can also determine whether it is currently indoors or outdoors through the collected sound data, and some sound data can also be used to preliminarily judge the scene; for example, collected construction noise indicates that the phone is currently outdoors, while the collected sound of mahjong tiles colliding indicates that the phone is currently indoors and possibly in a mahjong-playing scene. The acceleration data collected by the acceleration sensor and the angular motion data collected by the gyroscope of the mobile phone can be used to determine its current motion state; for example, the acceleration data in a subway-riding scene differs from that in a bus-riding scene, and which means of transport is being taken can be determined according to the acceleration data. The mobile phone can also collect ambient temperature data or ambient light intensity data through the ambient temperature sensor and the ambient light sensor to determine whether the scene is indoor or outdoor; generally, the indoor temperature is lower than the outdoor temperature in summer, the indoor temperature is higher than the outdoor temperature in winter, the indoor light intensity is lower than the outdoor light intensity during the day, and the indoor light intensity is higher than the outdoor light intensity at night when the lights are turned on.
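For illustration only, a preset judgment rule of this kind could be sketched as follows (the speed and acceleration-variance thresholds, as well as the state names, are assumptions introduced for this sketch rather than values disclosed in the present application):

```python
def motion_state(mean_speed_kmh: float, accel_variance: float) -> str:
    """A rough rule mapping sensor statistics to a motion/transport state."""
    if mean_speed_kmh < 1 and accel_variance < 0.1:
        return "stationary"
    if mean_speed_kmh < 7:
        return "walking"
    if mean_speed_kmh < 25:
        return "riding"        # e.g. cycling
    # Vehicles: a subway tends to show smoother acceleration than a bus in traffic.
    return "subway" if accel_variance < 0.5 else "bus"

print(motion_state(0.2, 0.05))  # stationary
print(motion_state(35.0, 0.3))  # subway
```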
In a possible implementation of the above first aspect, the method further includes: determining a user motion state according to the scene-awareness data, where the user motion state is used to determine the characteristics of the scene; and the scene-awareness data includes one or more of heart rate data and blood oxygen data.
For example, the mobile phone can obtain the user's heart rate data, blood oxygen data, and the like by connecting to a wearable device, for example, collecting the user's heart rate data or blood oxygen data through a smart watch or wristband to determine whether the user is exercising; when the user exercises or the amount of exercise increases, the heart rate increases and the blood oxygen level also changes considerably. The scene where the mobile phone is located can be further determined from the user's exercise data; for example, the heart rate data changes greatly when the user is working out in a gym, so when a translation service is needed in the gym, the mobile phone can determine the fitness scene based on the change in the heart rate data, so as to realize fusion scene-aware translation of scene-based short texts. The user's current altitude can also be determined through heart rate data or blood oxygen data; for example, while the user is mountain climbing, the user's current scene can be determined through altitude and position data, so that the mobile phone can realize fusion scene-aware translation of scene-based short texts.
In a possible implementation of the above first aspect, the method further includes: the order in which the scene label and the text to be translated are input into the encoder is determined based on the correlation between the text content in the text to be translated and the scene label; the greater the correlation between the text content in the text to be translated and the scene label, the closer the input distance between that text content and the scene label.
The fusion scene-aware encoded data includes the scene characteristic information in the scene label and the text content information in the text to be translated that are extracted by the encoder during encoding, and the encoder extracts the scene characteristic information and the text content information in the order in which the scene label and the text to be translated are input into the encoder.
For example, in a dining scene, since the scene label is more relevant to the dish names, the scene label "restaurant" can be placed before the dish names in the menu text to be translated when it is input into the encoder; the scene characteristic information extracted by the encoder during encoding is then closer to the dish names, so the scene characteristic information has a greater influence on the translation of the dish names.
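As an illustrative sketch of how such a fused source sequence could be constructed (the label string, the segment list, and the relevance scores below are assumptions for illustration, not data from the present disclosure):

```python
def build_source_sequence(scene_label, segments, relevance):
    """Place the scene label directly before the segment it is most relevant to.

    scene_label: e.g. "restaurant", generated from the scene-awareness data.
    segments:    pieces of the text to be translated, e.g. lines of a menu.
    relevance:   assumed precomputed relevance score of each segment to the label.
    """
    if not segments:
        return [scene_label]
    # Index of the segment most relevant to the scene label.
    best = max(range(len(segments)), key=lambda i: relevance.get(segments[i], 0.0))
    # Insert the label immediately before that segment, so that during encoding
    # the extracted scene characteristic information sits close to it.
    return segments[:best] + [scene_label] + segments[best:]

fused = build_source_sequence(
    "restaurant",
    ["Menu", "battered whiting", "soft drinks"],
    {"Menu": 0.1, "battered whiting": 0.9, "soft drinks": 0.4},
)
print(fused)  # ['Menu', 'restaurant', 'battered whiting', 'soft drinks']
```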
In a possible implementation of the above first aspect, the method further includes: the decoder selects, based on the scene characteristic information, words corresponding to the text content information in the target language, and generates the fusion scene-aware translation result.
For example, in a dining scene, the decoder selects, based on the dining scene characteristic information, words corresponding to the dish name in the target language to form the translation, and an accurate translation of the dish name can finally be obtained.
In a possible implementation of the above first aspect, the method further includes: generating the scene label based on the scene state data is implemented by a classifier, and the encoder and the decoder are implemented by a neural network model. The classifier implements the classification calculation of the scene state data through a classification algorithm, and the classification algorithm includes any one of a gradient boosting tree classification algorithm, a support vector machine algorithm, a logistic regression algorithm, and an iterative dichotomiser 3 (ID3) decision tree algorithm; the neural network model includes a neural network machine translation model based on the Transformer network.
For example, a classifier in the mobile phone performs classification statistics on the scene state data to determine the scene where the mobile phone is located, and then generates the scene label. When the mobile phone is used for translation, the mobile phone fuses the scene label and completes the translation of the scene-based short text through the trained NMT-Transformer translation model.
In a second aspect, an embodiment of the present application provides a readable medium on which instructions are stored, and the instructions, when executed on an interactive device, cause the electronic device to execute the above fusion scene-aware machine translation method.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device, and a processor, which is one of the processors of the electronic device, for executing the above fusion scene-aware machine translation method.
Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application;
FIG. 2 is a schematic diagram of an example of incorrect translation by a current translation device when translating a scene-based short text;
FIG. 3 is a schematic diagram of the steps of the fusion scene-aware machine translation method of the present application;
FIG. 4 is a schematic diagram of the data conversion process in the fusion scene-aware machine translation method of the present application;
FIG. 5 is a schematic diagram of a process of embedding a scene label into the text to be translated in an encoding process of the present application;
FIG. 6 is a schematic diagram of an interface comparison of a scene-based short text translation result of the present application;
FIG. 7 is a schematic diagram of an interface comparison of another scene-based short text translation result of the present application;
FIG. 8 is a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application;
FIG. 9 is a block diagram of a software structure of a mobile phone 100 according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the embodiments of the present application are further described in detail below with reference to the accompanying drawings and embodiments.
FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application.
As shown in FIG. 1, the scenario includes an electronic device 100 and an electronic device 200, where the electronic device 100 and the electronic device 200 are connected through a network and exchange data and information, and the electronic device 100 or the electronic device 200 has a machine translation function. The user performs translation by taking photos or short videos with the electronic device 100 or by directly inputting the text to be translated. The photos or short videos taken by the electronic device 100, as well as the voice instructions input by the user to the electronic device 100, need to be converted into text data to be translated through the image-to-text conversion function or the speech recognition function of the electronic device 100 before translation.
The electronic device 200 can be used to train the NMT-Transformer so that it can fuse scene labels for translation encoding and translation decoding, and the electronic device 200 can also be used to train the classifier so that it can generate scene labels based on scene-awareness data. The NMT-Transformer and the classifier trained by the electronic device 200 can be transplanted to the electronic device 100 for use.
The electronic device 100 can perform translation through its own translation function, or exchange data with the electronic device 200 by opening locally installed translation software or an online translation webpage, so as to complete the translation of the above text to be translated.
The electronic device 100 is a terminal device that interacts with the user, on which application software or an application system capable of executing the NMT-Transformer based on neural machine translation (NMT) is installed. A human-machine dialogue system may also be installed on the electronic device 100 to recognize the user's voice instruction requesting the translation function to be performed, and the electronic device 100 may have a function of recognizing text in pictures or videos so as to convert the pictures or videos into text data for further translation.
Neural Machine Translation (NMT) is a machine translation method proposed in recent years. Compared with traditional Statistical Machine Translation (SMT), NMT can train a neural network that maps one sequence to another, and its output can be a variable-length sequence, which achieves very good performance in translation, dialogue, and text summarization. NMT is essentially an encoder-decoder system: the encoder encodes the source language sequence and extracts the information in the source language, and the decoder then converts this information into another language, namely the target language, thereby completing the translation. The current mainstream machine translation method is based on NMT-Transformer, whose core network architecture is a Transformer network; that is, the encoder and decoder functions in NMT-Transformer are implemented through the Transformer network. The core of the Transformer network is the self-attention layer, which computes self-attention in the vector space. Self-attention can be understood as a degree of correlation: the self-attention between two highly correlated vectors is large, while the self-attention between two weakly correlated or uncorrelated vectors is small or zero.
As mentioned above, the current NMT-Transformer translation method suffers, when translating scene-based short texts, from inaccurate translation caused by the lack of context. For example, as shown in FIG. 2, in a restaurant dining scene, a current translation device may incorrectly translate "battered whiting" into "遭受重创的鳕鱼" (literally, "heavily battered cod"), which is quite unrelated to the real dish (the correct translation should be "炸鳕鱼", fried cod). Therefore, in scene-based short text translation scenarios, the existing machine translation methods have low translation accuracy, resulting in a poor user experience. Scene-based short texts include, but are not limited to, dish names on a menu, shop names in a shopping mall, special terms on an immigration card, and the like. To solve this technical problem, the current traditional solution is to add extra information of other dimensions in the decoding stage or post-processing stage of machine translation to replace the contextual information of the text, so as to improve the translation accuracy of short texts. However, this solution is essentially a secondary correction of the translation result; since the added extra information is relatively limited and does not match the text to be translated with high accuracy, the translation accuracy of short texts remains relatively low. Moreover, this way of adding extra information leads to large noise when the information is extracted during decoding, so it cannot be used directly in the Transformer network and is therefore not suitable for the current mainstream NMT-Transformer.
To solve the above technical problems, the present application provides a fusion scene-aware machine translation method. On the basis of NMT-Transformer, a scene label is generated based on the scene-awareness data collected by the electronic device (for example, the position data of the scene, the noise data of the scene, and the picture data to be translated in the scene collected by the electronic device). The generated scene label and the text to be translated are then used together as a source language sequence that is fusion-encoded in the encoding stage of the Transformer network, where the information in the source language sequence is extracted; finally, in the decoding stage of the Transformer network, the information in the source language sequence is converted into the target language, and a translation result that conforms to the scene of the text to be translated is obtained by decoding. By fusing the scene label into the text to be translated, the present application greatly improves the translation accuracy of scene-based short texts.
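A rough end-to-end sketch of this pipeline is given below; the helper callables passed in (sensor collection, scene classification, encoder, decoder) are toy stand-ins introduced for illustration and are not part of the disclosed implementation:

```python
from typing import Callable, Dict, List

def translate_with_scene(
    text: str,
    collect_scene_state: Callable[[], Dict[str, str]],
    classify_scene: Callable[[Dict[str, str]], str],
    encode: Callable[[List[str]], List[float]],
    decode: Callable[[List[float]], str],
) -> str:
    """Illustrative pipeline: scene state -> scene label -> fused encoding -> decoding."""
    scene_state = collect_scene_state()              # e.g. {"location": "mall", "noise": "indoor"}
    scene_label = classify_scene(scene_state)        # e.g. "restaurant", from the trained classifier
    source_sequence = [scene_label] + text.split()   # the scene label joins the source language sequence
    encoded = encode(source_sequence)                # fusion scene-aware encoded data
    return decode(encoded)                           # fusion scene-aware translation result

# Toy stubs standing in for the sensors, the classifier, and the NMT-Transformer.
result = translate_with_scene(
    "battered whiting",
    collect_scene_state=lambda: {"location": "mall", "noise": "indoor"},
    classify_scene=lambda state: "restaurant",
    encode=lambda seq: [float(len(token)) for token in seq],
    decode=lambda enc: "placeholder scene-aware translation",
)
print(result)
```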
The electronic device 100 or the electronic device 200 to which the fusion scene-aware machine translation method of the present application is applied can generate a corresponding scene label based on the scene-awareness data collected by the electronic device 100 in real time, embed the scene label, as part of the source language, into the text to be translated for translation encoding and translation decoding during the translation process, and finally obtain a short text translation result that conforms to the scene where the electronic device 100 is located.
Compared with the traditional way of improving the accuracy of short text translation by adding extra information in the decoding stage of machine translation to replace contextual information, the solution of the present application can comprehensively fuse scene-awareness data so that the generated scene label participates in both the encoding stage and the decoding stage of machine translation throughout the process, and can therefore stably improve the translation accuracy. The scene label fused in this solution is more diversified than the extra information added in the prior art; the accuracy of the scene label is higher and is not affected by the absence of a certain type of scene-awareness data (for example, data missing because a certain sensing element on the electronic device 100 is not turned on), which would otherwise affect the judgment of the scene label. In some embodiments, if a certain type of scene-awareness data is missing, other scene-awareness data can be supplemented in time as a replacement to generate an accurate scene label. Therefore, the scene characteristic information contained in the scene label fused in this solution is of higher quality, and the embedding of the scene characteristic information is more flexible. The solution of the present application also solves the problem that adding extra information directly in the decoding stage leads to large noise in the information extracted by decoding, and correspondingly improves the user experience.
It can be understood that, in the present application, the electronic device 100 includes, but is not limited to, a laptop computer, a tablet computer, a mobile phone, a wearable device, a head-mounted display, a server, a mobile e-mail device, a portable game console, a portable music player, a reader device, a television in which one or more processors are embedded or coupled, or another terminal electronic device capable of accessing a network. The electronic device 100 can collect scene-awareness data through its own sensors, Global Positioning System (GPS), camera, and the like, and the electronic device 100 can also be used to train the classifier so that it can generate scene labels based on the scene-awareness data.
It can be understood that the electronic device 200 includes, but is not limited to, a cloud, a server, a laptop computer, a desktop computer, a tablet computer, and other electronic devices capable of accessing a network in which one or more processors are embedded or coupled.
For ease of description, the technical solutions of the present application are described in detail below by taking the electronic device 100 being a mobile phone and the electronic device 200 being a server as an example.
The specific flow of the solution of the present application is described in detail below with reference to FIG. 3. As shown in FIG. 3, the fusion scene-aware machine translation method of the present application includes the following steps:
301: The mobile phone 100 acquires the text to be translated and scene-awareness data, and obtains scene state data based on the scene-awareness data.
The text to be translated may be acquired by directly inputting it through the input interface of the mobile phone 100, or by taking a photo or video with the mobile phone 100; the text to be translated may also be text data obtained by recognizing and converting a user's voice instruction, which is not limited here.
For example, after the mobile phone 100 takes a photo or video, the mobile phone 100 extracts, through its own image recognition system, the text information in the photo or in an image captured from the video, and converts it into the text to be translated.
The mobile phone 100 acquires a voice instruction issued by the user; for example, the user can send a voice instruction to the mobile phone 100 by waking up the voice assistant, and the mobile phone 100 recognizes the text information in the user's voice instruction through its own human-machine dialogue system and converts it into the text to be translated.
The scene-awareness data may be acquired as images, sounds, and other data collected by various detection elements of the mobile phone 100, such as the camera, the microphone, an infrared sensor, or a depth sensor.
FIG. 4 is a schematic diagram of the data conversion process in the fusion scene-aware machine translation method of the present application. As shown in FIG. 4, the mobile phone 100 can acquire scene-awareness data through the microphone, the gyroscope, the acceleration sensor, GPS, and computer vision (CV). The mobile phone 100 can also collect health status data through health detection sensors on a smart wearable device such as a wristband or watch (for example, heart rate data collected by a PPG sensor, blood oxygen data collected by a blood oxygen detection sensor, etc.), or collect step-counting data through a wristband or watch, as one kind of scene-awareness data, which is not limited here.
In addition, there may be one or more ways to acquire the same kind of scene-awareness data; for example, position information may be acquired through the above GPS, or through a method of obtaining positioning information from Wi-Fi signals, which is not limited here.
Further, as shown in FIG. 4, the mobile phone 100 can analyze the collected scene-awareness data to obtain scene state data. In the prior art, one or more judgment rules for scene state data may be preset in the mobile phone 100.
As an example of the judgment rules, the mobile phone 100 can determine whether it is indoors or outdoors according to the noise type or noise level. For example, the mobile phone 100 can set noise with a noise level in the range of level 2 to level 4 as the indoor noise range, and set the range of level 4 to level 6 as the outdoor noise range. When the sound picked up by the microphone of the mobile phone 100 is identified as level-2 noise, the corresponding scene state data obtained by the analysis of the mobile phone 100 is indoor noise; when the sound picked up by the microphone of the mobile phone 100 is identified as level-5 noise, the corresponding scene state data obtained by the analysis is outdoor noise. The mobile phone 100 can set construction noise (for example, noise produced by road construction machinery) and traffic noise (for example, car horns, car engine sounds, the friction between tires and the road, etc.) as outdoor noise, set sounds played in public places such as shopping malls, airports, and station platforms (for example, announcements, music, etc.) as indoor noise, and can also set some daily-life noise (for example, the sound of playing mahjong and noise from other entertainment venues) as indoor noise. The above noise types are identified mainly based on the different frequencies and voiceprints of different sounds; in some embodiments, the mobile phone 100 may also simply use the frequency range of the sound as the basis for judging the noise type, which is not limited here. Therefore, the mobile phone 100 can analyze the noise type or noise level of the sound (noise) collected by the microphone to obtain the scene state data, namely whether the scene is an indoor scene or an outdoor scene.
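A minimal sketch of such a preset rule, assuming the level thresholds given above (the function name and the handling of the boundary at level 4 are illustrative assumptions):

```python
def noise_scene_state(noise_level: int) -> str:
    """Map a measured noise level to a scene state, following the example thresholds above."""
    if 2 <= noise_level < 4:
        return "indoor noise"
    if 4 <= noise_level <= 6:
        return "outdoor noise"
    return "unknown"

print(noise_scene_state(2))  # indoor noise
print(noise_scene_state(5))  # outdoor noise
```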
As an example of the judgment rules, the mobile phone 100 can determine, according to GPS position information and online map data, whether the location of the current scene is a shopping mall, a station, or the like; the mobile phone 100 can also analyze, based on the GPS position data and the target to be translated photographed through CV (text, pictures, etc., for example a menu), that the scene state data corresponds to a certain restaurant or a restaurant in a certain shopping mall.
In other embodiments, as an example of the judgment rules, the mobile phone 100 can also analyze scene state data such as the walking/running/riding state, step-counting data, and motion trajectory based on its own gyroscope and acceleration sensor, the position data measured by GPS, and the heart rate data measured by a wearable device connected to the mobile phone 100, such as a watch.
In other embodiments, as an example of the judgment rules, the mobile phone 100 can also analyze, based on the images collected through CV, which means of transport the user is currently taking; for example, if the images collected through CV show seats on a subway, the display interface of a subway station announcement screen, bus seats, or station information diagrams posted inside a bus, the mobile phone 100 can determine that the scene where the user is located is a subway-riding or bus-riding scene.
It can be understood that, when analyzing the scene state data, if a certain type of scene-awareness data is missing, the mobile phone 100 can call other scene-awareness data to replace the missing scene-awareness data in order to determine the scene state data. For example, when the GPS of the mobile phone 100 is not turned on, the mobile phone 100 cannot collect position data, and the mobile phone 100 can then obtain scene state data by analyzing data such as the sound collected by the microphone and the environmental characteristics collected by the infrared sensor.
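One way such a fallback could look in code (a sketch under the assumption that each analyzer returns None when its sensor data is unavailable; the analyzer names and ordering are illustrative only):

```python
from typing import Callable, List, Optional

def scene_state_with_fallback(analyzers: List[Callable[[], Optional[str]]]) -> str:
    """Try each scene-awareness analyzer in order and use the first one that yields data."""
    for analyze in analyzers:
        state = analyze()
        if state is not None:   # this source produced usable scene state data
            return state
    return "unknown"

# Toy usage: GPS is off (returns None), so the microphone-based analysis decides.
state = scene_state_with_fallback([
    lambda: None,               # GPS not turned on, no position data
    lambda: "indoor noise",     # microphone-based noise analysis
])
print(state)  # indoor noise
```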
场景状态数据可以基于一种或多种场景感知数据分析得到,一般如果是简单易区分的场景,手机100基于较少的场景感知数据就可以确定其场景状态数据,例如,对于车站或机场场景,手机100可能只需要融合GPS采集的位置信息,或麦克风采集到的车站内的声音类型等就可以判断基础场景状态。如果是有些比较复杂难以辨别的场景,手机100可能需要融合多种感知数据综合判断基础场景状态,在此不做限制。The scene state data can be obtained by analyzing one or more kinds of scene perception data. Generally, if it is a simple and easily distinguishable scene, the mobile phone 100 can determine its scene state data based on less scene perception data. For example, for a station or an airport scene, The mobile phone 100 may only need to integrate the location information collected by the GPS, or the type of sound in the station collected by the microphone, etc., to determine the state of the basic scene. If it is a scene that is relatively complicated and difficult to distinguish, the mobile phone 100 may need to integrate various sensing data to comprehensively judge the state of the basic scene, which is not limited here.
It can be understood that the different ways of acquiring the text to be translated described above may correspond to different scene perception data and different ways of acquiring that data, which are not limited here.
For example, when the mobile phone 100 obtains the text to be translated by taking a photo or a video, the scene perception data acquired by the mobile phone 100 may include scene feature image data (for example, image data collected through computer vision (CV)), location data (for example, angular motion data collected by the gyroscope of the mobile phone, location data collected by the GPS of the mobile phone), sound data (for example, sound data collected by the microphone), and so on.
When the text to be translated is input manually, the camera of the mobile phone 100 does not need to be turned on and therefore remains off, and the mobile phone 100 may not collect image data through CV. In this case, the scene perception data acquired by the mobile phone 100 may include location data (for example, angular motion data collected by the gyroscope, location data collected by GPS), sound data (for example, sound data collected by the microphone), motion data (for example, heart rate data collected by a smart watch or smart band, acceleration data collected by an acceleration sensor), environmental data (for example, ambient temperature data collected by a temperature sensor, ambient light intensity data collected by an ambient light sensor), and so on. It can be understood that, in some scenarios, even though no shooting interface is shown on the screen of the mobile phone 100, the camera of the mobile phone 100 may still work in the background to collect CV signals.
When the text to be translated is acquired through a voice command, in order to prevent mutual interference between the sound data and the acquisition of scene perception data, the camera of the mobile phone 100 does not need to be turned on, and the mobile phone 100 may not collect image data through CV. In this case, the scene perception data acquired by the mobile phone 100 may include location data (for example, angular motion data measured by the gyroscope, location data collected by GPS), motion state data of the user (for example, heart rate data and blood oxygen data collected by a smart watch, smart band, or the like), environmental data (for example, ambient temperature data collected by a temperature sensor, ambient light intensity data collected by an ambient light sensor), and so on. It can be understood that, in some scenarios, even though no shooting interface is shown on the screen of the mobile phone 100, the camera of the mobile phone 100 may still work in the background to collect CV signals.
It can be understood that the device that acquires the scene perception data and the device that determines the basic scene state may be the same electronic device (for example, the mobile phone 100 may both collect the scene perception data and directly analyze it to obtain the scene state data), or may be different electronic devices. For example, the scene perception data collected by the mobile phone 100 may be sent to the server 200 for further analysis to obtain the scene state data; or the scene perception data may be collected by a smart wearable device such as a watch or a wristband and sent to the mobile phone 100 for further analysis to obtain the scene state data, which is not limited here.
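By way of illustration of the above fusion logic, the following minimal sketch shows one possible way of deriving scene state data from whichever scene perception data is available, with a fallback when one source such as GPS is missing; the sensor inputs, thresholds, and the SceneState structure are assumptions made for illustration only and are not limited here.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SceneState:
    # Scene state data characterizing the scene the device is in.
    place_type: Optional[str]   # e.g. "restaurant", "airport", None if unknown
    indoor: Optional[bool]      # indoor/outdoor judgement
    noise_level: Optional[str]  # e.g. "quiet", "noisy"

def analyze_scene_state(gps_place=None, mic_noise_db=None, cv_features=None):
    """Derive scene state data from whichever scene perception data is available.

    gps_place:    a place category resolved from GPS coordinates, or None if GPS is off
    mic_noise_db: ambient sound level measured by the microphone, or None
    cv_features:  labels detected in camera images (e.g. {"menu", "dining table"}), or None
    """
    place_type = gps_place
    # Fallback: if GPS is unavailable, infer the place type from image features instead.
    if place_type is None and cv_features:
        if {"menu", "dining table"} & cv_features:
            place_type = "restaurant"
        elif {"departure board", "luggage"} & cv_features:
            place_type = "airport"

    indoor = None
    noise_level = None
    if mic_noise_db is not None:
        noise_level = "noisy" if mic_noise_db > 60 else "quiet"
        indoor = mic_noise_db < 75  # crude heuristic for illustration only

    return SceneState(place_type=place_type, indoor=indoor, noise_level=noise_level)

# Example: GPS is off, but camera and microphone data still identify a restaurant scene.
state = analyze_scene_state(gps_place=None, mic_noise_db=55,
                            cv_features={"menu", "dining table"})
print(state)  # SceneState(place_type='restaurant', indoor=True, noise_level='quiet')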
302: The mobile phone 100 generates a scene label based on the obtained scene state data.
Specifically, the mobile phone 100 classifies the scene state data obtained by the above analysis and annotates it with a scene label; different scene state data may correspond to the same scene label. It can therefore be understood that the correspondence between scene state data and scene labels is many-to-one or one-to-one.
Generating the scene label from the scene state data on the mobile phone 100 may be accomplished by a pre-trained scene classifier. For example, the mobile phone 100 may input the scene state data into a Gradient Boosting Decision Tree (GBDT) classifier for classification training, and annotate scene state data that falls into the same or similar categories with the same scene label. It can be understood that the scene state data may be sample scene state data collected specifically for training the classifier, or scene state data obtained through analysis during actual machine translation use; the scene state data may accumulate over time to form a scene state database, and the corresponding scene labels may likewise accumulate over time to form a scene label library. Since the classifier algorithm occupies relatively little storage space, the classifier may be trained on an electronic device such as the mobile phone 100, or the training may be completed on the server 200, which is not limited here.
The above GBDT classifier is a classifier that applies the GBDT algorithm. Among traditional machine learning algorithms, the GBDT algorithm is one of the best at fitting the true distribution; it can be used for both classification and regression, and can also be used to select features. The principle of the GBDT algorithm is that, through multiple rounds of iteration, each round produces a weak classifier, and each classifier is trained on the residual of the classifiers from the previous round. The weak classifiers are generally required to be sufficiently simple, with low variance and high bias, because the training process continuously improves the accuracy of the final classifier by reducing the bias. The decision tree used by the GBDT algorithm is the CART regression tree. During training, the GBDT classifier can classify the scene state data, and the mobile phone 100 or the server 200 can then manually annotate the scene state data falling into the same or similar categories with the same scene label; the scene labels obtained by training on a large number of scene state samples form a scene label database. For example, the scene state data that the mobile phone 100 obtains from GPS data may be a well-known restaurant or a shopping mall, the scene state data obtained from the type of noise in the sound collected by the microphone may be indoor noise, and the photos of the target item and of the surrounding environment captured through CV may be analyzed to show that the scene state data is a menu. Combining the above scene state data, it can be determined that the current scene is a menu translation scene while dining in a restaurant, so "餐馆" or "restaurant" can be annotated as the scene label.
It can be understood that the mobile phone 100 or the server 200 may also train other classification algorithm models to generate scene labels corresponding to the scene state data. Other classification algorithms include, but are not limited to, the Support Vector Machine (SVM) algorithm, the Logistic Regression (LR) algorithm, the Iterative Dichotomiser 3 (ID3) decision tree algorithm, and the like, which are not limited here.
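As a concrete illustration of such a classifier, the following minimal sketch assumes that scene state data has already been encoded as fixed-length numeric feature vectors and uses scikit-learn's GradientBoostingClassifier as one possible GBDT implementation; the feature encoding and the training samples are hypothetical.

from sklearn.ensemble import GradientBoostingClassifier

# Each sample is a feature vector derived from scene state data, e.g.
# [place_type_id, indoor(0/1), noise_level_id, menu_detected(0/1)].
# The feature encoding is assumed for illustration only.
X_train = [
    [1, 1, 2, 1],   # near a restaurant, indoor, moderate noise, menu detected
    [1, 1, 1, 1],
    [2, 1, 3, 0],   # near an airport, indoor, loud noise, no menu
    [2, 0, 3, 0],
    [3, 1, 1, 0],   # near a customs hall / immigration counter
    [3, 1, 2, 0],
]
y_train = ["restaurant", "restaurant", "airport", "airport", "immigration", "immigration"]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)

# At translation time, the current scene state data is encoded the same way
# and the predicted class is used as the scene label.
scene_label = clf.predict([[1, 1, 2, 1]])[0]
print(scene_label)  # "restaurant"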
303: The mobile phone 100 encodes the above scene label together with the text to be translated as a source language sequence through the encoder in the NMT-Transformer, and extracts the scene information and the to-be-translated text information from the source language sequence.
Specifically, the scene label and the text to be translated are input into the encoder of the NMT-Transformer (the encoding layers of the Transformer network) for encoding. The encoding layers of the Transformer network are implemented by multiple layers of self-attention, where the attention vectors output by each self-attention layer serve as the input of the next self-attention layer.
In the process of inputting the scene label and the text to be translated into the encoder for encoding, two aspects need to be considered when deciding how to embed the scene label into the text to be translated: first, which kind of content in the text to be translated is most relevant to the content of the scene label; second, the position at which the scene label is embedded in the text to be translated should be closer to the text content with higher relevance.
FIG. 5 is a schematic diagram of embedding a scene label into the text to be translated during encoding. As shown in FIG. 5, in a menu translation scene for example, X5, X6, X7, and X8 denote the words that make up the scene label, and X9, X10, X11, and X12 denote the words that make up the text to be translated. Here, the text to be translated is text data converted from the photographed menu, and the scene label is 餐馆 or restaurant. The dish names on the menu are highly relevant to the scene label, while the dish prices have low or no relevance to it. Therefore, when the scene label and the text to be translated are input into the Transformer network, the scene label should be input before the dish-name text so that the scene label is closer to the dish names. For example, if each line of the menu consists of dish name + price + dish description, the scene label "restaurant" can be input before each line of menu text is fed into the Transformer network. It can be understood that, for text to be translated that is closer to the scene label, the Transformer network extracts more scene information from the scene label while encoding and extracting the text information, that is, the scene label pays more attention to that text, represented in FIG. 5 by the strong-correlation curves; conversely, for text to be translated that is farther from the scene label, the Transformer network extracts less scene information from the scene label while encoding and extracting the text information, that is, the scene label pays less attention to that text, represented by the weak-correlation curves.
For example, for the menu text BATTERED WHITING 16.0M|19.0NM, with Restaurant input before it, the source language sequence input into the Transformer network is: Restaurant BATTERED WHITING 16.0M|19.0NM. The distance between Restaurant and BATTERED WHITING is smaller, so Restaurant pays more attention to BATTERED WHITING and has a greater influence on its translation result; under the influence of Restaurant, BATTERED WHITING is translated as 炸鳕鱼 (a fried fish dish) rather than the earlier erroneous literal translation 遭受重创的鳕鱼 (a cod that has been badly battered). The distance between Restaurant and 16.0M|19.0NM is larger, the attention is small, the influence on that part of the translation result is also small, and its translation remains unchanged.
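The construction of such a source language sequence can be sketched as follows; the helper function and line format are assumptions for illustration, and in practice the embedding position may be chosen according to the relevance considerations discussed above.

def build_source_sequences(text_lines, scene_label="Restaurant"):
    """Prepend the scene label to each line of to-be-translated text so that the
    label sits immediately before the content it is most relevant to (here, the
    dish name at the start of each menu line)."""
    return [f"{scene_label} {line}" for line in text_lines]

menu_lines = ["BATTERED WHITING 16.0M|19.0NM"]
print(build_source_sequences(menu_lines))
# ['Restaurant BATTERED WHITING 16.0M|19.0NM']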
304: Through the decoder in the NMT-Transformer, the mobile phone 100 decodes, word by word, the scene information and the to-be-translated text information extracted from the scene label and the text to be translated during the encoding stage into a translation expressed in the target language, and outputs the translation result.
Specifically, the decoder (the decoding layers of the Transformer network) selects the target language for decoding based on the scene information extracted by the encoder during encoding, to obtain the translation. The decoding layers of the Transformer network are likewise implemented by multiple layers of self-attention.
It can be understood that, since in the encoding stage of the NMT-Transformer the Transformer network has already extracted the scene information in the embedded scene label while extracting the to-be-translated text information, in the decoding stage the Transformer network can select the target language for decoding directly based on the scene information in the scene label, thereby obtaining a translation that better matches the scene.
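For readers who want to see the general shape of the encode-decode computation, the following minimal PyTorch sketch runs a generic Transformer encoder-decoder over a fused source sequence (scene label tokens followed by to-be-translated text tokens); it is an illustrative assumption rather than the NMT-Transformer actually used, and it omits tokenization, positional encoding details, masking, training, and beam search.

import torch
import torch.nn as nn

VOCAB_SIZE = 32000   # assumed shared source/target vocabulary size
D_MODEL = 512

class TinyNMT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (src_len, batch) token ids of "scene label + text to be translated"
        # tgt_ids: (tgt_len, batch) token ids of the target-language prefix so far
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        dec_out = self.transformer(src, tgt)   # self-attention encoder and decoder stacks
        return self.out(dec_out)               # (tgt_len, batch, vocab) logits

model = TinyNMT()
# Toy example: 6 source tokens (scene label + menu text), 3 target tokens decoded so far.
src_ids = torch.randint(0, VOCAB_SIZE, (6, 1))
tgt_ids = torch.randint(0, VOCAB_SIZE, (3, 1))
logits = model(src_ids, tgt_ids)
next_token = logits[-1, 0].argmax()   # greedy choice of the next target-language word
print(logits.shape, int(next_token))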
FIG. 6 is a schematic comparison of interfaces showing the translation results of a scene-based short text according to the present application. As shown in FIG. 6(a), which is the translation result interface of a conventional translation apparatus or device, the translation finally decoded for the dish name Fisherman's Basket is 渔夫的篮子 (a fisherman's basket), which is obviously a wrong translation result. As shown in FIG. 6(b), which is the translation result interface of a translation device applying the fusion scene-aware machine translation method of the present application, based on the scene information extracted from the scene label (restaurant), the translation obtained by decoding the dish name Fisherman's Basket is 海鲜拼盘 (seafood platter), and the translation result is correct.
Another implementation scenario is described below with reference to FIG. 7.
FIG. 7 is a schematic comparison of interfaces showing the translation results of another scene-based short text according to the present application. As shown in FIG. 7, the translation results of an arrival card filled in at an entry-exit (immigration) scene are displayed. The text to be translated shown in FIG. 7(a) is the source-language text of the arrival card (customs declaration card); FIG. 7(b) shows the translation result of a conventional translation device; and FIG. 7(c) shows the translation result of a translation device applying the fusion scene-aware machine translation method of the present application.
With reference to FIG. 3 and its related description, the process by which the fusion scene-aware machine translation method of the present application produces the translation result shown in FIG. 7(c) includes the following steps:
S1: Acquire the text to be translated and the scene perception data, and obtain the scene state data based on the scene perception data.
Specifically, on the one hand, the camera of the mobile phone 100 is turned on to photograph the arrival card page, and the mobile phone 100 extracts the text to be translated from the photographed arrival card through its own image recognition system.
On the other hand, the mobile phone 100 collects location data through GPS, sound data through the microphone, and environmental feature image data through CV as the scene perception data. The scene state data is then obtained from the collected scene perception data: for example, the location data collected by GPS is used to judge whether the current location or a nearby calibrated geographic marker is an airport, a customs office, or the like; the sound data collected by the microphone is used to judge whether the environment is indoor or outdoor; and the environmental feature image data collected through CV is used to judge whether the environment contains a registration window, registration forms, and so on. If the use of CV is restricted in some entry-exit scenes, image data need not be collected through CV as scene perception data, and the scene state data can be obtained from the other scene perception data. The quantity and types of scene perception data collected by the mobile phone 100 are not limited here. For details, refer to step 301 and the related description above, which are not repeated here.
S2: Generate the scene label, entry-exit, based on the scene state data obtained in S1. The classifier already trained in the mobile phone 100 can quickly generate the entry-exit scene label directly from the scene state data obtained above. For details, refer to step 302 and the related description above, which are not repeated here.
S3: The mobile phone 100 encodes the generated entry-exit scene label together with the text to be translated as a source language sequence through the encoder in the NMT-Transformer, and extracts the entry-exit scene information and the to-be-translated text information from the source language sequence. For the specific encoding process, refer to step 303 and the related description above, which are not repeated here.
S4: Through the decoder in the NMT-Transformer, the mobile phone 100 decodes, word by word, the entry-exit scene information and the to-be-translated text information extracted during encoding from the entry-exit scene label and the text to be translated into a translation expressed in the target language, and outputs the translation result, as shown in FIG. 7(c). For the specific decoding process, refer to step 304 and the related description above, which are not repeated here.
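Combining S1 to S4, a high-level sketch of the whole pipeline could look like the following; it reuses the illustrative helpers from the earlier sketches, and encode_scene_state and nmt_model.translate are hypothetical names standing in for the feature-encoding and translation steps rather than actual interfaces of this application.

def translate_with_scene(text_lines, gps_place, mic_noise_db, cv_features,
                         scene_classifier, nmt_model):
    # S1: derive scene state data from whatever scene perception data is available.
    state = analyze_scene_state(gps_place, mic_noise_db, cv_features)

    # S2: map the scene state data to a scene label with the pre-trained classifier.
    features = encode_scene_state(state)          # hypothetical feature encoding step
    scene_label = scene_classifier.predict([features])[0]

    # S3 + S4: encode the scene label together with each line of text and decode
    # the target-language translation with the NMT model.
    sources = build_source_sequences(text_lines, scene_label)
    return [nmt_model.translate(src) for src in sources]   # hypothetical translate()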
In the translation result shown in FIG. 7(c), some specialized expressions on the arrival card, for example "Please print in capital letters", are accurately translated as 请用大写字符填写 (please fill in using capital letters), in which "print" is correctly translated as 填写 (fill in).
By contrast, in the conventional translation result shown in FIG. 7(b), the sentence "Please print in capital letters" is translated as 请用大写字母打印 (please print out in capital letters), which is obviously wrong. Therefore, the translation result obtained after fusing the scene label (entry-exit scene) is more accurate and provides a better user experience.
As described above, in practical applications, an application program applying the fusion scene-aware machine translation method of the present application may be embedded in the mobile phone 100 to achieve accurate translation of scene-based short text. Alternatively, application software may be installed on the mobile phone 100, the text to be translated is sent to the server 200 through interaction with the server 200, and the server 200 completes the translation based on the fusion scene-aware machine translation method and feeds the translation result back to the mobile phone 100. The mobile phone 100 may also access a web-based translation engine through its own browser, send the text to be translated to the server 200 through interaction with the server 200, and receive the translation result fed back by the server 200 after the server 200 completes the translation based on the fusion scene-aware machine translation method, which is not limited here.
An exemplary structure of the electronic device 100 is given below in conjunction with the embodiments of the present application.
FIG. 8 is a schematic structural diagram of the mobile phone 100 according to an embodiment of the present application.
The mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), among others. Different processing units may be independent devices or may be integrated into one or more processors. The controller may generate operation control signals according to instruction operation codes and timing signals, and complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
In the embodiments of the present application, the mobile phone 100 may train the scene classifier and the encoder and decoder of the NMT-Transformer through the processor 110, and in the actual scene-based short text translation process, the processor 110 processes the scene perception data and the text to be translated acquired by the mobile phone 100 and executes the fusion scene-aware machine translation method described in steps 301 to 304 above. In some embodiments, the processor 110 may include one or more interfaces.
It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present invention are only schematic and do not constitute a structural limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may also adopt interface connection manners different from those in the above embodiments, or a combination of multiple interface connection manners.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of the wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through a wireless charging coil of the mobile phone 100. While charging the battery 142, the charging management module 140 may also supply power to the electronic device through the power management module 141.
The power management module 141 is configured to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, and battery health (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.
The wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like. The mobile phone 100 communicates and transmits data with the server 200 through this wireless communication function.
The antenna 1 and the antenna 2 are configured to transmit and receive electromagnetic wave signals. Each antenna in the mobile phone 100 may be used to cover one or more communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization; for example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 may provide wireless communication solutions applied on the mobile phone 100, including 2G/3G/4G/5G. The wireless communication module 160 may provide wireless communication solutions applied on the mobile phone 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
In some embodiments, the antenna 1 of the mobile phone 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the mobile phone 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
The mobile phone 100 implements its display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric computation for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information. The images or text collected by the mobile phone 100 for the to-be-translated scene-based short text are displayed on the display screen 194, and the translation result of the text to be translated is also displayed on the display screen 194 as feedback to the user.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The mobile phone 100 may include one or N display screens 194, where N is a positive integer greater than 1.
The SIM card interface 195 is used to connect a SIM card.
The mobile phone 100 can implement its shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like. The collection of CV signals by the mobile phone 100 may also be implemented through this shooting function, that is, images of the current scene or images containing the text to be translated are collected through the shooting function.
The camera 193 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then transmits the electrical signal to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the mobile phone 100 may include one or N cameras 193, where N is a positive integer greater than 1.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capacity of the mobile phone 100. The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like. The data storage area may store data created during use of the mobile phone 100 (such as audio data and a phone book). In addition, the internal memory 121 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the mobile phone 100 by running the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor. In the embodiments of the present application, the processor 110 executes the fusion scene-aware machine translation method of the present application by running the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
The mobile phone 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.
The audio module 170 is used to convert digital audio information into an analog audio signal output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The microphone 170C, also called a "mic" or "mike", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C. The mobile phone 100 may be provided with at least one microphone 170C. In other embodiments, the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the mobile phone 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording functions, and so on. In the implementation of the present application, sound signals may be collected through the microphone 170C, and a noise level or noise type may be determined from the collected sound signals to further analyze the scene state data, for example whether the device is indoors or outdoors.
The headset jack 170D is used to connect wired headsets.
The gyroscope sensor 180B may be used to determine the motion posture of the mobile phone 100. In some embodiments, the angular velocities of the mobile phone 100 around three axes (namely, the x, y, and z axes) may be determined by the gyroscope sensor 180B. The gyroscope sensor 180B may be used for image stabilization during shooting. For example, when the shutter is pressed, the gyroscope sensor 180B detects the shaking angle of the mobile phone 100, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to counteract the shaking of the mobile phone 100 through reverse motion to achieve stabilization. The gyroscope sensor 180B may also be used for navigation and motion-sensing game scenarios.
The acceleration sensor 180E can detect the magnitude of the acceleration of the mobile phone 100 in various directions (generally along three axes). When the mobile phone 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to recognize the posture of the mobile phone and is applied to landscape/portrait switching, pedometers, and other applications. In the implementation of the present application, certain scene state data, such as the user's walking, running, or riding state, can be obtained by analyzing the shaking state data measured by the gyroscope sensor 180B and the acceleration data measured by the acceleration sensor 180E.
The distance sensor 180F is used to measure distance. The mobile phone 100 can measure distance by infrared or laser. In some embodiments, when shooting a scene, the mobile phone 100 can use the distance sensor 180F to measure distance to achieve fast focusing.
The ambient light sensor 180L is used to sense ambient light brightness. The mobile phone 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking photos, and may cooperate with the proximity light sensor 180G to detect whether the mobile phone 100 is in a pocket to prevent accidental touches. In the implementation of the present application, the scene state data may be analyzed based on the ambient light brightness sensed by the ambient light sensor 180L, for example, to judge whether the current scene is indoor or outdoor.
The keys 190 include a power key, volume keys, and the like. The keys 190 may be mechanical keys or touch keys. The mobile phone 100 can receive key input and generate key signal input related to user settings and function control of the mobile phone 100.
The software system of the mobile phone 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present invention take an Android system with a layered architecture as an example to illustrate the software structure of the mobile phone 100.
FIG. 9 is a block diagram of the software structure of the mobile phone 100 according to an embodiment of the present invention.
The layered architecture divides the software into several layers, each of which has a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 9, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Videos, and Messages.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 9, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, the phone book, and the like.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system may be used to build applications. A display interface may consist of one or more views. For example, a display interface that includes a short-message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the mobile phone 100, for example, management of call status (including connecting, hanging up, and the like).
The resource manager provides applications with various resources, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, and the like. The notification manager may also present notifications in the status bar at the top of the system in the form of charts or scroll-bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt tone is played, the mobile phone 100 vibrates, or the indicator light blinks.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications. The media libraries support playback and recording of many common audio and video formats, as well as still image files, and can support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG. The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
The workflow of the software and hardware of the mobile phone 100 is exemplarily described below with reference to the menu translation scene.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the touch operation into a raw input event (including operations such as opening the translation software or opening the camera 193). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking a touch click operation whose corresponding control is the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video of the menu to be translated through the camera 193.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one example implementation or technique disclosed in accordance with the present application. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application-specific integrated circuits (ASICs), or any other type of medium suitable for storing electronic instructions, each of which may be coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may adopt architectures employing multiple processors for increased computing capability.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform one or more method steps. Structures for a variety of these systems are discussed in the description below. In addition, any specific programming language sufficient to implement the techniques and implementations disclosed in the present application may be used; various programming languages may be used to implement the present disclosure, as discussed herein.
In addition, the language used in this specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or limit the disclosed subject matter. Accordingly, the present disclosure is intended to illustrate, but not to limit, the scope of the concepts discussed herein.

Claims (13)

  1. A fusion scene-aware machine translation method for an electronic device having a machine translation function, wherein the method comprises:
    acquiring a text to be translated and scene perception data, wherein the scene perception data is collected by the electronic device and is used to determine a scene in which the electronic device is located;
    determining, according to the scene perception data, the scene in which the electronic device is located;
    generating, based on the scene in which the electronic device is located, a scene label corresponding to the scene;
    inputting the scene label together with the text to be translated as a source language sequence into an encoder used for translation for encoding, to obtain encoded data fused with scene perception; and
    decoding the encoded data fused with scene perception and converting it into a target language through a decoder used for translation, to obtain a translation result fused with scene perception.
  2. The method according to claim 1, wherein determining, according to the scene perception data, the scene in which the electronic device is located comprises:
    determining, according to the scene perception data, characteristics of the scene in which the electronic device is located to obtain scene state data, wherein the scene state data is used to characterize the scene in which the electronic device is located; and
    performing classification statistics on the scene state data to determine the scene in which the electronic device is located.
  3. The method according to claim 2, wherein the scene perception data is collected by a detection element provided in the electronic device, and the detection element comprises at least one of a GPS element, a camera, a microphone, and a sensor.
  4. The method according to claim 3, wherein the scene perception data comprises one or more of location data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
  5. The method according to claim 4, wherein determining, according to the scene perception data, the characteristics of the scene in which the electronic device is located comprises one or more of the following:
    determining a location name of the scene according to the location data;
    determining characteristic text or characteristic objects in the scene according to one or more of text and target objects in the image data, and determining environmental characteristics of the scene;
    determining a noise type or noise level in the scene according to one or more of frequency, voiceprint, and amplitude in the sound data, and determining whether the scene is indoor or outdoor;
    determining a motion state of the electronic device in the scene according to the acceleration data and the angular motion data; and
    determining a temperature level and a light intensity level of the scene according to the ambient temperature data and the ambient light intensity data, and determining whether the scene is indoor or outdoor.
  6. The method according to claim 5, wherein determining, according to the scene perception data, the scene in which the electronic device is located further comprises:
    determining a user motion state according to the scene perception data, the user motion state being used to determine characteristics of the scene;
    wherein the scene perception data comprises one or more of heart rate data and blood oxygen data.
  7. The method according to claim 6, wherein the order in which the scene tag and the text to be translated are input into the encoder is determined based on the degree of correlation between text content in the text to be translated and the scene tag;
    the greater the correlation between the text content in the text to be translated and the scene tag, the closer the input distance between that text content and the scene tag.
  8. The method according to claim 7, wherein the scene-perception-fused encoded data comprises scene characteristic information in the scene tag and text content information in the text to be translated, both extracted by the encoder during encoding; and
    the encoder extracts the scene characteristic information and the text content information in the order in which the scene tag and the text to be translated are input into the encoder.
  9. The method according to claim 8, wherein the decoder selects, based on the scene characteristic information, words in the target language corresponding to the text content information, to generate the scene-perception-fused translation result.
  10. The method according to claim 9, wherein generating the scene tag based on the scene state data is implemented by a classifier, and the encoder and the decoder are implemented by a neural network model.
  11. The method according to claim 10, wherein the classifier performs classification computation on the scene state data by means of a classification algorithm, the classification algorithm comprising any one of a gradient boosting tree classification algorithm, a support vector machine algorithm, a logistic regression algorithm, and an Iterative Dichotomiser 3 (ID3) decision tree algorithm;
    the neural network model comprises a neural machine translation model based on a Transformer network.
  12. A readable medium, having instructions stored thereon which, when executed on an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 11.
  13. An electronic device, comprising:
    a memory for storing instructions to be executed by one or more processors of the electronic device; and
    a processor, being one of the processors of the electronic device, configured to perform the method according to any one of claims 1 to 11.
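For orientation, the claimed flow can be read as a small pipeline: perception data is reduced to scene state data, a classifier turns that state into a scene tag, and the tag is fed into an ordinary encoder-decoder NMT model together with the source text. The Python sketch below is only an illustrative reading of claims 1, 2, 7, 10, and 11 under stated assumptions: the sensor fields, the relevance() scorer, the translate_fn interface, and the "<restaurant>"-style tag strings are invented for readability and are not defined by the application, and gradient boosting is just one of the classifier options claim 11 enumerates.

# Illustrative sketch only; not the patented implementation.
from dataclasses import dataclass
from typing import Callable, List, Sequence

from sklearn.ensemble import GradientBoostingClassifier  # one classifier option named in claim 11


@dataclass
class ScenePerceptionData:
    # Raw data collected by the detection elements of claims 3-4 (GPS, camera, microphone, sensors).
    noise_level: float      # derived from microphone amplitude/frequency
    light_level: float      # derived from the ambient light sensor
    acceleration: float     # derived from the accelerometer
    indoors: float          # coarse indoor/outdoor estimate, 0.0 or 1.0


def to_scene_state_vector(d: ScenePerceptionData) -> List[float]:
    # Claim 2: reduce perception data to scene state data that characterizes the scene.
    return [d.noise_level, d.light_level, d.acceleration, d.indoors]


# Claims 10-11: a classifier maps scene state data to a scene tag such as "<restaurant>".
# It must be fitted offline on labelled scene state vectors before prediction.
scene_classifier = GradientBoostingClassifier()
# scene_classifier.fit(train_state_vectors, train_scene_tags)


def place_scene_tag(scene_tag: str, text_tokens: Sequence[str],
                    relevance: Callable[[str, str], float]) -> List[str]:
    # One possible reading of claim 7: keep the sentence order intact and attach the tag
    # at whichever end of the text is nearer to the most scene-relevant token.
    best = max(range(len(text_tokens)), key=lambda i: relevance(scene_tag, text_tokens[i]))
    if best < len(text_tokens) / 2:
        return [scene_tag] + list(text_tokens)
    return list(text_tokens) + [scene_tag]


def translate_with_scene(text_tokens: Sequence[str],
                         perception: ScenePerceptionData,
                         relevance: Callable[[str, str], float],
                         translate_fn: Callable[[List[str]], str]) -> str:
    # End-to-end flow of claim 1; translate_fn stands in for a Transformer encoder-decoder.
    state = to_scene_state_vector(perception)
    scene_tag = scene_classifier.predict([state])[0]
    source_sequence = place_scene_tag(scene_tag, text_tokens, relevance)
    return translate_fn(source_sequence)

In this reading the scene tag behaves like one more source token, so claim 8's ordered feature extraction and claim 9's scene-conditioned word choice are properties the encoder-decoder learns from training data rather than logic written out explicitly in such a pipeline.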
PCT/CN2021/119655 2020-10-10 2021-09-22 Fusion scene perception machine translation method, storage medium, and electronic device WO2022073417A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011079936.XA CN114330374A (en) 2020-10-10 2020-10-10 Fusion scene perception machine translation method, storage medium and electronic equipment
CN202011079936.X 2020-10-10

Publications (1)

Publication Number Publication Date
WO2022073417A1 (en)

Family

ID=81032960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119655 WO2022073417A1 (en) 2020-10-10 2021-09-22 Fusion scene perception machine translation method, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN114330374A (en)
WO (1) WO2022073417A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391839A (en) * 2014-11-13 2015-03-04 百度在线网络技术(北京)有限公司 Method and device for machine translation
CN109074242A (en) * 2016-05-06 2018-12-21 电子湾有限公司 Metamessage is used in neural machine translation
US20200034436A1 (en) * 2018-07-26 2020-01-30 Google Llc Machine translation using neural network models
CN110263353A (en) * 2019-06-25 2019-09-20 北京金山数字娱乐科技有限公司 A kind of machine translation method and device
CN111709431A (en) * 2020-06-15 2020-09-25 厦门大学 Instant translation method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220138439A1 (en) * 2020-11-04 2022-05-05 Adobe Inc. Multi-lingual tagging for digital images
US11645478B2 (en) * 2020-11-04 2023-05-09 Adobe Inc. Multi-lingual tagging for digital images
CN114821257A (en) * 2022-04-26 2022-07-29 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation
CN114821257B (en) * 2022-04-26 2024-04-05 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation
CN115312029A (en) * 2022-10-12 2022-11-08 之江实验室 Voice translation method and system based on voice depth characterization mapping

Also Published As

Publication number Publication date
CN114330374A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2020151387A1 (en) Recommendation method based on user exercise state, and electronic device
WO2022073417A1 (en) Fusion scene perception machine translation method, storage medium, and electronic device
WO2021244457A1 (en) Video generation method and related apparatus
US20220176200A1 (en) Method for Assisting Fitness and Electronic Apparatus
WO2021104485A1 (en) Photographing method and electronic device
WO2022052776A1 (en) Human-computer interaction method, and electronic device and system
WO2023125335A1 (en) Question and answer pair generation method and electronic device
CN112214636A (en) Audio file recommendation method and device, electronic equipment and readable storage medium
CN112383664B (en) Device control method, first terminal device, second terminal device and computer readable storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN111564152A (en) Voice conversion method and device, electronic equipment and storage medium
CN114242037A (en) Virtual character generation method and device
WO2022037479A1 (en) Photographing method and photographing system
WO2023179490A1 (en) Application recommendation method and an electronic device
CN115437601B (en) Image ordering method, electronic device, program product and medium
CN113723397A (en) Screen capturing method and electronic equipment
CN113468929A (en) Motion state identification method and device, electronic equipment and storage medium
CN113538321A (en) Vision-based volume measurement method and terminal equipment
CN114822543A (en) Lip language identification method, sample labeling method, model training method, device, equipment and storage medium
CN114943976A (en) Model generation method and device, electronic equipment and storage medium
CN114080258B (en) Motion model generation method and related equipment
CN113742460B (en) Method and device for generating virtual roles
WO2021036562A1 (en) Prompting method for fitness training, and electronic device
CN112988984A (en) Feature acquisition method and device, computer equipment and storage medium
CN114547429A (en) Data recommendation method and device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21876935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21876935

Country of ref document: EP

Kind code of ref document: A1