WO2022073417A1 - Fusion scene perception machine translation method, storage medium, and electronic device - Google Patents

Fusion scene perception machine translation method, storage medium, and electronic device

Info

Publication number
WO2022073417A1
WO2022073417A1 PCT/CN2021/119655 CN2021119655W WO2022073417A1 WO 2022073417 A1 WO2022073417 A1 WO 2022073417A1 CN 2021119655 W CN2021119655 W CN 2021119655W WO 2022073417 A1 WO2022073417 A1 WO 2022073417A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
data
text
electronic device
translation
Prior art date
Application number
PCT/CN2021/119655
Other languages
French (fr)
Chinese (zh)
Inventor
徐传飞
潘邵武
王成录
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022073417A1 publication Critical patent/WO2022073417A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods

Definitions

  • The invention relates to the technical field of neural network machine translation, and in particular to a fusion scene-aware machine translation method, a storage medium, and an electronic device.
  • Neural Machine Translation (NMT) has developed in two stages: the first stage (2014-2017) was NMT based on the Recurrent Neural Network (RNN), whose core network architecture is an RNN; the second stage (2017-present) is NMT based on the Transformer neural network (hereinafter referred to as NMT-Transformer), whose core network architecture is a Transformer model.
  • Current mainstream translation devices or products based on NMT-Transformer face the same problem: translation accuracy is low when dealing with scene-based short text. A scene-based short text is usually composed of only a few words or characters and its meaning is closely tied to the scene in which it appears, so the contextual information that NMT-Transformer relies on is lacking, resulting in inaccurate translation.
  • The embodiments of the present application provide a fusion scene-aware machine translation method, a storage medium, and an electronic device. A scene label is generated based on scene perception data collected by the electronic device, and the generated scene label and the text to be translated are used together as a source language sequence. In the encoding stage of the Transformer network, the source language sequence is fusion-encoded and the information in it is extracted; in the decoding stage, that information is converted into the target language, and decoding yields a translation result that conforms to the scene of the text to be translated. This greatly improves the translation accuracy of scene-based short texts.
  • An embodiment of the present application provides a fusion scene-aware machine translation method for an electronic device with a machine translation function, including: acquiring text to be translated and scene perception data, where the scene perception data is collected by the electronic device and used to determine the scene where the electronic device is located; determining the scene where the electronic device is located according to the scene perception data; generating a scene label corresponding to that scene; inputting the scene label and the text to be translated together as a source language sequence into an encoder to obtain scene-perception-fused encoded data; and passing the scene-perception-fused encoded data through a decoder, which performs decoding and target-language conversion to obtain a scene-perception-fused translation result.
  • A mobile phone with a machine translation function can collect scene perception data, use the collected data to determine the scene where the phone is currently located, and then generate a scene label corresponding to that scene.
  • The mobile phone can then fuse the generated scene label into the encoding and decoding of machine translation to obtain a scene-perception-fused translation result, as the sketch below illustrates.
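  • The following is a minimal, illustrative sketch (in Python) of the flow just described; the class and function names (ScenePerceptionData, translate_with_scene, classify_scene, encode, decode) are assumptions for illustration, not names used in this application.
    from dataclasses import dataclass, field

    @dataclass
    class ScenePerceptionData:
        location: str = ""                              # e.g. place name resolved from GPS
        noise_level: float = 0.0                        # relative noise level from the microphone
        image_tags: list = field(default_factory=list)  # objects recognized by the camera

    def translate_with_scene(text, perception, classify_scene, encode, decode):
        # classify_scene / encode / decode stand in for the classifier and the
        # NMT-Transformer encoder and decoder described in this application
        scene_label = classify_scene(perception)        # e.g. "restaurant"
        source_sequence = f"{scene_label} {text}"       # scene label + text form one source sequence
        return decode(encode(source_sequence))          # fusion encoding, then target-language decoding

    # usage with trivial stand-ins
    data = ScenePerceptionData(location="shopping mall", noise_level=3.2, image_tags=["menu"])
    print(translate_with_scene("BATTERED WHITING 16.0", data,
                               classify_scene=lambda d: "Restaurant",
                               encode=lambda s: s,
                               decode=lambda e: "[target-language text for: " + e + "]"))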
  • The above-mentioned method further includes: determining the characteristics of the scene in which the electronic device is located according to the scene perception data, so as to obtain scene state data, where the scene state data is used to represent the scene in which the electronic device is located; and classifying and statistically analyzing the scene state data to determine the scene in which the electronic device is located.
  • A mobile phone with a machine translation function can determine the characteristics of the current scene based on the collected scene perception data and record the determined characteristics as scene state information.
  • The scene state data obtained above is then classified and statistically analyzed to obtain scene types, each containing one or more items of scene state information, and each scene type corresponds to a scene in which the mobile phone may be located.
  • The method further includes: the scene perception data is acquired by a detection element provided in the electronic device, and the detection element includes at least one of a GPS element, a camera, a microphone, and a sensor.
  • the scene perception data includes one or more of position data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
  • The mobile phone can continuously collect its current location data through its own GPS element, surrounding sound data through its microphone, and the temperature and light intensity data of the current environment through its temperature sensor and ambient light sensor.
  • The mobile phone can also collect its current angular motion data through the gyroscope.
  • In addition, the mobile phone can collect feature images of the surrounding environment and images of the text to be translated through the camera. Some of the scene perception data collected in this way may be invalid, but most of it is valid for determining the scene where the mobile phone is located and can be used for that purpose.
  • The characteristics of the scene in which the electronic device is located are determined according to the scene perception data in one or more of the following ways: determining the location name of the scene according to the position data; determining characteristic words or characteristic objects in the scene, and thereby the environmental characteristics of the scene, according to one or more of the words and objects in the image data; determining the noise type or noise level in the scene, and whether the scene is indoors or outdoors, according to one or more of the frequency, voiceprint, and amplitude of the sound data; determining the motion state of the electronic device in the scene according to the acceleration data and the angular motion data; and determining the temperature level and light intensity level of the scene, and whether the scene is indoors or outdoors, according to the ambient temperature data and the ambient light intensity data.
  • the mobile phone can determine the location name of the scene in which it is located, such as a shopping mall or an airport, through location data.
  • The mobile phone can also determine features of the scene through image data collected by the camera. For example, if the collected images show subway seats and subway station information, it can be determined that the user is riding the subway; if the text to be translated comes from the station information, the mobile phone can fuse the subway-riding scene into the scene-based short text translation.
  • The mobile phone can also determine whether it is indoors or outdoors through the collected sound data, and some sounds can preliminarily identify the scene itself; for example, the sound of mahjong tiles indicates that the user is indoors and possibly playing mahjong.
  • The acceleration data collected by the acceleration sensor and the angular motion data collected by the gyroscope can be used to determine the mobile phone's current motion state. For example, acceleration patterns differ between subway and bus scenarios, so the vehicle the user is riding can be determined from the acceleration data.
  • The mobile phone can also collect ambient temperature data or ambient light intensity data through the ambient temperature sensor and ambient light sensor to determine whether the scene is indoors or outdoors. Generally, the indoor temperature is lower than the outdoor temperature in summer and higher than the outdoor temperature in winter; during the day the indoor light intensity is lower than the outdoor light intensity, while at night the indoor light intensity is higher than the outdoor light intensity. A sketch of such per-sensor rules is given below.
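  • A hedged sketch, in Python, of the kind of per-sensor rules described above; the thresholds, field names, and categories are illustrative assumptions rather than values specified by this application.
    def scene_features(perception):
        features = {}
        if "place_name" in perception:                      # resolved from GPS / online map data
            features["place"] = perception["place_name"]    # e.g. "shopping mall", "airport"
        if "subway seat" in perception.get("image_tags", []):
            features["transport"] = "subway"                # characteristic object seen by the camera
        if "noise_level" in perception:                     # from the microphone
            features["indoor"] = 2 <= perception["noise_level"] < 4      # assumed indoor range
        if "acceleration" in perception:                    # accelerometer (plus gyroscope in practice)
            features["moving"] = abs(perception["acceleration"]) > 0.5   # assumed threshold
        if "ambient_lux" in perception:                     # ambient light sensor
            features["daylight_outdoor"] = perception["ambient_lux"] > 10000  # assumed daylight level
        return features

    print(scene_features({"place_name": "shopping mall", "noise_level": 3.1, "ambient_lux": 300}))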
  • The method further includes: determining a user motion state according to the scene perception data, where the user motion state is used to determine the characteristics of the scene; the scene perception data includes one or more of heart rate data and blood oxygen data.
  • the mobile phone can obtain the user's heart rate data, blood oxygen data, etc. by connecting to the wearable device.
  • the user's heart rate data or blood oxygen data can be collected through a smart watch or a wristband to determine whether the user is in a state of exercise.
  • When the user is exercising or the amount of exercise increases, the heart rate rises and the blood oxygen level also changes significantly, so the scene where the mobile phone is located can be further determined.
  • For example, the heart rate data changes greatly when the user is exercising in a gym, so the gym scene can be determined from that change in the heart rate data.
  • The user's current altitude can also be determined from heart rate data or blood oxygen data, and combined with location data to determine the user's current scene, so that the mobile phone can translate scene-based short texts with fused scene perception (a toy illustration of such a heart-rate rule is given below).
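  • A toy illustration of such a heart-rate rule; the 1.5x threshold and the resting heart rate are assumptions for illustration only.
    def motion_state(heart_rate_bpm, resting_hr=70):
        # a large rise above the resting heart rate is taken as a sign of exercise
        return "exercising" if heart_rate_bpm > 1.5 * resting_hr else "at rest"

    print(motion_state(130))   # exercising
    print(motion_state(75))    # at rest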
  • The above-mentioned method further includes: the order in which the scene label and the text to be translated are input into the encoder is determined based on the correlation between the text content in the text to be translated and the scene label; the greater the correlation between a piece of text content and the scene label, the closer that text content and the scene label are placed in the input.
  • The scene-perception-fused encoded data includes the scene feature information in the scene label and the text content information in the text to be translated, both extracted by the encoder during encoding; the encoder extracts the scene feature information and the text content information based on the order in which the scene label and the text to be translated are input to it.
  • the scene label "restaurant” can be placed before the dish name in the menu text to be translated and input into the encoder.
  • In this way, the extracted scene feature information is closer to the dish name, so the scene feature information has a greater impact on the translation of the dish name; an illustrative sketch of this ordering rule follows.
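  • The sketch below places the scene label immediately before the token span it is most correlated with. The correlation function here is a toy keyword heuristic; in a real system the score would come from the model or a learned metric, so this is an assumption for illustration.
    def build_source_sequence(scene_label, tokens, correlation):
        # place the scene label right before the most correlated token
        best = max(range(len(tokens)), key=lambda i: correlation(scene_label, tokens[i]))
        return tokens[:best] + [scene_label] + tokens[best:]

    menu_line = ["BATTERED", "WHITING", "16.0"]

    def toy_correlation(label, token):
        # dish-name tokens are alphabetic, prices are not; a stand-in for a real score
        return 1.0 if token.isalpha() else 0.0

    print(build_source_sequence("Restaurant", menu_line, toy_correlation))
    # ['Restaurant', 'BATTERED', 'WHITING', '16.0']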
  • The above-mentioned method further includes: the decoder selects words corresponding to the text content information in the target language based on the scene feature information, and generates the scene-perception-fused translation result.
  • the decoder selects words corresponding to the dish name in the target language based on the feature information of the dining scene to form a translation, and finally can obtain an accurate dish name translation result.
  • The above-mentioned method further includes: generating the scene label based on the scene state data is implemented by a classifier, and the encoder and the decoder are implemented by a neural network model.
  • The classifier performs the classification calculation on the scene state data through a classification algorithm, and the classification algorithm includes any one of a gradient boosting tree classification algorithm, a support vector machine algorithm, a logistic regression algorithm, and an Iterative Dichotomiser 3 (ID3) decision tree algorithm.
  • The neural network model includes a neural machine translation model based on the Transformer network.
  • the scene state data is classified and counted by the classifier in the mobile phone to determine the scene in which the mobile phone is located, and then the scene label is generated.
  • the mobile phone integrates scene tags to complete the translation of scene-based short texts through the trained NMT-Transformer translation model.
  • An embodiment of the present application provides a readable medium on which instructions are stored, and the instructions, when executed on an electronic device, cause the electronic device to execute the above-mentioned fusion scene-aware machine translation method.
  • Embodiments of the present application provide an electronic device, including: a memory for storing instructions executed by one or more processors of the electronic device, and a processor, which is one of the processors of the electronic device, configured to perform the fusion scene-aware machine translation method described above.
  • FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application
  • FIG. 2 is a schematic diagram of an example of an incorrect translation by a current translation device when translating a scene-based short text
  • FIG. 3 is a schematic diagram of the steps of the fusion scene-aware machine translation method of the present application.
  • FIG. 4 is a schematic diagram of a data conversion process in the fusion scene-aware machine translation method of the application
  • FIG. 5 is a schematic diagram of a process of embedding a scene tag in a text to be translated in an encoding process of the present application
  • FIG. 6 is a schematic diagram of interface comparison of a scene-based short text translation result according to the present application.
  • FIG. 7 is a schematic diagram of interface comparison of another scene-based short text translation result of the application.
  • FIG. 8 is a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application.
  • FIG. 9 is a block diagram of a software structure of a mobile phone 100 according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application.
  • the scene includes electronic device 100 and electronic device 200 , wherein electronic device 100 and electronic device 200 are connected through a network and perform data information exchange, and electronic device 100 or electronic device 200 has a machine translation function.
  • The user uses the electronic device 100 to take pictures, shoot a short video, or directly input the text to be translated for translation.
  • For pictures, videos, or speech, the image-to-text conversion function or the speech recognition function converts the input into the text data to be translated, which is then translated.
  • the electronic device 200 can be used to train an NMT-Transformer to enable it to fuse scene labels for translation encoding and translation decoding, and the electronic device 200 can also be used to train a classifier to enable it to generate scene labels based on scene-aware data.
  • The NMT-Transformer and the classifier trained by the electronic device 200 can be ported to the electronic device 100 for use.
  • the electronic device 100 can perform translation through its own translation function, or perform data interaction with the electronic device 200 by opening a locally installed translation software or opening an online translated webpage to complete the translation of the text to be translated.
  • The electronic device 100 is a terminal device that interacts with a user, on which application software or an application system capable of executing Neural Machine Translation (NMT) based on the Transformer network (NMT-Transformer) is installed.
  • a man-machine dialogue system may also be installed on the electronic device 100 to recognize a user's voice command requesting to perform a translation function, and the electronic device 100 may have a function of recognizing text on pictures or videos to recognize the pictures or videos as text data for further translation.
  • NMT trains a neural network that can map from one sequence to another, and its output can be a variable-length sequence; it also performs well in dialogue and text summarization.
  • NMT is essentially an encoder-decoder system: the encoder encodes the source language sequence and extracts the information in the source language, and the decoder then converts this information into another language, thereby completing the translation.
  • the current mainstream machine translation method is based on NMT-Transformer, and its core network architecture is a Transformer network, which means that the encoder and decoder functions in NMT-Transformer are implemented through the Transformer network.
  • The core of the Transformer network is the self-attention layer, which computes self-attention over the vector space. Self-attention can be understood as a degree of correlation: the stronger the relation between two vectors, the larger their self-attention, while the self-attention between two unrelated vectors is small or zero. A standard formulation is sketched below.
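  • The sketch below shows the standard scaled dot-product self-attention computation (general Transformer background, not specific to this application), to make the "degree of correlation between vectors" concrete.
    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise correlation scores between tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
        return weights @ V                                # each output mixes the most related tokens

    d = 4
    X = np.random.randn(3, d)                             # 3 tokens of dimension d
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)            # (3, 4)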
  • The current NMT-Transformer translation method has the problem of inaccurate translation due to the lack of context when translating scene-based short texts.
  • For example, a current translation device may incorrectly translate "battered whiting" as "battered cod", which bears little relation to the real dish name (the correct translation result should be: fried cod). Therefore, in the scene-based short text translation scenario, existing machine translation methods have low translation accuracy, resulting in poor user experience.
  • Scene-based short texts include, but are not limited to, dish names in menus, store names in shopping malls, special terms on immigration cards, and the like.
  • The current traditional solution is to add additional information of other dimensions in the decoding stage or post-processing stage of machine translation as a substitute for the missing context, so as to improve the translation accuracy of short texts. In essence, however, this is a secondary correction of the translation result: because the added information is relatively simple and does not match the text to be translated well, the translation accuracy of short texts remains low. Moreover, adding information in this way introduces a large amount of noise when decoding and extracting information, and the added information cannot be used directly in the Transformer network, so this approach is not suitable for the current mainstream NMT-Transformer.
  • the present application provides a fusion scene-aware machine translation method.
  • In this method, a scene label is generated based on the scene perception data collected by the electronic device (for example, the noise data of the scene, the image data of the text to be translated in the scene, and so on), and the generated scene label and the text to be translated are then used together as the source language sequence, which is fusion-encoded by the NMT-Transformer encoder to extract the information it contains.
  • In the decoding stage, the information in the source language sequence is converted into the target language, and decoding yields a translation result that conforms to the scene of the text to be translated.
  • the present application greatly improves the translation accuracy of scene-based short texts by incorporating scene tags into the text to be translated.
  • The electronic device 100 or the electronic device 200 to which the fusion scene-aware machine translation method of the present application is applied can generate a corresponding scene label in real time based on the scene perception data collected by the electronic device 100, embed the scene label as part of the source language sequence together with the text to be translated for translation encoding and translation decoding, and finally obtain a short-text translation result that conforms to the scene where the electronic device 100 is located.
  • The solution of the present application can fully integrate scene perception data, so that the generated scene labels participate in the encoding stage of machine translation throughout the process.
  • The scene labels fused by this solution are more diversified and more accurate, and the judgment of the scene label is not affected by the absence of any single kind of scene perception data.
  • When a certain kind of scene perception data is missing, other scene perception data can be used in its place in time to generate an accurate scene label. Therefore, the scene characteristic information contained in the scene labels fused by this solution is of higher quality, and the embedding of scene characteristic information is more flexible.
  • This also avoids the problem of large noise in the extracted information, and the user experience is improved accordingly.
  • The electronic device 100 includes, but is not limited to, laptop computers, tablet computers, mobile phones, wearable devices, head-mounted displays, servers, mobile email devices, portable game consoles, portable music players, reader devices, televisions with one or more processors embedded or coupled therein, or other electronic devices capable of accessing a network.
  • the electronic device 100 can collect scene perception data through its own sensors, a global positioning system (Global Positioning System, GPS), a camera, etc., and the electronic device 100 can also be used to train a classifier so that it can generate scene labels based on the scene perception data.
  • The electronic device 200 includes, but is not limited to, cloud platforms, servers, laptop computers, desktop computers, tablet computers, and other electronic devices capable of accessing a network, with one or more processors embedded or coupled therein.
  • the technical solutions of the present application are described in detail below by taking the electronic device 100 as a mobile phone and the electronic device 200 as a server as an example.
  • the fusion scene-aware machine translation method of the present application includes the following steps:
  • Step 301: The mobile phone 100 obtains the text to be translated and scene perception data, and obtains scene state data based on the scene perception data.
  • The text to be translated can be acquired by directly inputting it through the input interface of the mobile phone 100, or by taking pictures or videos with the mobile phone 100.
  • The text to be translated can also be text data obtained by recognizing and converting the user's voice command, which is not limited here.
  • the mobile phone 100 extracts the text information in the photograph or the image captured from the video through its own image recognition system, and converts it into text to be translated.
  • The mobile phone 100 obtains the voice command issued by the user; for example, the user can send the voice command to the mobile phone 100 by waking up the voice assistant, and the mobile phone 100 recognizes the text information in the user's voice command through its own man-machine dialogue system and converts it into the text to be translated.
  • The scene perception data can be acquired as data such as images and sounds collected by various detection elements of the mobile phone 100, such as a camera, a microphone, an infrared sensor, or a depth sensor.
  • FIG. 4 is a schematic diagram of a data transformation process in the fusion scene-aware machine translation method of the present application.
  • The mobile phone 100 can acquire scene perception data through the microphone, the gyroscope, the acceleration sensor, GPS, and computer vision (Computer Vision, CV).
  • Health status data collected by detection sensors (for example, heart rate data collected by a PPG sensor, blood oxygen data collected by a blood oxygen sensor, and so on) or pedometer data collected through a wristband or watch can also serve as one kind of scene perception data, which is not limited here.
  • The location information may be acquired through the above-mentioned GPS, or through a Wi-Fi signal, which is not limited here.
  • the mobile phone 100 can analyze and obtain scene state data based on the collected scene perception data.
  • one or more judgment rules for scene state data may be preset in the mobile phone 100 .
  • the mobile phone 100 may determine whether the mobile phone 100 is indoors or outdoors according to the noise type or noise level. For example, the mobile phone 100 may set the noise with a noise level in the range of 2 to 4 as the range of indoor noise, and set the range of 4 to 6 as the range of outdoor noise.
  • If the collected noise level falls within the indoor range, the corresponding scene state data obtained by the analysis of the mobile phone 100 is indoor noise.
  • If the collected noise level falls within the outdoor range, the corresponding scene state data obtained by the analysis is outdoor noise.
  • The mobile phone 100 can set construction noise (such as noise generated by road construction equipment) and traffic noise (such as car horns, engine sounds, tire friction sounds, and so on) as outdoor noise, and set the sounds played in public venues such as shopping malls, airports, and station platforms (such as announcement playback, music playback, and so on) as indoor noise; some living noises (such as the sound of playing mahjong and noise from other entertainment venues) can also be set as indoor noise.
  • The above noise types are mainly identified based on the different frequencies and voiceprints of different sounds.
  • The mobile phone 100 may also simply use the frequency range of the sound as the basis for judging the noise type, which is not limited herein. Therefore, by analyzing the noise type or noise level of the sound collected by the microphone, the mobile phone 100 can obtain scene state data indicating whether it is in an indoor scene or an outdoor scene; an illustrative rule set is sketched below.
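  • An illustrative rule set following the example ranges above (indoor roughly level 2 to 4, outdoor roughly level 4 to 6) and the noise-type assignments; real thresholds would be calibrated on the device, so these values are assumptions.
    OUTDOOR_TYPES = {"construction", "traffic"}
    INDOOR_TYPES = {"announcement", "music", "mahjong"}

    def classify_environment(noise_level=None, noise_type=None):
        if noise_type in OUTDOOR_TYPES:
            return "outdoor"
        if noise_type in INDOOR_TYPES:
            return "indoor"
        if noise_level is not None:
            if 2 <= noise_level < 4:
                return "indoor"
            if 4 <= noise_level <= 6:
                return "outdoor"
        return "unknown"

    print(classify_environment(noise_level=3.1))          # indoor
    print(classify_environment(noise_type="traffic"))     # outdoor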
  • The mobile phone 100 can determine whether the location of the current scene is a shopping mall or a station based on the GPS location data and online map data; the mobile phone 100 can also photograph the target to be translated (text or picture, for example a menu) through the CV together with the GPS location data.
  • The scene state data obtained from such analysis may be, for example, a restaurant, or a restaurant in a shopping mall.
  • The mobile phone 100 can also analyze scene state data such as running or riding status, pedometer data, and motion trajectories based on its own gyroscope and acceleration sensor, position data measured by GPS, and heart rate data measured by a wearable device (such as a watch) connected to the mobile phone 100.
  • The mobile phone 100 may also analyze which vehicle the user is currently taking based on the images collected by the CV.
  • For example, if the images collected by the CV show subway seats and the subway station announcement screen, or bus seats and the route information icons posted in a bus, the mobile phone 100 can determine that the user is in a subway-riding or bus-riding scene.
  • the mobile phone 100 can call other scene perception data to replace the missing scene perception data to determine the scene state data. For example, when the GPS of the mobile phone 100 is not turned on, the mobile phone 100 cannot collect location data, and the mobile phone 100 can obtain scene state data by analyzing the sound collected by the microphone and the environmental characteristics collected by the infrared sensor.
  • The scene state data can be obtained by analyzing one or more kinds of scene perception data. For a simple, easily distinguished scene, the mobile phone 100 can determine its scene state data from less scene perception data; for example, for a station or airport scene, the mobile phone 100 may only need the location information collected by GPS, or the type of sound collected by the microphone in the station, to determine the basic scene state. For a scene that is relatively complicated and difficult to distinguish, the mobile phone 100 may need to integrate multiple kinds of perception data to comprehensively judge the basic scene state, which is not limited here.
  • The scene perception data obtained by the mobile phone 100 may include scene feature image data (for example, image data collected by the CV), location data (for example, angular motion data collected by the mobile phone's gyroscope and position data collected by the mobile phone's GPS), sound data (for example, sound data collected through the microphone), and so on.
  • When the text to be translated is manually input, the camera of the mobile phone 100 does not need to be turned on and is therefore not turned on, so the mobile phone 100 may not collect image data through the CV. In this case, the scene perception data obtained by the mobile phone 100 may include location data (for example, angular motion data measured by the gyroscope and position data collected through GPS), sound data (for example, sound data collected through the microphone), motion data (for example, heart rate data collected through a smart watch or smart bracelet and acceleration data collected through the acceleration sensor), environmental data (for example, ambient temperature data collected through a temperature sensor and ambient light intensity data collected through an ambient light sensor), and so on. It can be understood that, in some scenarios, although there is no shooting interface on the screen of the mobile phone 100, the camera of the mobile phone 100 can still work in the background to collect CV signals.
  • In other scenarios, the scene perception data obtained by the mobile phone 100 may include location data (for example, angular motion data measured by the gyroscope and position data collected through GPS), the user's movement state data (for example, heart rate data and blood oxygen data collected through a smart watch or smart bracelet), environmental data (for example, ambient temperature data collected through a temperature sensor and ambient light intensity data collected through an ambient light sensor), and so on.
  • The device that acquires the scene perception data and the device that determines the basic scene state may be the same electronic device (for example, the mobile phone 100 may both collect the scene perception data and directly analyze the scene state data), or may be different electronic devices.
  • For example, the scene perception data collected by the mobile phone 100 can be sent to the server 200 for further analysis to obtain the scene state data; or the scene perception data can be collected through smart wearable devices such as watches and wristbands and sent to the mobile phone 100 for further analysis to obtain the scene state data, which is not limited here.
  • Step 302: The mobile phone 100 generates a scene label based on the obtained scene state data.
  • the mobile phone 100 classifies and labels the scene state data obtained by the above analysis, and different scene state data may correspond to the same scene label. Therefore, it can be understood that the correspondence between the scene state data and the scene label is a many-to-one or one-to-one correspondence.
  • the generation of the scene label by the mobile phone 100 based on the scene state data may be accomplished through a pre-trained scene classifier.
  • the mobile phone 100 can perform classification training by inputting the scene state data into a Gradient Boosting Decision Tree (GBDT) classifier, and label the scene state data classified in the same or similar categories with the same scene label.
  • The scene state data can be sample scene state data specially collected for training the classifier, or scene state data analyzed during actual machine translation use; the scene state data can be accumulated over time to form a scene state database, and the corresponding scene labels can likewise be accumulated over time to form a scene tag library. Since the storage space occupied by the classifier algorithm is relatively small, the classifier can be trained on an electronic device such as the mobile phone 100, or the training can be completed on the server 200, which is not limited here.
  • the above GBDT classifier is a classifier that applies the GBDT algorithm.
  • The GBDT algorithm is one of the best algorithms for fitting the real distribution among traditional machine learning algorithms; it can be used for both classification and regression, and can also be used to filter features.
  • the principle of the GBDT algorithm is to generate a weak classifier through multiple rounds of iterations, and each classifier is trained on the basis of the residual of the previous round of classifiers.
  • The weak classifiers are generally required to be simple enough, with low variance and high bias, because the training process continuously improves the accuracy of the final classifier by reducing the bias. The standard boosting update is summarized below for reference.
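  • The following is the standard gradient-boosting update (general background, not taken from this application): each round m fits a weak learner h_m to the residuals of the previous model and adds it with a learning rate \nu.
    % gradient boosting update; in the squared-error case the negative gradient equals the residual
    F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad r_{i,m} = y_i - F_{m-1}(x_i)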
  • the decision tree used by the GBDT algorithm is the CART regression tree.
  • The GBDT classifier can classify the scene state data, and the scene state data falling into the same or similar categories can then be manually labeled with the same scene label through the mobile phone 100 or the server 200; after training with a large number of scene state samples, the classifier can output the corresponding scene label.
  • A number of such scene labels form the scene tag database.
  • For example, the scene state data that the mobile phone 100 obtains from GPS data may be a well-known restaurant or a shopping mall; the scene state data obtained from the noise type of the sound collected by the microphone may be indoor noise; and analysis of the photo of the target item taken by the CV and the collected photos of the surrounding environment may yield the scene state data that the target is a menu.
  • Combining these, it can be determined that the current scene is a menu-translation scene while dining in a restaurant, so "restaurant" can be marked as the scene label. A minimal training sketch for such a classifier follows.
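  • A minimal sketch of training such a scene-label classifier on scene-state features with gradient boosting; the feature encoding, sample values, and labels are illustrative assumptions, not data from this application.
    from sklearn.ensemble import GradientBoostingClassifier

    # each row encodes scene state data, e.g. [indoor(0/1), near_restaurant(0/1), menu_in_image(0/1)]
    X = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
    y = ["restaurant", "restaurant", "street", "airport"]

    clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
    clf.fit(X, y)
    print(clf.predict([[1, 1, 1]]))   # -> ['restaurant']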
  • the mobile phone 100 or the server 200 can also train other classification algorithm models so that they can generate scene labels corresponding to the scene state data.
  • Other classification algorithms include, but are not limited to, the support vector machine (SVM) algorithm, the logistic regression (LR) algorithm, the Iterative Dichotomiser 3 (ID3) decision tree algorithm, and so on, which are not limited here.
  • Step 303: The mobile phone 100 encodes the above scene label and the text to be translated together as a source language sequence through the encoder in the NMT-Transformer, and extracts the scene information and to-be-translated text information in the source language sequence.
  • The scene label and the text to be translated are input into the encoder of the NMT-Transformer (the encoding layer in the Transformer network) for encoding. The encoding layer in the Transformer network is implemented by a multi-layer self-attention network, in which the attention vectors output by each self-attention layer serve as the input of the next self-attention layer, as in the compact sketch below.
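  • A compact stand-in for such an encoder stack, using PyTorch's built-in Transformer encoder layers; the layer sizes and token counts are illustrative assumptions.
    import torch
    import torch.nn as nn

    d_model, n_layers = 64, 4
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=n_layers,   # each layer's attention output becomes the next layer's input
    )

    # source sequence = scene-label tokens + to-be-translated tokens, already embedded
    src = torch.randn(1, 7, d_model)   # batch of 1, 7 tokens (e.g. "Restaurant" + one menu line)
    memory = encoder(src)              # fused scene + text representation passed to the decoder
    print(memory.shape)                # torch.Size([1, 7, 64])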
  • FIG. 5 shows a schematic diagram of the process of embedding scene tags in the text to be translated in an encoding process.
  • X5, X6, X7, and X8 represent the words that constitute the scene label, and X9, X10, X11, and X12 represent the words that constitute the text to be translated.
  • For example, the text to be translated is the text data obtained by converting a photographed menu, and the scene label is "restaurant".
  • the dish name in the menu has a higher correlation with the scene label, while the price of the dish has a low or no correlation with the scene label.
  • the input of the scene label should be placed before the input of the dish name text, so that the scene label and the dish name text are closer.
  • Each line of text on the menu is of the form dish name + price + description, so the scene label "restaurant" can be inserted before each line of menu text as it is entered into the Transformer network.
  • Thus the source language sequence input to the Transformer network is: Restaurant BATTERED WHITING 16.0M.
  • Because "Restaurant" is close to "BATTERED WHITING", it attends to it more strongly and has a greater impact on its translation result.
  • As a result, BATTERED WHITING is translated as fried cod instead of the earlier wrong translation "battered cod", while the distance between "Restaurant" and "16.0M" is larger, so the scene label has little influence on the translation of the price.
  • Step 304: The mobile phone 100 uses the decoder in the NMT-Transformer to decode, word by word, the scene information and to-be-translated text information extracted from the scene label and the text to be translated in the encoding phase into the translated text expressed in the target language, and outputs the translation result.
  • the decoder (the decoding layer in the Transformer network) selects the target language for decoding based on the scene information extracted by the encoder during the encoding process to obtain the translation.
  • the decoding layer in the Transformer network is also implemented by a multi-layer self-attention network.
  • Since the Transformer network has already extracted the scene information in the embedded scene label while extracting the to-be-translated text information during the encoding stage of the NMT-Transformer, in the decoding stage the Transformer network can directly select target-language words based on that scene information for decoding, so as to obtain a translation that better matches the scene. A continuation of the encoder sketch above, showing the decoder side, is given below.
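  • Continuing the PyTorch sketch above, the decoder attends over the fused scene and text encoding ("memory") while generating target-language tokens; the shapes and sizes remain illustrative assumptions.
    import torch
    import torch.nn as nn

    d_model = 64
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=4,
    )
    memory = torch.randn(1, 7, d_model)   # encoder output (scene label + text to be translated)
    tgt = torch.randn(1, 3, d_model)      # embeddings of the target-language tokens generated so far
    out = decoder(tgt, memory)            # each decoding step can attend to the scene-label positions
    print(out.shape)                      # torch.Size([1, 3, 64])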
  • FIG. 6 is a schematic diagram of interface comparison of a scene-based short text translation result according to the present application.
  • Figure 6(a) shows the translation result interface of a traditional translation device.
  • There, the final decoded translation of the dish name Fisherman's Basket is the literal "fisherman's basket", which is obviously a wrong translation result.
  • Figure 6(b) shows the translation result interface of a translation device using the fusion scene-aware machine translation method of the present application: based on the scene information extracted from the scene label (restaurant), the translation obtained by decoding the dish Fisherman's Basket is "seafood platter", which is the correct translation result.
  • FIG. 7 is a schematic diagram of interface comparison of another scene-based short text translation result according to the present application. As shown in Figure 7, it displays the translation result of an entry card filled in during an entry-exit scenario. The text to be translated shown in Figure 7(a) is the source language of the entry card (customs declaration card); Figure 7(b) shows the translation result of a traditional translation device; and Figure 7(c) shows the translation result of a translation device that applies the fusion scene-aware machine translation method of the present application.
  • S1: Acquire the text to be translated and scene perception data, and obtain scene state data based on the scene perception data.
  • By turning on the camera of the mobile phone 100 to photograph the page of the entry-exit card, the mobile phone 100 extracts the text to be translated on the photographed card through its own image recognition system.
  • The mobile phone 100 collects position data through GPS, sound data through the microphone, and environmental feature image data through the CV as scene perception data, and obtains the corresponding scene state data from the collected data. For example, the location data collected by GPS determines whether the location or a nearby geographic marker is an airport or a customs office; the sound data collected by the microphone determines whether the environment is indoor or outdoor; and the environmental feature images collected by the CV determine whether there is a registration window, registration form, or the like in the environment. If the use of the CV is restricted in some entry-exit scenarios, image data need not be collected through the CV as scene perception data, and the scene state data can be obtained from other scene perception data.
  • the quantity and type of scene perception data collected by the mobile phone 100 are not limited herein. Specific reference is made to the foregoing step 301 and related descriptions, which will not be repeated here.
  • the mobile phone 100 uses the encoder in the NMT-Transformer to encode the above-generated entry-exit scene label and the text to be translated together as a source language sequence, and extracts entry-exit scene information and to-be-translated text information in the source language sequence.
  • The mobile phone 100 uses the decoder in the NMT-Transformer to decode, word by word, the entry-exit scene information and the to-be-translated text information extracted from the entry-exit scene label and the text to be translated in the encoding phase into the translated text expressed in the target language, and outputs the translation result, as shown in Figure 7(c).
  • the specific decoding process refer to the above-mentioned step 304 and related descriptions, which will not be repeated here.
  • The mobile phone 100 can have built in an application that applies the fusion scene-aware machine translation method of the present application to achieve accurate translation of scene-based short texts; alternatively, application software can be installed on the mobile phone 100 that interacts with the server 200, sending the text to be translated to the server 200, which completes the translation based on the fusion scene-aware machine translation method and feeds the translation result back to the mobile phone 100, which is not limited here.
  • FIG. 8 shows a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application.
  • The mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identity module (SIM) card interface 195, and so on.
  • The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the mobile phone 100 .
  • The mobile phone 100 may include more or fewer components than shown, some components may be combined or separated, or the components may be arranged differently.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • The processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on. Different processing units may be independent devices or may be integrated into one or more processors.
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • The memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thereby improves system efficiency.
  • The mobile phone 100 can train the scene classifier and the encoder and decoder of the NMT-Transformer through the processor 110, and during actual scene-based short text translation, the processor 110 processes the acquired scene perception data and the text to be translated and executes the fusion scene-aware machine translation method described in steps 301 to 304 above.
  • the processor 110 may include one or more interfaces.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the mobile phone 100 .
  • the mobile phone 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the mobile phone 100 . While the charging management module 140 charges the battery 142 , it can also supply power to the electronic device through the power management module 141 .
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140 to supply power to the processor 110 , the internal memory 121 , the display screen 194 , the camera 193 , and the wireless communication module 160 .
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • the mobile phone 100 implements communication and data transmission with the server 200 through the above-mentioned wireless communication function.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the mobile phone 100 .
  • The wireless communication module 160 can provide wireless communication solutions applied on the mobile phone 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and other wireless communication solutions.
  • the antenna 1 of the mobile phone 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the mobile phone 100 can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, and the like.
  • The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
  • the mobile phone 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • The image or text collected by the mobile phone 100 for the scene-based short text to be translated is displayed on the display screen 194, and the translation result produced by the mobile phone 100 is also displayed on the display screen 194 as feedback to the user.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the mobile phone 100 may include one or N display screens 194 , where N is a positive integer greater than one.
  • the SIM card interface 195 is used to connect a SIM card.
  • the mobile phone 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the collection of the CV signal by the mobile phone 100 can also be realized by the above-mentioned shooting function, that is, the image of the scene or the image of the text to be translated is collected by the above-mentioned shooting function.
  • Camera 193 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the mobile phone 100 may include one or N cameras 193 , where N is a positive integer greater than one.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 100 .
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area can store data (such as audio data, phone book, etc.) created during the use of the mobile phone 100 and the like.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the mobile phone 100 by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
  • the processor 110 executes the fusion scene-aware machine translation method of the present application by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
  • the mobile phone 100 can implement audio functions, such as music playback and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, and an application processor.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals.
  • the user can make a sound with the mouth close to the microphone 170C, thereby inputting the sound signal into the microphone 170C.
  • the mobile phone 100 may be provided with at least one microphone 170C.
  • the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals.
  • the mobile phone 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the microphone 170C can collect sound signals and determine the noise level or noise type of the collected sound signals, so as to further analyze the scene state data, such as whether it is indoors or outdoors.
  • the earphone jack 170D is used to connect wired earphones.
  • the gyroscope sensor 180B can be used to determine the motion attitude of the mobile phone 100 .
  • the angular velocity of cell phone 100 about three axes may be determined by gyro sensor 180B.
  • the gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyroscope sensor 180B detects the shaking angle of the mobile phone 100, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to offset the shaking of the mobile phone 100 through reverse motion to realize anti-shake.
  • the gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the mobile phone 100 in various directions (generally along three axes). When the mobile phone 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to recognize the posture of the mobile phone 100, and can be used in applications such as switching between landscape and portrait modes, pedometers, etc. During the implementation of the present application, certain scene state data, such as the user's walking, running, and riding states, can be obtained by analyzing the shaking state data measured by the gyro sensor 180B and the acceleration data measured by the acceleration sensor 180E.
  • the cell phone 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the mobile phone 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the mobile phone 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the mobile phone 100 is in the pocket, so as to prevent accidental touch.
  • the scene state data may be analyzed based on the ambient light brightness sensed by the ambient light sensor 180L, for example, to determine whether the current scene is indoor or outdoor.
  • the keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
  • the cell phone 100 can receive key input and generate key signal input related to user settings and function control of the cell phone 100 .
  • the software system of the mobile phone 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present invention take an Android system with a layered architecture as an example to illustrate the software structure of the mobile phone 100.
  • FIG. 9 is a block diagram of a software structure of a mobile phone 100 according to an embodiment of the present invention.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication function of the mobile phone 100 .
  • for example, the management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the mobile phone 100 vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • surface manager surface manager
  • media library Media Libraries
  • 3D graphics processing library eg: OpenGL ES
  • 2D graphics engine eg: SGL
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to realize 3D graphics drawing, image rendering, compositing and layer processing, etc.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • the workflow of the software and hardware of the mobile phone 100 is exemplarily described below with reference to the menu translation scenario.
  • the kernel layer processes touch operations into original input events (including opening translation software, or opening the camera 193 and other operations).
  • Raw input events are stored at the kernel layer.
  • the application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking an example in which the touch operation is a touch click operation and the control corresponding to the click operation is the control of the camera application icon, the camera application calls the interface of the application framework layer to start the camera application, and then starts the camera driver by calling the kernel layer.
  • the camera 193 captures a still image or video of the menu to be translated.
  • the present disclosure also relates to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored on a computer readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memory (ROM), random access memory (RAM), EPROM, EEPROM, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, and each may be coupled to a computer system bus.
  • the computers referred to in the specification may include a single processor or may be architectures employing multiple processors for increased computing power.

Abstract

A fusion scene perception machine translation method, a storage medium, and an electronic device, which relate to the technical field of neural network machine translation. The fusion scene perception machine translation method comprises: obtaining text to be translated and scene perception data; determining, according to the scene perception data, the scene in which an electronic device is located; generating a scene tag corresponding to the scene in which the electronic device is located; jointly using the scene tag and said text as a source language sequence and inputting same into an encoder that is used for translation, performing encoding, and obtaining encoded data of fusion scene perception; and by means of a decoder used for translation, decoding the encoded data and converting same into a target language, and obtaining a translation result for fusion scene perception. By means of the electronic device, scene perception data is collected and a scene tag is generated, then fusion, translation, encoding, and decoding are performed on the scene tag and text to be translated, and a translation result that is in line with the scene of said text is obtained, thus greatly improving the translation accuracy of scene short text.

Description

Fusion scene-aware machine translation method, storage medium and electronic device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 10, 2020, with application number 202011079936.X and the title "Fusion scene-aware machine translation method, storage medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of neural network machine translation, and in particular, to a fusion scene-aware machine translation method, a storage medium, and an electronic device.
Background
From the early dictionary matching, to rule-based translation combining dictionaries with the knowledge of linguistics experts, to corpus-based statistical machine translation, and with the improvement of computer computing power and the explosive growth of data, translation based on deep neural networks, that is, Neural Machine Translation (NMT), has become more and more widely used. The development of NMT consists of two stages: the first stage (2014-2017) is NMT based on the Recurrent Neural Network (RNN), whose core network architecture is an RNN; the second stage (2017-present) is NMT based on the Transformer neural network (hereinafter referred to as NMT-Transformer), whose core network architecture is a Transformer model. In the field of machine translation, the RNN has gradually been replaced by the Transformer.
However, the current mainstream translation devices or products based on NMT-Transformer face the same problem: the translation accuracy is low when dealing with scene-based short texts. This is mainly because a scene-based short text is usually composed of only a few words or characters whose meaning is closely related to the scene in which it appears, yet the scene-based short text lacks key contextual information, which leads to inaccurate NMT-Transformer translations.
Summary of the Invention
The embodiments of the present application provide a fusion scene-aware machine translation method, a storage medium, and an electronic device. A scene label is generated based on the scene-awareness data collected by the electronic device, and the generated scene label and the text to be translated are used together as a source language sequence, which is fusion-encoded in the encoding stage of the Transformer network so that the information in the source language sequence is extracted. Finally, in the decoding stage of the Transformer network, the information in the source language sequence is converted into the target language and decoded to obtain a translation result that conforms to the scene of the text to be translated, which greatly improves the translation accuracy of scene-based short texts.
In a first aspect, an embodiment of the present application provides a fusion scene-aware machine translation method for an electronic device with a machine translation function, including: acquiring text to be translated and scene-awareness data, where the scene-awareness data is collected by the electronic device and is used to determine the scene where the electronic device is located; determining, according to the scene-awareness data, the scene where the electronic device is located; generating, based on the scene where the electronic device is located, a scene label corresponding to the scene; inputting the scene label and the text to be translated together as a source language sequence into an encoder used for translation for encoding, to obtain fusion scene-aware encoded data; and decoding the fusion scene-aware encoded data and converting it into the target language through a decoder used for translation, to obtain a fusion scene-aware translation result.
For example, a mobile phone with a machine translation function can collect scene-awareness data, and the collected scene-awareness data is used to determine the scene where the mobile phone is currently located, so that a scene label corresponding to the current scene can be generated. When the mobile phone is used for translation, it can fuse the generated scene label into the encoding and decoding of machine translation, thereby obtaining a fusion scene-aware translation result.
In a possible implementation of the above first aspect, the method further includes: determining, according to the scene-awareness data, the characteristics of the scene where the electronic device is located, so as to obtain scene state data, where the scene state data is used to represent the scene where the electronic device is located; and performing classification statistics on the scene state data to determine the scene where the electronic device is located.
For example, a mobile phone with a machine translation function can determine the characteristics of the current scene based on the collected scene-awareness data, and record the determined scene characteristics as scene state information. By performing classification statistics on the scene state data obtained above, scene types each containing one or more pieces of scene state information can be obtained, and each scene type corresponds to a scene where the mobile phone is located.
In a possible implementation of the above first aspect, the method further includes: the scene-awareness data is collected by a detection element provided in the electronic device, and the detection element includes at least one of a GPS element, a camera, a microphone, and a sensor. The scene-awareness data includes one or more of position data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
For example, the mobile phone can continuously collect its current position data through its own GPS element, collect surrounding sound data through its own microphone, collect the temperature data and light intensity data of the current scene environment through its own temperature sensor and ambient light sensor, and collect the current angular motion data of the mobile phone through the gyroscope. The mobile phone can also collect image data of characteristic objects in the surrounding environment and capture image data containing the text to be translated through the camera. Some of the scene-awareness data collected by the mobile phone may be invalid for further use, but most of the scene-awareness data is valid when subsequently determining the scene where the mobile phone is located and can be used to determine that scene.
In a possible implementation of the above first aspect, determining the characteristics of the scene where the electronic device is located according to the scene-awareness data includes one or more of the following situations: determining the location name of the scene according to the position data; determining characteristic text or characteristic objects in the scene according to one or more of the text and the target object in the image data, and determining the environmental characteristics of the scene; determining the noise type or noise level in the scene according to one or more of the frequency, voiceprint, and amplitude in the sound data, and determining whether the scene is indoor or outdoor; determining the motion state of the electronic device in the scene according to the acceleration data and the angular motion data; and determining the temperature level and light intensity level of the scene according to the ambient temperature data and the ambient light intensity data, and determining whether the scene is indoor or outdoor.
For example, the mobile phone can determine the location name of the scene, such as a certain shopping mall or airport, through the position data. The mobile phone can also determine the characteristic objects in the scene through the image data collected by the camera; for example, if the collected image data shows subway seats and subway station information, it can be determined that the scene is a subway-riding scene, and if the text to be translated is subway station information, the mobile phone can fuse the subway-riding scene to translate the scene-based short text. The mobile phone can also determine whether it is currently indoors or outdoors through the collected sound data, and some sound data can also be used to preliminarily judge the scene; for example, collected construction noise indicates that the phone is currently outdoors, while the collected sound of mahjong tiles colliding indicates that the phone is currently indoors and possibly in a mahjong-playing scene. The acceleration data collected by the acceleration sensor and the angular motion data collected by the gyroscope of the mobile phone can be used to determine its current motion state; for example, the acceleration data in a subway-riding scene differs from that in a bus-riding scene, and which means of transport is being taken can be determined according to the acceleration data. The mobile phone can also collect ambient temperature data or ambient light intensity data through the ambient temperature sensor and the ambient light sensor to determine whether the scene is indoor or outdoor; generally, the indoor temperature is lower than the outdoor temperature in summer, the indoor temperature is higher than the outdoor temperature in winter, the indoor light intensity is lower than the outdoor light intensity during the day, and the indoor light intensity is higher than the outdoor light intensity at night when the lights are turned on.
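For illustration only, a preset judgment rule of this kind could be sketched as follows (the speed and acceleration-variance thresholds, as well as the state names, are assumptions introduced for this sketch rather than values disclosed in the present application):

```python
def motion_state(mean_speed_kmh: float, accel_variance: float) -> str:
    """A rough rule mapping sensor statistics to a motion/transport state."""
    if mean_speed_kmh < 1 and accel_variance < 0.1:
        return "stationary"
    if mean_speed_kmh < 7:
        return "walking"
    if mean_speed_kmh < 25:
        return "riding"        # e.g. cycling
    # Vehicles: a subway tends to show smoother acceleration than a bus in traffic.
    return "subway" if accel_variance < 0.5 else "bus"

print(motion_state(0.2, 0.05))  # stationary
print(motion_state(35.0, 0.3))  # subway
```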
In a possible implementation of the above first aspect, the method further includes: determining a user motion state according to the scene-awareness data, where the user motion state is used to determine the characteristics of the scene; and the scene-awareness data includes one or more of heart rate data and blood oxygen data.
For example, the mobile phone can obtain the user's heart rate data, blood oxygen data, and the like by connecting to a wearable device, for example, collecting the user's heart rate data or blood oxygen data through a smart watch or wristband to determine whether the user is exercising; when the user exercises or the amount of exercise increases, the heart rate increases and the blood oxygen level also changes considerably. The scene where the mobile phone is located can be further determined from the user's exercise data; for example, the heart rate data changes greatly when the user is working out in a gym, so when a translation service is needed in the gym, the mobile phone can determine the fitness scene based on the change in the heart rate data, so as to realize fusion scene-aware translation of scene-based short texts. The user's current altitude can also be determined through heart rate data or blood oxygen data; for example, while the user is mountain climbing, the user's current scene can be determined through altitude and position data, so that the mobile phone can realize fusion scene-aware translation of scene-based short texts.
In a possible implementation of the above first aspect, the method further includes: the order in which the scene label and the text to be translated are input into the encoder is determined based on the correlation between the text content in the text to be translated and the scene label; the greater the correlation between the text content in the text to be translated and the scene label, the closer the input distance between that text content and the scene label.
The fusion scene-aware encoded data includes the scene characteristic information in the scene label and the text content information in the text to be translated that are extracted by the encoder during encoding, and the encoder extracts the scene characteristic information and the text content information in the order in which the scene label and the text to be translated are input into the encoder.
For example, in a dining scene, since the scene label is more relevant to the dish names, the scene label "restaurant" can be placed before the dish names in the menu text to be translated when it is input into the encoder; the scene characteristic information extracted by the encoder during encoding is then closer to the dish names, so the scene characteristic information has a greater influence on the translation of the dish names.
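As an illustrative sketch of how such a fused source sequence could be constructed (the label string, the segment list, and the relevance scores below are assumptions for illustration, not data from the present disclosure):

```python
def build_source_sequence(scene_label, segments, relevance):
    """Place the scene label directly before the segment it is most relevant to.

    scene_label: e.g. "restaurant", generated from the scene-awareness data.
    segments:    pieces of the text to be translated, e.g. lines of a menu.
    relevance:   assumed precomputed relevance score of each segment to the label.
    """
    if not segments:
        return [scene_label]
    # Index of the segment most relevant to the scene label.
    best = max(range(len(segments)), key=lambda i: relevance.get(segments[i], 0.0))
    # Insert the label immediately before that segment, so that during encoding
    # the extracted scene characteristic information sits close to it.
    return segments[:best] + [scene_label] + segments[best:]

fused = build_source_sequence(
    "restaurant",
    ["Menu", "battered whiting", "soft drinks"],
    {"Menu": 0.1, "battered whiting": 0.9, "soft drinks": 0.4},
)
print(fused)  # ['Menu', 'restaurant', 'battered whiting', 'soft drinks']
```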
In a possible implementation of the above first aspect, the method further includes: the decoder selects, based on the scene characteristic information, words corresponding to the text content information in the target language, and generates the fusion scene-aware translation result.
For example, in a dining scene, the decoder selects, based on the dining scene characteristic information, words corresponding to the dish name in the target language to form the translation, and an accurate translation of the dish name can finally be obtained.
In a possible implementation of the above first aspect, the method further includes: generating the scene label based on the scene state data is implemented by a classifier, and the encoder and the decoder are implemented by a neural network model. The classifier implements the classification calculation of the scene state data through a classification algorithm, and the classification algorithm includes any one of a gradient boosting tree classification algorithm, a support vector machine algorithm, a logistic regression algorithm, and an iterative dichotomiser 3 (ID3) decision tree algorithm; the neural network model includes a neural network machine translation model based on the Transformer network.
For example, a classifier in the mobile phone performs classification statistics on the scene state data to determine the scene where the mobile phone is located, and then generates the scene label. When the mobile phone is used for translation, the mobile phone fuses the scene label and completes the translation of the scene-based short text through the trained NMT-Transformer translation model.
In a second aspect, an embodiment of the present application provides a readable medium on which instructions are stored, and the instructions, when executed on an interactive device, cause the electronic device to execute the above fusion scene-aware machine translation method.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device, and a processor, which is one of the processors of the electronic device, for executing the above fusion scene-aware machine translation method.
Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application;
FIG. 2 is a schematic diagram of an example of incorrect translation by a current translation device when translating a scene-based short text;
FIG. 3 is a schematic diagram of the steps of the fusion scene-aware machine translation method of the present application;
FIG. 4 is a schematic diagram of the data conversion process in the fusion scene-aware machine translation method of the present application;
FIG. 5 is a schematic diagram of a process of embedding a scene label into the text to be translated in an encoding process of the present application;
FIG. 6 is a schematic diagram of an interface comparison of a scene-based short text translation result of the present application;
FIG. 7 is a schematic diagram of an interface comparison of another scene-based short text translation result of the present application;
FIG. 8 is a schematic structural diagram of a mobile phone 100 according to an embodiment of the present application;
FIG. 9 is a block diagram of a software structure of a mobile phone 100 according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the embodiments of the present application are further described in detail below with reference to the accompanying drawings and embodiments.
FIG. 1 is a schematic diagram of an application scenario of the fusion scene-aware machine translation method of the present application.
As shown in FIG. 1, the scenario includes an electronic device 100 and an electronic device 200, where the electronic device 100 and the electronic device 200 are connected through a network and exchange data and information, and the electronic device 100 or the electronic device 200 has a machine translation function. The user performs translation by taking photos or short videos with the electronic device 100 or by directly inputting the text to be translated. The photos or short videos taken by the electronic device 100, as well as the voice instructions input by the user to the electronic device 100, need to be converted into text data to be translated through the image-to-text conversion function or the speech recognition function of the electronic device 100 before translation.
The electronic device 200 can be used to train the NMT-Transformer so that it can fuse scene labels for translation encoding and translation decoding, and the electronic device 200 can also be used to train the classifier so that it can generate scene labels based on scene-awareness data. The NMT-Transformer and the classifier trained by the electronic device 200 can be transplanted to the electronic device 100 for use.
The electronic device 100 can perform translation through its own translation function, or exchange data with the electronic device 200 by opening locally installed translation software or an online translation webpage, so as to complete the translation of the above text to be translated.
The electronic device 100 is a terminal device that interacts with the user, on which application software or an application system capable of executing the NMT-Transformer based on neural machine translation (NMT) is installed. A human-machine dialogue system may also be installed on the electronic device 100 to recognize the user's voice instruction requesting the translation function to be performed, and the electronic device 100 may have a function of recognizing text in pictures or videos so as to convert the pictures or videos into text data for further translation.
Neural Machine Translation (NMT) is a machine translation method proposed in recent years. Compared with traditional Statistical Machine Translation (SMT), NMT can train a neural network that maps one sequence to another, and its output can be a variable-length sequence, which achieves very good performance in translation, dialogue, and text summarization. NMT is essentially an encoder-decoder system: the encoder encodes the source language sequence and extracts the information in the source language, and the decoder then converts this information into another language, namely the target language, thereby completing the translation. The current mainstream machine translation method is based on NMT-Transformer, whose core network architecture is a Transformer network; that is, the encoder and decoder functions in NMT-Transformer are implemented through the Transformer network. The core of the Transformer network is the self-attention layer, which computes self-attention in the vector space. Self-attention can be understood as a degree of correlation: the self-attention between two highly correlated vectors is large, while the self-attention between two weakly correlated or uncorrelated vectors is small or zero.
As mentioned above, the current NMT-Transformer translation method suffers, when translating scene-based short texts, from inaccurate translation caused by the lack of context. For example, as shown in FIG. 2, in a restaurant dining scene, a current translation device may incorrectly translate "battered whiting" into "遭受重创的鳕鱼" (literally, "heavily battered cod"), which is quite unrelated to the real dish (the correct translation should be "炸鳕鱼", fried cod). Therefore, in scene-based short text translation scenarios, the existing machine translation methods have low translation accuracy, resulting in a poor user experience. Scene-based short texts include, but are not limited to, dish names on a menu, shop names in a shopping mall, special terms on an immigration card, and the like. To solve this technical problem, the current traditional solution is to add extra information of other dimensions in the decoding stage or post-processing stage of machine translation to replace the contextual information of the text, so as to improve the translation accuracy of short texts. However, this solution is essentially a secondary correction of the translation result; since the added extra information is relatively limited and does not match the text to be translated with high accuracy, the translation accuracy of short texts remains relatively low. Moreover, this way of adding extra information leads to large noise when the information is extracted during decoding, so it cannot be used directly in the Transformer network and is therefore not suitable for the current mainstream NMT-Transformer.
To solve the above technical problems, the present application provides a fusion scene-aware machine translation method. On the basis of NMT-Transformer, a scene label is generated based on the scene-awareness data collected by the electronic device (for example, the position data of the scene, the noise data of the scene, and the picture data to be translated in the scene collected by the electronic device). The generated scene label and the text to be translated are then used together as a source language sequence that is fusion-encoded in the encoding stage of the Transformer network, where the information in the source language sequence is extracted; finally, in the decoding stage of the Transformer network, the information in the source language sequence is converted into the target language, and a translation result that conforms to the scene of the text to be translated is obtained by decoding. By fusing the scene label into the text to be translated, the present application greatly improves the translation accuracy of scene-based short texts.
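A rough end-to-end sketch of this pipeline is given below; the helper callables passed in (sensor collection, scene classification, encoder, decoder) are toy stand-ins introduced for illustration and are not part of the disclosed implementation:

```python
from typing import Callable, Dict, List

def translate_with_scene(
    text: str,
    collect_scene_state: Callable[[], Dict[str, str]],
    classify_scene: Callable[[Dict[str, str]], str],
    encode: Callable[[List[str]], List[float]],
    decode: Callable[[List[float]], str],
) -> str:
    """Illustrative pipeline: scene state -> scene label -> fused encoding -> decoding."""
    scene_state = collect_scene_state()              # e.g. {"location": "mall", "noise": "indoor"}
    scene_label = classify_scene(scene_state)        # e.g. "restaurant", from the trained classifier
    source_sequence = [scene_label] + text.split()   # the scene label joins the source language sequence
    encoded = encode(source_sequence)                # fusion scene-aware encoded data
    return decode(encoded)                           # fusion scene-aware translation result

# Toy stubs standing in for the sensors, the classifier, and the NMT-Transformer.
result = translate_with_scene(
    "battered whiting",
    collect_scene_state=lambda: {"location": "mall", "noise": "indoor"},
    classify_scene=lambda state: "restaurant",
    encode=lambda seq: [float(len(token)) for token in seq],
    decode=lambda enc: "placeholder scene-aware translation",
)
print(result)
```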
The electronic device 100 or the electronic device 200 to which the fusion scene-aware machine translation method of the present application is applied can generate a corresponding scene label based on the scene-awareness data collected by the electronic device 100 in real time, embed the scene label, as part of the source language, into the text to be translated for translation encoding and translation decoding during the translation process, and finally obtain a short text translation result that conforms to the scene where the electronic device 100 is located.
Compared with the traditional way of improving the accuracy of short text translation by adding extra information in the decoding stage of machine translation to replace contextual information, the solution of the present application can comprehensively fuse scene-awareness data so that the generated scene label participates in both the encoding stage and the decoding stage of machine translation throughout the process, and can therefore stably improve the translation accuracy. The scene label fused in this solution is more diversified than the extra information added in the prior art; the accuracy of the scene label is higher and is not affected by the absence of a certain type of scene-awareness data (for example, data missing because a certain sensing element on the electronic device 100 is not turned on), which would otherwise affect the judgment of the scene label. In some embodiments, if a certain type of scene-awareness data is missing, other scene-awareness data can be supplemented in time as a replacement to generate an accurate scene label. Therefore, the scene characteristic information contained in the scene label fused in this solution is of higher quality, and the embedding of the scene characteristic information is more flexible. The solution of the present application also solves the problem that adding extra information directly in the decoding stage leads to large noise in the information extracted by decoding, and correspondingly improves the user experience.
It can be understood that, in the present application, the electronic device 100 includes, but is not limited to, a laptop computer, a tablet computer, a mobile phone, a wearable device, a head-mounted display, a server, a mobile e-mail device, a portable game console, a portable music player, a reader device, a television in which one or more processors are embedded or coupled, or another terminal electronic device capable of accessing a network. The electronic device 100 can collect scene-awareness data through its own sensors, Global Positioning System (GPS), camera, and the like, and the electronic device 100 can also be used to train the classifier so that it can generate scene labels based on the scene-awareness data.
It can be understood that the electronic device 200 includes, but is not limited to, a cloud, a server, a laptop computer, a desktop computer, a tablet computer, and other electronic devices capable of accessing a network in which one or more processors are embedded or coupled.
For ease of description, the technical solutions of the present application are described in detail below by taking the electronic device 100 being a mobile phone and the electronic device 200 being a server as an example.
The specific flow of the solution of the present application is described in detail below with reference to FIG. 3. As shown in FIG. 3, the fusion scene-aware machine translation method of the present application includes the following steps:
301: The mobile phone 100 acquires the text to be translated and scene-awareness data, and obtains scene state data based on the scene-awareness data.
The text to be translated may be acquired by directly inputting it through the input interface of the mobile phone 100, or by taking a photo or video with the mobile phone 100; the text to be translated may also be text data obtained by recognizing and converting a user's voice instruction, which is not limited here.
For example, after the mobile phone 100 takes a photo or video, the mobile phone 100 extracts, through its own image recognition system, the text information in the photo or in an image captured from the video, and converts it into the text to be translated.
The mobile phone 100 acquires a voice instruction issued by the user; for example, the user can send a voice instruction to the mobile phone 100 by waking up the voice assistant, and the mobile phone 100 recognizes the text information in the user's voice instruction through its own human-machine dialogue system and converts it into the text to be translated.
The scene-awareness data may be acquired as images, sounds, and other data collected by various detection elements of the mobile phone 100, such as the camera, the microphone, an infrared sensor, or a depth sensor.
FIG. 4 is a schematic diagram of the data conversion process in the fusion scene-aware machine translation method of the present application. As shown in FIG. 4, the mobile phone 100 can acquire scene-awareness data through the microphone, the gyroscope, the acceleration sensor, GPS, and computer vision (CV). The mobile phone 100 can also collect health status data through health detection sensors on a smart wearable device such as a wristband or watch (for example, heart rate data collected by a PPG sensor, blood oxygen data collected by a blood oxygen detection sensor, etc.), or collect step-counting data through a wristband or watch, as one kind of scene-awareness data, which is not limited here.
In addition, there may be one or more ways to acquire the same kind of scene-awareness data; for example, position information may be acquired through the above GPS, or through a method of obtaining positioning information from Wi-Fi signals, which is not limited here.
Further, as shown in FIG. 4, the mobile phone 100 can analyze the collected scene-awareness data to obtain scene state data. In the prior art, one or more judgment rules for scene state data may be preset in the mobile phone 100.
As an example of the judgment rules, the mobile phone 100 can determine whether it is indoors or outdoors according to the noise type or noise level. For example, the mobile phone 100 can set noise with a noise level in the range of level 2 to level 4 as the indoor noise range, and set the range of level 4 to level 6 as the outdoor noise range. When the sound picked up by the microphone of the mobile phone 100 is identified as level-2 noise, the corresponding scene state data obtained by the analysis of the mobile phone 100 is indoor noise; when the sound picked up by the microphone of the mobile phone 100 is identified as level-5 noise, the corresponding scene state data obtained by the analysis is outdoor noise. The mobile phone 100 can set construction noise (for example, noise produced by road construction machinery) and traffic noise (for example, car horns, car engine sounds, the friction between tires and the road, etc.) as outdoor noise, set sounds played in public places such as shopping malls, airports, and station platforms (for example, announcements, music, etc.) as indoor noise, and can also set some daily-life noise (for example, the sound of playing mahjong and noise from other entertainment venues) as indoor noise. The above noise types are identified mainly based on the different frequencies and voiceprints of different sounds; in some embodiments, the mobile phone 100 may also simply use the frequency range of the sound as the basis for judging the noise type, which is not limited here. Therefore, the mobile phone 100 can analyze the noise type or noise level of the sound (noise) collected by the microphone to obtain the scene state data, namely whether the scene is an indoor scene or an outdoor scene.
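A minimal sketch of such a preset rule, assuming the level thresholds given above (the function name and the handling of the boundary at level 4 are illustrative assumptions):

```python
def noise_scene_state(noise_level: int) -> str:
    """Map a measured noise level to a scene state, following the example thresholds above."""
    if 2 <= noise_level < 4:
        return "indoor noise"
    if 4 <= noise_level <= 6:
        return "outdoor noise"
    return "unknown"

print(noise_scene_state(2))  # indoor noise
print(noise_scene_state(5))  # outdoor noise
```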
As an example of the judgment rules, the mobile phone 100 can determine, according to GPS position information and online map data, whether the location of the current scene is a shopping mall, a station, or the like; the mobile phone 100 can also analyze, based on the GPS position data and the target to be translated photographed through CV (text, pictures, etc., for example a menu), that the scene state data corresponds to a certain restaurant or a restaurant in a certain shopping mall.
In other embodiments, as an example of the judgment rules, the mobile phone 100 can also analyze scene state data such as the walking/running/riding state, step-counting data, and motion trajectory based on its own gyroscope and acceleration sensor, the position data measured by GPS, and the heart rate data measured by a wearable device connected to the mobile phone 100, such as a watch.
In other embodiments, as an example of the judgment rules, the mobile phone 100 can also analyze, based on the images collected through CV, which means of transport the user is currently taking; for example, if the images collected through CV show seats on a subway, the display interface of a subway station announcement screen, bus seats, or station information diagrams posted inside a bus, the mobile phone 100 can determine that the scene where the user is located is a subway-riding or bus-riding scene.
It can be understood that, when analyzing the scene state data, if a certain type of scene-awareness data is missing, the mobile phone 100 can call other scene-awareness data to replace the missing scene-awareness data in order to determine the scene state data. For example, when the GPS of the mobile phone 100 is not turned on, the mobile phone 100 cannot collect position data, and the mobile phone 100 can then obtain scene state data by analyzing data such as the sound collected by the microphone and the environmental characteristics collected by the infrared sensor.
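One way such a fallback could look in code (a sketch under the assumption that each analyzer returns None when its sensor data is unavailable; the analyzer names and ordering are illustrative only):

```python
from typing import Callable, List, Optional

def scene_state_with_fallback(analyzers: List[Callable[[], Optional[str]]]) -> str:
    """Try each scene-awareness analyzer in order and use the first one that yields data."""
    for analyze in analyzers:
        state = analyze()
        if state is not None:   # this source produced usable scene state data
            return state
    return "unknown"

# Toy usage: GPS is off (returns None), so the microphone-based analysis decides.
state = scene_state_with_fallback([
    lambda: None,               # GPS not turned on, no position data
    lambda: "indoor noise",     # microphone-based noise analysis
])
print(state)  # indoor noise
```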
场景状态数据可以基于一种或多种场景感知数据分析得到,一般如果是简单易区分的场景,手机100基于较少的场景感知数据就可以确定其场景状态数据,例如,对于车站或机场场景,手机100可能只需要融合GPS采集的位置信息,或麦克风采集到的车站内的声音类型等就可以判断基础场景状态。如果是有些比较复杂难以辨别的场景,手机100可能需要融合多种感知数据综合判断基础场景状态,在此不做限制。The scene state data can be obtained by analyzing one or more kinds of scene perception data. Generally, if it is a simple and easily distinguishable scene, the mobile phone 100 can determine its scene state data based on less scene perception data. For example, for a station or an airport scene, The mobile phone 100 may only need to integrate the location information collected by the GPS, or the type of sound in the station collected by the microphone, etc., to determine the state of the basic scene. If it is a scene that is relatively complicated and difficult to distinguish, the mobile phone 100 may need to integrate various sensing data to comprehensively judge the state of the basic scene, which is not limited here.
It can be understood that the different ways of acquiring the text to be translated described above may correspond to different scene perception data and different ways of acquiring that data, which are not limited here.
For example, when the mobile phone 100 obtains the text to be translated by taking a photo or a video, the scene perception data acquired by the mobile phone 100 may include scene feature image data (for example, image data collected through computer vision (CV)), location data (for example, angular motion data collected by the gyroscope of the mobile phone, location data collected by the GPS of the mobile phone), sound data (for example, sound data collected by the microphone), and so on.
When the text to be translated is input manually, the camera of the mobile phone 100 does not need to be turned on and therefore remains off, and the mobile phone 100 may not collect image data through CV. In this case, the scene perception data acquired by the mobile phone 100 may include location data (for example, angular motion data collected by the gyroscope, location data collected by GPS), sound data (for example, sound data collected by the microphone), motion data (for example, heart rate data collected by a smart watch or smart band, acceleration data collected by an acceleration sensor), environmental data (for example, ambient temperature data collected by a temperature sensor, ambient light intensity data collected by an ambient light sensor), and so on. It can be understood that, in some scenarios, even though no shooting interface is shown on the screen of the mobile phone 100, the camera of the mobile phone 100 may still work in the background to collect CV signals.
When the text to be translated is acquired through a voice command, in order to prevent mutual interference between the sound data and the acquisition of scene perception data, the camera of the mobile phone 100 does not need to be turned on, and the mobile phone 100 may not collect image data through CV. In this case, the scene perception data acquired by the mobile phone 100 may include location data (for example, angular motion data measured by the gyroscope, location data collected by GPS), motion state data of the user (for example, heart rate data and blood oxygen data collected by a smart watch, smart band, or the like), environmental data (for example, ambient temperature data collected by a temperature sensor, ambient light intensity data collected by an ambient light sensor), and so on. It can be understood that, in some scenarios, even though no shooting interface is shown on the screen of the mobile phone 100, the camera of the mobile phone 100 may still work in the background to collect CV signals.
It can be understood that the device that acquires the scene perception data and the device that determines the basic scene state may be the same electronic device (for example, the mobile phone 100 may both collect the scene perception data and directly analyze it to obtain the scene state data), or may be different electronic devices. For example, the scene perception data collected by the mobile phone 100 may be sent to the server 200 for further analysis to obtain the scene state data; or the scene perception data may be collected by a smart wearable device such as a watch or a wristband and sent to the mobile phone 100 for further analysis to obtain the scene state data, which is not limited here.
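By way of illustration of the above fusion logic, the following minimal sketch shows one possible way of deriving scene state data from whichever scene perception data is available, with a fallback when one source such as GPS is missing; the sensor inputs, thresholds, and the SceneState structure are assumptions made for illustration only and are not limited here.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SceneState:
    # Scene state data characterizing the scene the device is in.
    place_type: Optional[str]   # e.g. "restaurant", "airport", None if unknown
    indoor: Optional[bool]      # indoor/outdoor judgement
    noise_level: Optional[str]  # e.g. "quiet", "noisy"

def analyze_scene_state(gps_place=None, mic_noise_db=None, cv_features=None):
    """Derive scene state data from whichever scene perception data is available.

    gps_place:    a place category resolved from GPS coordinates, or None if GPS is off
    mic_noise_db: ambient sound level measured by the microphone, or None
    cv_features:  labels detected in camera images (e.g. {"menu", "dining table"}), or None
    """
    place_type = gps_place
    # Fallback: if GPS is unavailable, infer the place type from image features instead.
    if place_type is None and cv_features:
        if {"menu", "dining table"} & cv_features:
            place_type = "restaurant"
        elif {"departure board", "luggage"} & cv_features:
            place_type = "airport"

    indoor = None
    noise_level = None
    if mic_noise_db is not None:
        noise_level = "noisy" if mic_noise_db > 60 else "quiet"
        indoor = mic_noise_db < 75  # crude heuristic for illustration only

    return SceneState(place_type=place_type, indoor=indoor, noise_level=noise_level)

# Example: GPS is off, but camera and microphone data still identify a restaurant scene.
state = analyze_scene_state(gps_place=None, mic_noise_db=55,
                            cv_features={"menu", "dining table"})
print(state)  # SceneState(place_type='restaurant', indoor=True, noise_level='quiet')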
302: The mobile phone 100 generates a scene label based on the obtained scene state data.
Specifically, the mobile phone 100 classifies the scene state data obtained by the above analysis and annotates it with a scene label; different scene state data may correspond to the same scene label. It can therefore be understood that the correspondence between scene state data and scene labels is many-to-one or one-to-one.
Generating the scene label from the scene state data on the mobile phone 100 may be accomplished by a pre-trained scene classifier. For example, the mobile phone 100 may input the scene state data into a Gradient Boosting Decision Tree (GBDT) classifier for classification training, and annotate scene state data that falls into the same or similar categories with the same scene label. It can be understood that the scene state data may be sample scene state data collected specifically for training the classifier, or scene state data obtained through analysis during actual machine translation use; the scene state data may accumulate over time to form a scene state database, and the corresponding scene labels may likewise accumulate over time to form a scene label library. Since the classifier algorithm occupies relatively little storage space, the classifier may be trained on an electronic device such as the mobile phone 100, or the training may be completed on the server 200, which is not limited here.
The above GBDT classifier is a classifier that applies the GBDT algorithm. Among traditional machine learning algorithms, the GBDT algorithm is one of the best at fitting the true distribution; it can be used for both classification and regression, and can also be used to select features. The principle of the GBDT algorithm is that, through multiple rounds of iteration, each round produces a weak classifier, and each classifier is trained on the residual of the classifiers from the previous round. The weak classifiers are generally required to be sufficiently simple, with low variance and high bias, because the training process continuously improves the accuracy of the final classifier by reducing the bias. The decision tree used by the GBDT algorithm is the CART regression tree. During training, the GBDT classifier can classify the scene state data, and the mobile phone 100 or the server 200 can then manually annotate the scene state data falling into the same or similar categories with the same scene label; the scene labels obtained by training on a large number of scene state samples form a scene label database. For example, the scene state data that the mobile phone 100 obtains from GPS data may be a well-known restaurant or a shopping mall, the scene state data obtained from the type of noise in the sound collected by the microphone may be indoor noise, and the photos of the target item and of the surrounding environment captured through CV may be analyzed to show that the scene state data is a menu. Combining the above scene state data, it can be determined that the current scene is a menu translation scene while dining in a restaurant, so "餐馆" or "restaurant" can be annotated as the scene label.
It can be understood that the mobile phone 100 or the server 200 may also train other classification algorithm models to generate scene labels corresponding to the scene state data. Other classification algorithms include, but are not limited to, the Support Vector Machine (SVM) algorithm, the Logistic Regression (LR) algorithm, the Iterative Dichotomiser 3 (ID3) decision tree algorithm, and the like, which are not limited here.
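As a concrete illustration of such a classifier, the following minimal sketch assumes that scene state data has already been encoded as fixed-length numeric feature vectors and uses scikit-learn's GradientBoostingClassifier as one possible GBDT implementation; the feature encoding and the training samples are hypothetical.

from sklearn.ensemble import GradientBoostingClassifier

# Each sample is a feature vector derived from scene state data, e.g.
# [place_type_id, indoor(0/1), noise_level_id, menu_detected(0/1)].
# The feature encoding is assumed for illustration only.
X_train = [
    [1, 1, 2, 1],   # near a restaurant, indoor, moderate noise, menu detected
    [1, 1, 1, 1],
    [2, 1, 3, 0],   # near an airport, indoor, loud noise, no menu
    [2, 0, 3, 0],
    [3, 1, 1, 0],   # near a customs hall / immigration counter
    [3, 1, 2, 0],
]
y_train = ["restaurant", "restaurant", "airport", "airport", "immigration", "immigration"]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)

# At translation time, the current scene state data is encoded the same way
# and the predicted class is used as the scene label.
scene_label = clf.predict([[1, 1, 2, 1]])[0]
print(scene_label)  # "restaurant"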
303: The mobile phone 100 encodes the above scene label together with the text to be translated as a source language sequence through the encoder in the NMT-Transformer, and extracts the scene information and the to-be-translated text information from the source language sequence.
Specifically, the scene label and the text to be translated are input into the encoder of the NMT-Transformer (the encoding layers of the Transformer network) for encoding. The encoding layers of the Transformer network are implemented by multiple layers of self-attention, where the attention vectors output by each self-attention layer serve as the input of the next self-attention layer.
In the process of inputting the scene label and the text to be translated into the encoder for encoding, two aspects need to be considered when deciding how to embed the scene label into the text to be translated: first, which kind of content in the text to be translated is most relevant to the content of the scene label; second, the position at which the scene label is embedded in the text to be translated should be closer to the text content with higher relevance.
FIG. 5 is a schematic diagram of embedding a scene label into the text to be translated during encoding. As shown in FIG. 5, in a menu translation scene for example, X5, X6, X7, and X8 denote the words that make up the scene label, and X9, X10, X11, and X12 denote the words that make up the text to be translated. Here, the text to be translated is text data converted from the photographed menu, and the scene label is 餐馆 or restaurant. The dish names on the menu are highly relevant to the scene label, while the dish prices have low or no relevance to it. Therefore, when the scene label and the text to be translated are input into the Transformer network, the scene label should be input before the dish-name text so that the scene label is closer to the dish names. For example, if each line of the menu consists of dish name + price + dish description, the scene label "restaurant" can be input before each line of menu text is fed into the Transformer network. It can be understood that, for text to be translated that is closer to the scene label, the Transformer network extracts more scene information from the scene label while encoding and extracting the text information, that is, the scene label pays more attention to that text, represented in FIG. 5 by the strong-correlation curves; conversely, for text to be translated that is farther from the scene label, the Transformer network extracts less scene information from the scene label while encoding and extracting the text information, that is, the scene label pays less attention to that text, represented by the weak-correlation curves.
For example, for the menu text BATTERED WHITING 16.0M|19.0NM, with Restaurant input before it, the source language sequence input into the Transformer network is: Restaurant BATTERED WHITING 16.0M|19.0NM. The distance between Restaurant and BATTERED WHITING is smaller, so Restaurant pays more attention to BATTERED WHITING and has a greater influence on its translation result; under the influence of Restaurant, BATTERED WHITING is translated as 炸鳕鱼 (a fried fish dish) rather than the earlier erroneous literal translation 遭受重创的鳕鱼 (a cod that has been badly battered). The distance between Restaurant and 16.0M|19.0NM is larger, the attention is small, the influence on that part of the translation result is also small, and its translation remains unchanged.
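The construction of such a source language sequence can be sketched as follows; the helper function and line format are assumptions for illustration, and in practice the embedding position may be chosen according to the relevance considerations discussed above.

def build_source_sequences(text_lines, scene_label="Restaurant"):
    """Prepend the scene label to each line of to-be-translated text so that the
    label sits immediately before the content it is most relevant to (here, the
    dish name at the start of each menu line)."""
    return [f"{scene_label} {line}" for line in text_lines]

menu_lines = ["BATTERED WHITING 16.0M|19.0NM"]
print(build_source_sequences(menu_lines))
# ['Restaurant BATTERED WHITING 16.0M|19.0NM']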
304: Through the decoder in the NMT-Transformer, the mobile phone 100 decodes, word by word, the scene information and the to-be-translated text information extracted from the scene label and the text to be translated during the encoding stage into a translation expressed in the target language, and outputs the translation result.
Specifically, the decoder (the decoding layers of the Transformer network) selects the target language for decoding based on the scene information extracted by the encoder during encoding, to obtain the translation. The decoding layers of the Transformer network are likewise implemented by multiple layers of self-attention.
It can be understood that, since in the encoding stage of the NMT-Transformer the Transformer network has already extracted the scene information in the embedded scene label while extracting the to-be-translated text information, in the decoding stage the Transformer network can select the target language for decoding directly based on the scene information in the scene label, thereby obtaining a translation that better matches the scene.
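For readers who want to see the general shape of the encode-decode computation, the following minimal PyTorch sketch runs a generic Transformer encoder-decoder over a fused source sequence (scene label tokens followed by to-be-translated text tokens); it is an illustrative assumption rather than the NMT-Transformer actually used, and it omits tokenization, positional encoding details, masking, training, and beam search.

import torch
import torch.nn as nn

VOCAB_SIZE = 32000   # assumed shared source/target vocabulary size
D_MODEL = 512

class TinyNMT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (src_len, batch) token ids of "scene label + text to be translated"
        # tgt_ids: (tgt_len, batch) token ids of the target-language prefix so far
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        dec_out = self.transformer(src, tgt)   # self-attention encoder and decoder stacks
        return self.out(dec_out)               # (tgt_len, batch, vocab) logits

model = TinyNMT()
# Toy example: 6 source tokens (scene label + menu text), 3 target tokens decoded so far.
src_ids = torch.randint(0, VOCAB_SIZE, (6, 1))
tgt_ids = torch.randint(0, VOCAB_SIZE, (3, 1))
logits = model(src_ids, tgt_ids)
next_token = logits[-1, 0].argmax()   # greedy choice of the next target-language word
print(logits.shape, int(next_token))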
FIG. 6 is a schematic comparison of interfaces showing the translation results of a scene-based short text according to the present application. As shown in FIG. 6(a), which is the translation result interface of a conventional translation apparatus or device, the translation finally decoded for the dish name Fisherman's Basket is 渔夫的篮子 (a fisherman's basket), which is obviously a wrong translation result. As shown in FIG. 6(b), which is the translation result interface of a translation device applying the fusion scene-aware machine translation method of the present application, based on the scene information extracted from the scene label (restaurant), the translation obtained by decoding the dish name Fisherman's Basket is 海鲜拼盘 (seafood platter), and the translation result is correct.
Another implementation scenario is described below with reference to FIG. 7.
FIG. 7 is a schematic comparison of interfaces showing the translation results of another scene-based short text according to the present application. As shown in FIG. 7, the translation results of an arrival card filled in at an entry-exit (immigration) scene are displayed. The text to be translated shown in FIG. 7(a) is the source-language text of the arrival card (customs declaration card); FIG. 7(b) shows the translation result of a conventional translation device; and FIG. 7(c) shows the translation result of a translation device applying the fusion scene-aware machine translation method of the present application.
With reference to FIG. 3 and its related description, the process by which the fusion scene-aware machine translation method of the present application produces the translation result shown in FIG. 7(c) includes the following steps:
S1: Acquire the text to be translated and the scene perception data, and obtain the scene state data based on the scene perception data.
Specifically, on the one hand, the camera of the mobile phone 100 is turned on to photograph the arrival card page, and the mobile phone 100 extracts the text to be translated from the photographed arrival card through its own image recognition system.
On the other hand, the mobile phone 100 collects location data through GPS, sound data through the microphone, and environmental feature image data through CV as the scene perception data. The scene state data is then obtained from the collected scene perception data: for example, the location data collected by GPS is used to judge whether the current location or a nearby calibrated geographic marker is an airport, a customs office, or the like; the sound data collected by the microphone is used to judge whether the environment is indoor or outdoor; and the environmental feature image data collected through CV is used to judge whether the environment contains a registration window, registration forms, and so on. If the use of CV is restricted in some entry-exit scenes, image data need not be collected through CV as scene perception data, and the scene state data can be obtained from the other scene perception data. The quantity and types of scene perception data collected by the mobile phone 100 are not limited here. For details, refer to step 301 and the related description above, which are not repeated here.
S2: Generate the scene label, entry-exit, based on the scene state data obtained in S1. The classifier already trained in the mobile phone 100 can quickly generate the entry-exit scene label directly from the scene state data obtained above. For details, refer to step 302 and the related description above, which are not repeated here.
S3: The mobile phone 100 encodes the generated entry-exit scene label together with the text to be translated as a source language sequence through the encoder in the NMT-Transformer, and extracts the entry-exit scene information and the to-be-translated text information from the source language sequence. For the specific encoding process, refer to step 303 and the related description above, which are not repeated here.
S4: Through the decoder in the NMT-Transformer, the mobile phone 100 decodes, word by word, the entry-exit scene information and the to-be-translated text information extracted during encoding from the entry-exit scene label and the text to be translated into a translation expressed in the target language, and outputs the translation result, as shown in FIG. 7(c). For the specific decoding process, refer to step 304 and the related description above, which are not repeated here.
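Combining S1 to S4, a high-level sketch of the whole pipeline could look like the following; it reuses the illustrative helpers from the earlier sketches, and encode_scene_state and nmt_model.translate are hypothetical names standing in for the feature-encoding and translation steps rather than actual interfaces of this application.

def translate_with_scene(text_lines, gps_place, mic_noise_db, cv_features,
                         scene_classifier, nmt_model):
    # S1: derive scene state data from whatever scene perception data is available.
    state = analyze_scene_state(gps_place, mic_noise_db, cv_features)

    # S2: map the scene state data to a scene label with the pre-trained classifier.
    features = encode_scene_state(state)          # hypothetical feature encoding step
    scene_label = scene_classifier.predict([features])[0]

    # S3 + S4: encode the scene label together with each line of text and decode
    # the target-language translation with the NMT model.
    sources = build_source_sequences(text_lines, scene_label)
    return [nmt_model.translate(src) for src in sources]   # hypothetical translate()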
In the translation result shown in FIG. 7(c), some specialized expressions on the arrival card, for example "Please print in capital letters", are accurately translated as 请用大写字符填写 (please fill in using capital letters), in which "print" is correctly translated as 填写 (fill in).
By contrast, in the conventional translation result shown in FIG. 7(b), the sentence "Please print in capital letters" is translated as 请用大写字母打印 (please print out in capital letters), which is obviously wrong. Therefore, the translation result obtained after fusing the scene label (entry-exit scene) is more accurate and provides a better user experience.
As described above, in practical applications, an application program applying the fusion scene-aware machine translation method of the present application may be embedded in the mobile phone 100 to achieve accurate translation of scene-based short text. Alternatively, application software may be installed on the mobile phone 100, the text to be translated is sent to the server 200 through interaction with the server 200, and the server 200 completes the translation based on the fusion scene-aware machine translation method and feeds the translation result back to the mobile phone 100. The mobile phone 100 may also access a web-based translation engine through its own browser, send the text to be translated to the server 200 through interaction with the server 200, and receive the translation result fed back by the server 200 after the server 200 completes the translation based on the fusion scene-aware machine translation method, which is not limited here.
An exemplary structure of the electronic device 100 is given below in conjunction with the embodiments of the present application.
FIG. 8 is a schematic structural diagram of the mobile phone 100 according to an embodiment of the present application.
The mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), among others. Different processing units may be independent devices or may be integrated into one or more processors. The controller may generate operation control signals according to instruction operation codes and timing signals, and complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
In the embodiments of the present application, the mobile phone 100 may train the scene classifier and the encoder and decoder of the NMT-Transformer through the processor 110, and in the actual scene-based short text translation process, the processor 110 processes the scene perception data and the text to be translated acquired by the mobile phone 100 and executes the fusion scene-aware machine translation method described in steps 301 to 304 above. In some embodiments, the processor 110 may include one or more interfaces.
It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present invention are only schematic and do not constitute a structural limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may also adopt interface connection manners different from those in the above embodiments, or a combination of multiple interface connection manners.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of the wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through a wireless charging coil of the mobile phone 100. While charging the battery 142, the charging management module 140 may also supply power to the electronic device through the power management module 141.
The power management module 141 is configured to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, and battery health (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.
The wireless communication function of the mobile phone 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like. The mobile phone 100 communicates and transmits data with the server 200 through this wireless communication function.
The antenna 1 and the antenna 2 are configured to transmit and receive electromagnetic wave signals. Each antenna in the mobile phone 100 may be used to cover one or more communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization; for example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 may provide wireless communication solutions applied on the mobile phone 100, including 2G/3G/4G/5G. The wireless communication module 160 may provide wireless communication solutions applied on the mobile phone 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
In some embodiments, the antenna 1 of the mobile phone 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the mobile phone 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
The mobile phone 100 implements its display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric computation for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information. The images or text collected by the mobile phone 100 for the to-be-translated scene-based short text are displayed on the display screen 194, and the translation result of the text to be translated is also displayed on the display screen 194 as feedback to the user.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The mobile phone 100 may include one or N display screens 194, where N is a positive integer greater than 1.
The SIM card interface 195 is used to connect a SIM card.
The mobile phone 100 can implement its shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like. The collection of CV signals by the mobile phone 100 may also be implemented through this shooting function, that is, images of the current scene or images containing the text to be translated are collected through the shooting function.
The camera 193 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then transmits the electrical signal to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the mobile phone 100 may include one or N cameras 193, where N is a positive integer greater than 1.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capacity of the mobile phone 100. The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like. The data storage area may store data created during use of the mobile phone 100 (such as audio data and a phone book). In addition, the internal memory 121 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the mobile phone 100 by running the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor. In the embodiments of the present application, the processor 110 executes the fusion scene-aware machine translation method of the present application by running the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
The mobile phone 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.
The audio module 170 is used to convert digital audio information into an analog audio signal output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The microphone 170C, also called a "mic" or "mike", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C. The mobile phone 100 may be provided with at least one microphone 170C. In other embodiments, the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the mobile phone 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording functions, and so on. In the implementation of the present application, sound signals may be collected through the microphone 170C, and a noise level or noise type may be determined from the collected sound signals to further analyze the scene state data, for example whether the device is indoors or outdoors.
The headset jack 170D is used to connect wired headsets.
The gyroscope sensor 180B may be used to determine the motion posture of the mobile phone 100. In some embodiments, the angular velocities of the mobile phone 100 around three axes (namely, the x, y, and z axes) may be determined by the gyroscope sensor 180B. The gyroscope sensor 180B may be used for image stabilization during shooting. For example, when the shutter is pressed, the gyroscope sensor 180B detects the shaking angle of the mobile phone 100, calculates the distance that the lens module needs to compensate according to the angle, and allows the lens to counteract the shaking of the mobile phone 100 through reverse motion to achieve stabilization. The gyroscope sensor 180B may also be used for navigation and motion-sensing game scenarios.
The acceleration sensor 180E can detect the magnitude of the acceleration of the mobile phone 100 in various directions (generally along three axes). When the mobile phone 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to recognize the posture of the mobile phone and is applied to landscape/portrait switching, pedometers, and other applications. In the implementation of the present application, certain scene state data, such as the user's walking, running, or riding state, can be obtained by analyzing the shaking state data measured by the gyroscope sensor 180B and the acceleration data measured by the acceleration sensor 180E.
The distance sensor 180F is used to measure distance. The mobile phone 100 can measure distance by infrared or laser. In some embodiments, when shooting a scene, the mobile phone 100 can use the distance sensor 180F to measure distance to achieve fast focusing.
The ambient light sensor 180L is used to sense ambient light brightness. The mobile phone 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking photos, and may cooperate with the proximity light sensor 180G to detect whether the mobile phone 100 is in a pocket to prevent accidental touches. In the implementation of the present application, the scene state data may be analyzed based on the ambient light brightness sensed by the ambient light sensor 180L, for example, to judge whether the current scene is indoor or outdoor.
The keys 190 include a power key, volume keys, and the like. The keys 190 may be mechanical keys or touch keys. The mobile phone 100 can receive key input and generate key signal input related to user settings and function control of the mobile phone 100.
The software system of the mobile phone 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present invention take an Android system with a layered architecture as an example to illustrate the software structure of the mobile phone 100.
FIG. 9 is a block diagram of the software structure of the mobile phone 100 according to an embodiment of the present invention.
The layered architecture divides the software into several layers, each of which has a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 9, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Videos, and Messages.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 9, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, the phone book, and the like.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system may be used to build applications. A display interface may consist of one or more views. For example, a display interface that includes a short-message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the mobile phone 100, for example, management of call status (including connecting, hanging up, and the like).
The resource manager provides applications with various resources, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, and the like. The notification manager may also present notifications in the status bar at the top of the system in the form of charts or scroll-bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt tone is played, the mobile phone 100 vibrates, or the indicator light blinks.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications. The media libraries support playback and recording of many common audio and video formats, as well as still image files, and can support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG. The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
The workflow of the software and hardware of the mobile phone 100 is exemplarily described below with reference to the menu translation scene.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the touch operation into a raw input event (including operations such as opening the translation software or opening the camera 193). The raw input event is stored at the kernel layer. The application framework layer obtains the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking a touch click operation whose corresponding control is the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video of the menu to be translated through the camera 193.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one example implementation or technique disclosed in accordance with the present application. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application-specific integrated circuits (ASICs), or any other type of medium suitable for storing electronic instructions, each of which may be coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may adopt architectures employing multiple processors for increased computing capability.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform one or more method steps. Structures for a variety of these systems are discussed in the description below. In addition, any specific programming language sufficient to implement the techniques and implementations disclosed in the present application may be used; various programming languages may be used to implement the present disclosure, as discussed herein.
In addition, the language used in this specification has been principally selected for readability and instructional purposes and may not have been selected to delineate or limit the disclosed subject matter. Accordingly, the present disclosure is intended to illustrate, but not to limit, the scope of the concepts discussed herein.

Claims (13)

  1. A fusion scene-aware machine translation method for an electronic device having a machine translation function, wherein the method comprises:
    acquiring a text to be translated and scene perception data, wherein the scene perception data is collected by the electronic device and is used to determine a scene in which the electronic device is located;
    determining, according to the scene perception data, the scene in which the electronic device is located;
    generating, based on the scene in which the electronic device is located, a scene label corresponding to the scene;
    inputting the scene label together with the text to be translated as a source language sequence into an encoder used for translation for encoding, to obtain encoded data fused with scene perception; and
    decoding the encoded data fused with scene perception and converting it into a target language through a decoder used for translation, to obtain a translation result fused with scene perception.
  2. The method according to claim 1, wherein determining, according to the scene perception data, the scene in which the electronic device is located comprises:
    determining, according to the scene perception data, characteristics of the scene in which the electronic device is located to obtain scene state data, wherein the scene state data is used to characterize the scene in which the electronic device is located; and
    performing classification statistics on the scene state data to determine the scene in which the electronic device is located.
  3. The method according to claim 2, wherein the scene perception data is collected by a detection element provided in the electronic device, and the detection element comprises at least one of a GPS element, a camera, a microphone, and a sensor.
  4. The method according to claim 3, wherein the scene perception data comprises one or more of location data, image data, sound data, acceleration data, ambient temperature data, ambient light intensity data, and angular motion data.
  5. The method according to claim 4, wherein determining, according to the scene perception data, the characteristics of the scene in which the electronic device is located comprises one or more of the following:
    determining a location name of the scene according to the location data;
    determining characteristic text or characteristic objects in the scene according to one or more of text and target objects in the image data, and determining environmental characteristics of the scene;
    determining a noise type or noise level in the scene according to one or more of frequency, voiceprint, and amplitude in the sound data, and determining whether the scene is indoor or outdoor;
    determining a motion state of the electronic device in the scene according to the acceleration data and the angular motion data; and
    determining a temperature level and a light intensity level of the scene according to the ambient temperature data and the ambient light intensity data, and determining whether the scene is indoor or outdoor.
  6. The method according to claim 5, wherein determining, according to the scene perception data, the scene in which the electronic device is located further comprises:
    determining a user motion state according to the scene perception data, the user motion state being used to determine characteristics of the scene;
    wherein the scene perception data comprises one or more of heart rate data and blood oxygen data.
  7. The method according to claim 6, wherein the order in which the scene tag and the text to be translated are input into the encoder is determined based on the degree of correlation between text content in the text to be translated and the scene tag;
    the greater the correlation between the text content in the text to be translated and the scene tag, the closer the input distance between that text content and the scene tag.
  8. The method according to claim 7, wherein the scene-perception-fused encoded data comprises scene characteristic information in the scene tag and text content information in the text to be translated, both extracted by the encoder during encoding; and
    the encoder extracts the scene characteristic information and the text content information in the order in which the scene tag and the text to be translated are input into the encoder.
  9. The method according to claim 8, wherein the decoder selects, based on the scene characteristic information, words in the target language corresponding to the text content information, to generate the scene-perception-fused translation result.
  10. The method according to claim 9, wherein generating the scene tag based on the scene state data is implemented by a classifier, and the encoder and the decoder are implemented by a neural network model.
  11. The method according to claim 10, wherein the classifier performs classification computation on the scene state data by means of a classification algorithm, the classification algorithm comprising any one of a gradient boosting tree classification algorithm, a support vector machine algorithm, a logistic regression algorithm, and an Iterative Dichotomiser 3 (ID3) decision tree algorithm;
    the neural network model comprises a neural machine translation model based on a Transformer network.
  12. A readable medium, having instructions stored thereon which, when executed on an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 11.
  13. An electronic device, comprising:
    a memory for storing instructions to be executed by one or more processors of the electronic device; and
    a processor, being one of the processors of the electronic device, configured to perform the method according to any one of claims 1 to 11.
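For orientation, the claimed flow can be read as a small pipeline: perception data is reduced to scene state data, a classifier turns that state into a scene tag, and the tag is fed into an ordinary encoder-decoder NMT model together with the source text. The Python sketch below is only an illustrative reading of claims 1, 2, 7, 10, and 11 under stated assumptions: the sensor fields, the relevance() scorer, the translate_fn interface, and the "<restaurant>"-style tag strings are invented for readability and are not defined by the application, and gradient boosting is just one of the classifier options claim 11 enumerates.

# Illustrative sketch only; not the patented implementation.
from dataclasses import dataclass
from typing import Callable, List, Sequence

from sklearn.ensemble import GradientBoostingClassifier  # one classifier option named in claim 11


@dataclass
class ScenePerceptionData:
    # Raw data collected by the detection elements of claims 3-4 (GPS, camera, microphone, sensors).
    noise_level: float      # derived from microphone amplitude/frequency
    light_level: float      # derived from the ambient light sensor
    acceleration: float     # derived from the accelerometer
    indoors: float          # coarse indoor/outdoor estimate, 0.0 or 1.0


def to_scene_state_vector(d: ScenePerceptionData) -> List[float]:
    # Claim 2: reduce perception data to scene state data that characterizes the scene.
    return [d.noise_level, d.light_level, d.acceleration, d.indoors]


# Claims 10-11: a classifier maps scene state data to a scene tag such as "<restaurant>".
# It must be fitted offline on labelled scene state vectors before prediction.
scene_classifier = GradientBoostingClassifier()
# scene_classifier.fit(train_state_vectors, train_scene_tags)


def place_scene_tag(scene_tag: str, text_tokens: Sequence[str],
                    relevance: Callable[[str, str], float]) -> List[str]:
    # One possible reading of claim 7: keep the sentence order intact and attach the tag
    # at whichever end of the text is nearer to the most scene-relevant token.
    best = max(range(len(text_tokens)), key=lambda i: relevance(scene_tag, text_tokens[i]))
    if best < len(text_tokens) / 2:
        return [scene_tag] + list(text_tokens)
    return list(text_tokens) + [scene_tag]


def translate_with_scene(text_tokens: Sequence[str],
                         perception: ScenePerceptionData,
                         relevance: Callable[[str, str], float],
                         translate_fn: Callable[[List[str]], str]) -> str:
    # End-to-end flow of claim 1; translate_fn stands in for a Transformer encoder-decoder.
    state = to_scene_state_vector(perception)
    scene_tag = scene_classifier.predict([state])[0]
    source_sequence = place_scene_tag(scene_tag, text_tokens, relevance)
    return translate_fn(source_sequence)

In this reading the scene tag behaves like one more source token, so claim 8's ordered feature extraction and claim 9's scene-conditioned word choice are properties the encoder-decoder learns from training data rather than logic written out explicitly in such a pipeline.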
PCT/CN2021/119655 2020-10-10 2021-09-22 Fusion scene perception machine translation method, storage medium, and electronic device WO2022073417A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011079936.XA CN114330374A (en) 2020-10-10 2020-10-10 Fusion scene perception machine translation method, storage medium and electronic equipment
CN202011079936.X 2020-10-10

Publications (1)

Publication Number Publication Date
WO2022073417A1 (en)

Family

ID=81032960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119655 WO2022073417A1 (en) 2020-10-10 2021-09-22 Fusion scene perception machine translation method, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN114330374A (en)
WO (1) WO2022073417A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391839A (en) * 2014-11-13 2015-03-04 百度在线网络技术(北京)有限公司 Method and device for machine translation
CN109074242A (en) * 2016-05-06 2018-12-21 电子湾有限公司 Metamessage is used in neural machine translation
US20200034436A1 (en) * 2018-07-26 2020-01-30 Google Llc Machine translation using neural network models
CN110263353A (en) * 2019-06-25 2019-09-20 北京金山数字娱乐科技有限公司 A kind of machine translation method and device
CN111709431A (en) * 2020-06-15 2020-09-25 厦门大学 Instant translation method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220138439A1 (en) * 2020-11-04 2022-05-05 Adobe Inc. Multi-lingual tagging for digital images
US11645478B2 (en) * 2020-11-04 2023-05-09 Adobe Inc. Multi-lingual tagging for digital images
CN114821257A (en) * 2022-04-26 2022-07-29 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation
CN114821257B (en) * 2022-04-26 2024-04-05 中国科学院大学 Intelligent processing method, device and equipment for video stream and natural language in navigation
CN115312029A (en) * 2022-10-12 2022-11-08 之江实验室 Voice translation method and system based on voice depth characterization mapping

Also Published As

Publication number Publication date
CN114330374A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2020151387A1 (en) Recommendation method based on user exercise state, and electronic device
WO2022073417A1 (en) Fusion scene perception machine translation method, storage medium, and electronic device
WO2021244457A1 (en) Video generation method and related apparatus
US20220176200A1 (en) Method for Assisting Fitness and Electronic Apparatus
WO2021104485A1 (en) Photographing method and electronic device
WO2022052776A1 (en) Human-computer interaction method, and electronic device and system
WO2023125335A1 (en) Question and answer pair generation method and electronic device
CN112214636A (en) Audio file recommendation method and device, electronic equipment and readable storage medium
CN112383664B (en) Device control method, first terminal device, second terminal device and computer readable storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN111564152A (en) Voice conversion method and device, electronic equipment and storage medium
CN114242037A (en) Virtual character generation method and device
WO2022037479A1 (en) Photographing method and photographing system
WO2023179490A1 (en) Application recommendation method and an electronic device
CN115437601B (en) Image ordering method, electronic device, program product and medium
CN113723397A (en) Screen capturing method and electronic equipment
CN113468929A (en) Motion state identification method and device, electronic equipment and storage medium
CN113538321A (en) Vision-based volume measurement method and terminal equipment
CN114822543A (en) Lip language identification method, sample labeling method, model training method, device, equipment and storage medium
CN114943976A (en) Model generation method and device, electronic equipment and storage medium
CN114080258B (en) Motion model generation method and related equipment
CN113742460B (en) Method and device for generating virtual roles
WO2021036562A1 (en) Prompting method for fitness training, and electronic device
CN112988984A (en) Feature acquisition method and device, computer equipment and storage medium
CN114547429A (en) Data recommendation method and device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21876935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21876935

Country of ref document: EP

Kind code of ref document: A1