CN116306694A - Multi-mode machine translation method based on pre-training - Google Patents

Multi-mode machine translation method based on pre-training

Info

Publication number
CN116306694A
Authority
CN
China
Prior art keywords: dom, webpage, information, visual, key content
Prior art date
Legal status
Pending
Application number
CN202310079488.0A
Other languages
Chinese (zh)
Inventor
田二林
李祖贺
李璞
吴怀广
梁维德
朱增超
张赛
Current Assignee
Zhengzhou Light Industry Technology Research Institute Co ltd
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou Light Industry Technology Research Institute Co ltd
Zhengzhou University of Light Industry
Priority date
Filing date
Publication date
Application filed by Zhengzhou Light Industry Technology Research Institute Co ltd and Zhengzhou University of Light Industry
Priority to CN202310079488.0A
Publication of CN116306694A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-mode machine translation method based on pre-training, which comprises the following steps. Step S10: preprocess web pages and collect, over the network, a web page sample library based on visual characteristics. Step S20: extract the characteristics of the web page sample library and form a key content feature database through web page segmentation with adaptive sensing granularity. Step S30: perform data preprocessing on the key content feature database. Step S40: judge the web page data information with a classifier according to the result of key content feature extraction, thereby determining the sensing data information and its type. By combining the structural information of the reverse-mapped DOM tree with the visual information of the web page through HTML tag pairs, the invention collects the characteristics of key content DOM components comprehensively and adaptively controls the segmentation granularity of the web page segmentation algorithm, so that the segmented data information is closer to the real situation of the sensing information, and the classifier can effectively identify the sensing information blocks.

Description

Multi-mode machine translation method based on pre-training
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-mode machine translation method based on pre-training.
Background
Multimodal machine translation is a multimodal task that introduces picture information corresponding to the text into the traditional machine translation process. Whereas machine translation has a development history of several decades, the multimodal machine translation problem has only been studied for a few years as a continuation of traditional neural machine translation. The current multimodal machine translation task aims to supplement additional information and enhance translation by using picture information matched with the bilingual text, and is a cross-modal, cross-domain research task. For the picture part of multimodal machine translation, it is difficult to train a picture feature extractor from scratch because existing multimodal machine translation datasets are small. Existing methods therefore rely on pre-trained models such as ResNet and RCNN to extract picture features and obtain sufficient capability to represent picture content. For the text part, the current data size is sufficient to train an excellent translation model from scratch, so existing work tends to ignore the impact of the text module, which is the core, on multimodal machine translation. In practice, many translation errors originate from the text translation itself, so the picture is needed to supply additional correct information as a supplement.
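As an illustration of the pre-trained picture feature extraction mentioned above, the following is a minimal, non-authoritative Python sketch that removes the classification head of a torchvision ResNet-50 and returns one global feature vector per image; the choice of backbone, weights, preprocessing and the helper name image_features are assumptions for illustration, not details prescribed by the invention.

    # Hypothetical sketch: picture feature extraction with a pre-trained ResNet-50.
    import torch
    from torchvision import models, transforms
    from PIL import Image

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
    encoder.eval()

    def image_features(path: str) -> torch.Tensor:
        """Return a 2048-dimensional global feature vector for one image."""
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            feat = encoder(img)      # shape (1, 2048, 1, 1) after global average pooling
        return feat.flatten(1)       # shape (1, 2048)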
With the wide use of the Internet, web pages have become an important carrier through which users obtain information. Existing search engines use web crawler software to capture web pages; the key content of each page must then be analyzed, non-key content such as advertisements, navigation bars and user comments must be removed, and a summary of the target page is provided to the user. As web page design becomes more complex and diverse and dynamic rendering becomes further popularized, much key content is added dynamically through JavaScript code, and the traditional approach of detecting key content through tag analysis of static HTML code can no longer keep up with increasingly complex web page design techniques. With the advent of page layering, dynamic rendering and similar technologies, key content cannot be identified well by relying only on basic tag features and semantic features. For an unknown web page, the question is by what means to identify whether sensing data information exists in the page and to accurately acquire the related conditions of the sensing data information it contains; doing so guarantees the accuracy of automatic identification of sensing data information, greatly reduces the workload of manual judgment, makes the calculation of the amount of sensing data information in the web page more effective, and further improves the accuracy of the sensing information.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a multi-mode machine translation method based on pre-training.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a multi-modal machine translation method based on pre-training, comprising:
step S10: preprocessing web pages, and collecting a web page sample library based on visual characteristics through the network;
step S20: extracting the characteristics of the web page sample library, and forming a key content feature database through web page segmentation with adaptive sensing granularity;
step S30: performing data preprocessing on the key content feature database;
step S40: judging the web page data information with a classifier according to the result of key content feature extraction, thereby determining the sensing data information and its type.
Preferably, in step S10, collecting a web page sample library based on visual features includes dynamically rendering the HTML file with headless Chrome, manually labeling the key content in the web page, parsing out all visible HTML tags and storing them in a tag set, extracting the DOM tree structure information and visual information corresponding to the tags, including node depth information and node visual position and size information, and collecting the DOM components labeled as key content and those labeled as non-key content as the initial sample set.
Preferably, in step S20, the features of the initial sample set include visual features such as DOM component position, size and color, and basic features such as the tag and sub-components.
Preferably, in step S20, the web page segmentation with adaptive sensing granularity includes an accept-or-reject judgment on DOM component samples; the judgment first classifies according to the number of DOM components, and then takes, in order, the overlapping areas of the DOM components and the size of their visual area as the judgment basis; for node pairs containing multiple DOM components, the pair is kept as a key content DOM component sample as long as the DOM components show sensing characteristics; otherwise the similarity among the DOM components is judged, and if their information content or visual style is inconsistent, the components are discarded as non-key content DOM component samples.
Preferably, in step S30, according to the dynamic rendering result of headless Chrome, the DOM components in the current page are traversed, visual features such as the position, size and color of the target DOM component and basic features such as the tag and sub-components are extracted, the above information is regularized to form a feature matrix of the DOM component, and the matrix is stored in a database; the feature matrix acquired by the feature extraction module is then processed, features that have no influence are removed, and the feature group with the greatest correlation with the key content is preliminarily obtained.
Preferably, in step S40, the key content feature extraction means that the feature vectors in the feature database are input into the trained classifier for accuracy testing by a machine learning algorithm, the accuracy of the training samples under different proportions is obtained, and the model with the highest accuracy is selected as the final model, thereby obtaining the decision result.
Compared with the prior art, the invention has the following beneficial effects: by combining the structural information of the reverse-mapped DOM tree with the visual information of the web page through HTML (hypertext markup language) tag pairs, collecting 1600 web page samples and rendering them with headless Chrome, the characteristics of key content DOM components can be collected more comprehensively and the segmentation granularity of the web page segmentation algorithm can be controlled adaptively, so that the relative depth information and the visual mapping between different tag pairs effectively adapt the segmentation granularity to the sensing information blocks during web page segmentation, the segmentation precision and effectiveness are improved, and the segmented data information is closer to the real situation of the sensing information; by comparing several general-purpose algorithms, the suitability of each algorithm for key content detection can be described more accurately, the changes between web page screenshots collected at different points on a preset time axis are analyzed, the visual information of the pixels where changes occur is extracted, and the judgment of sensing information blocks is effectively realized with the classifier.
Drawings
FIG. 1 is a block flow diagram of a multi-modal machine translation method based on pre-training in accordance with the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a multi-modal machine translation method based on pre-training includes:
step S10: preprocessing web pages, and collecting a web page sample library based on visual characteristics through the network;
step S20: extracting the characteristics of the web page sample library, and forming a key content feature database through web page segmentation with adaptive sensing granularity;
step S30: performing data preprocessing on the key content feature database;
step S40: judging the web page data information with a classifier according to the result of key content feature extraction, thereby determining the sensing data information and its type.
Specifically, in step S10, collecting a web page sample library based on visual features includes dynamically rendering the HTML file with headless Chrome, manually labeling the key content in the web page, parsing out all visible HTML tags and storing them in a tag set, extracting the DOM tree structure information and visual information corresponding to the tags, including node depth information and node visual position and size information, and collecting the DOM components labeled as key content and those labeled as non-key content as the initial sample set; the set contains 800 key content DOM component samples and 800 non-key content DOM component samples.
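To make this collection step concrete, the following is a minimal Python sketch of one possible implementation, assuming Selenium is used to drive headless Chrome; the placeholder URL, the depth script and the field names in the sample records are illustrative assumptions rather than details fixed by the invention, and the manual key content label is left to be filled in afterwards.

    # Illustrative sketch only: render a page with headless Chrome and record
    # per-node tag, depth, position and size for later manual labelling.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    driver.get("https://example.com")              # placeholder URL

    samples = []
    for el in driver.find_elements(By.CSS_SELECTOR, "*"):
        rect = el.rect                             # rendered position and size
        if rect["width"] == 0 or rect["height"] == 0:
            continue                               # skip nodes with no visual footprint
        depth = driver.execute_script(
            "let d = 0, n = arguments[0];"
            "while (n.parentElement) { d++; n = n.parentElement; }"
            "return d;", el)
        samples.append({
            "tag": el.tag_name,
            "depth": depth,
            "x": rect["x"], "y": rect["y"],
            "width": rect["width"], "height": rect["height"],
            "is_key_content": None,                # filled in by manual labelling
        })
    driver.quit()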
Specifically, in step S20, the features of the initial sample set include visual features such as DOM component position, size and color and basic features such as the tag and sub-components; the numerical features are normalized to eliminate dimensional differences, one-hot vector representations are constructed for the categorical features, and the result forms a feature matrix that is stored in a database.
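As a rough illustration of how such a feature matrix might be assembled, the sketch below scales the numerical columns and one-hot encodes the tag name with scikit-learn; the column names come from the collection sketch above and are assumptions, and colour or style features would be added the same way if they were extracted.

    # Sketch: build a feature matrix from the collected DOM component records.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    df = pd.DataFrame(samples)                     # records from the collection sketch

    numeric_cols = ["depth", "x", "y", "width", "height"]
    categorical_cols = ["tag"]

    pre = ColumnTransformer([
        ("num", MinMaxScaler(), numeric_cols),     # normalization removes the unit/dimension
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one-hot tag names
    ])
    feature_matrix = pre.fit_transform(df)         # one row per DOM component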
Specifically, in step S20, the web page segmentation with adaptive sensing granularity includes an accept-or-reject judgment on DOM component samples. The judgment first classifies according to the number of DOM components, and then takes, in order, the overlapping areas of the DOM components and the size of their visual area as the judgment basis. For node pairs containing multiple DOM components, the pair is kept as a key content DOM component sample as long as the DOM components show sensing characteristics; otherwise the similarity among the DOM components is judged, and if their information content or visual style is inconsistent, the components are discarded as non-key content DOM component samples.
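The decision rule above can be summarized in a short sketch. Everything below is a simplified reading: the overlap computation is a plain rectangle intersection, the area and similarity thresholds are placeholder values, and shows_sensing_feature and similarity stand in for predicates the invention does not spell out.

    # Simplified accept-or-reject sketch for a group of rendered DOM components.
    def overlap_area(a, b):
        """Rectangle intersection area of two rendered DOM components."""
        w = max(0.0, min(a["x"] + a["width"],  b["x"] + b["width"])  - max(a["x"], b["x"]))
        h = max(0.0, min(a["y"] + a["height"], b["y"] + b["height"]) - max(a["y"], b["y"]))
        return w * h

    def keep_as_key_content(components, shows_sensing_feature, similarity,
                            min_area=1000.0, sim_threshold=0.8):
        """Return True to keep the node pair as a key content sample, False to discard it."""
        if any(shows_sensing_feature(c) for c in components):
            return True                            # sensing characteristics win immediately
        # otherwise judge overlap, visual area and mutual similarity of content/style
        pairs = [(a, b) for i, a in enumerate(components) for b in components[i + 1:]]
        overlapping = all(overlap_area(a, b) > 0 for a, b in pairs)
        big_enough  = all(c["width"] * c["height"] >= min_area for c in components)
        similar     = all(similarity(a, b) >= sim_threshold for a, b in pairs)
        return overlapping and big_enough and similar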
Specifically, in step S30, according to the dynamic rendering result of headless Chrome, the DOM components in the current page are traversed, visual features such as the position, size and color of the target DOM component and basic features such as the tag and sub-components are extracted, the above information is regularized to form a feature matrix of the DOM component, and the matrix is stored in a database; the feature matrix acquired by the feature extraction module is then processed, features that have no influence are removed, and the feature group with the greatest correlation with the key content is preliminarily obtained.
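One plausible reading of this filtering step is variance filtering followed by univariate selection against the key content label; the scikit-learn selectors, the variance threshold and the value of k below are illustrative assumptions, and the labels are assumed to have been filled in manually after the collection step.

    # Sketch: drop near-constant ("no influence") features, then keep the group
    # most correlated with the key content label.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

    X = feature_matrix                             # from the feature-matrix sketch
    y = np.asarray([s["is_key_content"] for s in samples], dtype=int)  # 1 = key content (manual labels)

    X_var = VarianceThreshold(threshold=1e-4).fit_transform(X)   # remove near-constant columns
    selector = SelectKBest(score_func=f_classif, k=min(10, X_var.shape[1]))
    X_selected = selector.fit_transform(X_var, y)  # columns most correlated with the label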
Specifically, in step S40, the key content feature extraction means that the feature vectors in the feature database are input into the trained classifier for accuracy testing by a machine learning algorithm, the accuracy of the training samples under different proportions is obtained, and the model with the highest accuracy is selected as the final model, thereby obtaining the decision result.
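A compact sketch of this model selection loop is given below; the random forest classifier, the candidate training proportions and the use of plain accuracy are assumptions standing in for the unspecified machine learning algorithm.

    # Sketch: train at several train/test proportions and keep the most accurate model.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    best_model, best_acc = None, 0.0
    for train_ratio in (0.6, 0.7, 0.8, 0.9):       # illustrative proportions
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_selected, y, train_size=train_ratio, random_state=0, stratify=y)
        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        acc = accuracy_score(y_te, clf.predict(X_te))
        if acc > best_acc:
            best_model, best_acc = clf, acc        # keep the most accurate model
    # best_model is then used to judge whether a web page block is sensing information.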
To sum up: by combining the structural information of the reverse-mapped DOM tree with the visual information of the web page through HTML tag pairs, collecting 1600 web page samples and rendering them with headless Chrome, the invention can collect the characteristics of key content DOM components more comprehensively and adaptively control the segmentation granularity of the web page segmentation algorithm, so that the relative depth information and the visual mapping between different tag pairs effectively adapt the segmentation granularity to the sensing information blocks during web page segmentation, the segmentation precision and effectiveness are improved, and the segmented data information is closer to the real situation of the sensing information; by comparing several general-purpose algorithms, the suitability of each algorithm for key content detection can be described more accurately, the changes between web page screenshots collected at different points on a preset time axis are analyzed, the visual information of the pixels where changes occur is extracted, and the judgment of sensing information blocks is effectively realized with the classifier.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (4)

1. A multi-modal machine translation method based on pre-training, comprising:
step S10: preprocessing web pages, and collecting a web page sample library based on visual characteristics through the network;
step S20: extracting the characteristics of the web page sample library, and forming a key content feature database through web page segmentation with adaptive sensing granularity;
step S30: performing data preprocessing on the key content feature database;
step S40: judging the web page data information with a classifier according to the result of key content feature extraction, realizing the judgment of the sensing data information and of its type,
preferably, in step S10, collecting a web page sample library based on visual features includes dynamically rendering the HTML file with headless Chrome, manually labeling the key content in the web page, parsing out all visible HTML tags and storing them in a tag set, extracting the DOM tree structure information and visual information corresponding to the tags, including node depth information and node visual position and size information, and collecting the DOM components labeled as key content and those labeled as non-key content as the initial sample set.
2. The multi-modal machine translation method according to claim 1, wherein in step S20, the web page segmentation with adaptive sensing granularity includes an accept-or-reject judgment on DOM component samples; the judgment first classifies according to the number of DOM components, and then takes, in order, the overlapping areas of the DOM components and the size of their visual area as the judgment basis; for node pairs containing multiple DOM components, the pair is kept as a key content DOM component sample as long as the DOM components show sensing characteristics; otherwise the similarity among the DOM components is judged, and if their information content or visual style is inconsistent, the components are discarded as non-key content DOM component samples.
3. The multi-modal machine translation method according to claim 1, wherein in step S30, according to the dynamic rendering result of headless Chrome, the DOM components in the current page are traversed, visual features such as the position, size and color of the target DOM component and basic features such as the tag and sub-components are extracted, the above information is regularized to form a feature matrix of the DOM component, and the matrix is stored in a database; the feature matrix acquired by the feature extraction module is then processed, features that have no influence are removed, and the feature group with the greatest correlation with the key content is preliminarily obtained.
4. The multi-modal machine translation method based on pre-training according to claim 1, wherein in step S40, the key content feature extraction means that the feature vectors in the feature database are input into the trained classifier for accuracy testing by a machine learning algorithm, the accuracy of the training samples under different proportions is obtained, and the model with the highest accuracy is selected as the final model, thereby obtaining the decision result.
Application CN202310079488.0A, filed 2023-01-29 (priority date 2023-01-29): Multi-mode machine translation method based on pre-training. Status: Pending. Publication: CN116306694A (en).

Priority Applications (1)

Application Number: CN202310079488.0A (publication CN116306694A, en) · Priority Date: 2023-01-29 · Filing Date: 2023-01-29 · Title: Multi-mode machine translation method based on pre-training

Applications Claiming Priority (1)

Application Number: CN202310079488.0A (publication CN116306694A, en) · Priority Date: 2023-01-29 · Filing Date: 2023-01-29 · Title: Multi-mode machine translation method based on pre-training

Publications (1)

Publication Number: CN116306694A (en) · Publication Date: 2023-06-23

Family

ID=86795031

Family Applications (1)

Application Number: CN202310079488.0A · Status: Pending · Publication: CN116306694A (en) · Title: Multi-mode machine translation method based on pre-training

Country Status (1)

Country: CN · Publication: CN116306694A (en)

Similar Documents

Publication · Publication Date · Title
CN112347244B (en) Yellow-based and gambling-based website detection method based on mixed feature analysis
RU2666277C1 (en) Text segmentation
CN112541476B (en) Malicious webpage identification method based on semantic feature extraction
CN112800848A (en) Structured extraction method, device and equipment of information after bill identification
CN104915420B (en) Knowledge base data processing method and system
CN111753120A (en) Method and device for searching questions, electronic equipment and storage medium
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN103605690A (en) Device and method for recognizing advertising messages in instant messaging
CN111078979A (en) Method and system for identifying network credit website based on OCR and text processing technology
CN113326413A (en) Webpage information extraction method, system, server and storage medium
CN111680669A (en) Test question segmentation method and system and readable storage medium
CN114881043A (en) Deep learning model-based legal document semantic similarity evaluation method and system
CN113205046A (en) Method, system, device and medium for identifying question book
Meetei et al. Extraction and identification of manipuri and mizo texts from scene and document images
CN116306694A (en) Multi-mode machine translation method based on pre-training
CN114579796A (en) Machine reading understanding method and device
Madan et al. Parsing and summarizing infographics with synthetically trained icon detection
CN113936186A (en) Content identification method and device, electronic equipment and readable storage medium
CN113468889A (en) Method and device for extracting model information based on BERT pre-training
AU2018100324B4 (en) Image Analysis
Bhowmik Document Region Classification
CN115277211B (en) Text and image-based multi-mode pornography and gambling domain name automatic detection method
CN112200184B (en) Calligraphy area detection and author identification method in natural scene
Umatia et al. Text Recognition from Images
Westphal Efficient Document Image Binarization Using Heterogeneous Computing and Interactive Machine Learning

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination