CN116306694A - Multi-mode machine translation method based on pre-training - Google Patents

Multi-mode machine translation method based on pre-training

Info

Publication number
CN116306694A
Authority
CN
China
Prior art keywords: dom, webpage, information, visual, key content
Prior art date
Legal status
Pending
Application number
CN202310079488.0A
Other languages
Chinese (zh)
Inventor
田二林
李祖贺
李璞
吴怀广
梁维德
朱增超
张赛
Current Assignee
Zhengzhou Light Industry Technology Research Institute Co ltd
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou Light Industry Technology Research Institute Co ltd
Zhengzhou University of Light Industry
Priority date
Filing date
Publication date
Application filed by Zhengzhou Light Industry Technology Research Institute Co ltd and Zhengzhou University of Light Industry
Priority to CN202310079488.0A
Publication of CN116306694A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a multi-mode machine translation method based on pre-training, which comprises the following steps. Step S10: preprocess web pages and collect, over the network, a web page sample library based on visual characteristics. Step S20: extract the characteristics of the web page sample library and form a key content feature database through web page segmentation with adaptive sensing granularity. Step S30: perform data preprocessing on the key content feature database. Step S40: judge the web page data information with a classifier according to the result of key content feature extraction, thereby determining the sensing data information and its type. By combining the structural information of the reverse-mapped DOM tree with the visual information of the web page through HTML tag pairs, the invention collects the characteristics of key content DOM components comprehensively and adaptively controls the segmentation granularity of the web page segmentation algorithm, so that the segmented data information is closer to the real situation of the sensing information, and the classifier can effectively identify the sensing information blocks.

Description

Multi-mode machine translation method based on pre-training
Technical Field
The invention relates to the technical field of data processing, in particular to a multi-mode machine translation method based on pre-training.
Background
Multimodal machine translation is a multimodal task that introduces picture information corresponding to the text into the traditional machine translation process. Whereas machine translation has a development history of several decades, the multimodal machine translation problem has only been studied for a few years as a continuation of traditional neural machine translation. The current multimodal machine translation task aims to supplement additional information and enhance translation by using picture information matched with the bilingual text, and is a cross-modal, cross-domain research task. For the picture part of multimodal machine translation, it is difficult to train a picture feature extractor from scratch because existing multimodal machine translation datasets are small. Existing methods therefore rely on pre-trained models such as ResNet and RCNN to extract picture features and obtain sufficient capability to represent picture content. For the text part, the current data size is sufficient to train an excellent translation model from scratch, so existing work tends to ignore the impact of the text module, which is the core, on multimodal machine translation. In practice, many translation errors originate from the text translation itself, so the picture is needed to supply additional correct information as a supplement.
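As an illustration of the pre-trained picture feature extraction mentioned above, the following is a minimal, non-authoritative Python sketch that removes the classification head of a torchvision ResNet-50 and returns one global feature vector per image; the choice of backbone, weights, preprocessing and the helper name image_features are assumptions for illustration, not details prescribed by the invention.

    # Hypothetical sketch: picture feature extraction with a pre-trained ResNet-50.
    import torch
    from torchvision import models, transforms
    from PIL import Image

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
    encoder.eval()

    def image_features(path: str) -> torch.Tensor:
        """Return a 2048-dimensional global feature vector for one image."""
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            feat = encoder(img)      # shape (1, 2048, 1, 1) after global average pooling
        return feat.flatten(1)       # shape (1, 2048)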
With the wide use of the Internet, web pages have become an important carrier through which users obtain information. Existing search engines use web crawler software to capture web pages; the key content of each page must then be analyzed, non-key content such as advertisements, navigation bars and user comments must be removed, and a summary of the target page is provided to the user. As web page design becomes more complex and diverse and dynamic rendering becomes further popularized, much key content is added dynamically through JavaScript code, and the traditional approach of detecting key content through tag analysis of static HTML code can no longer keep up with increasingly complex web page design techniques. With the advent of page layering, dynamic rendering and similar technologies, key content cannot be identified well by relying only on basic tag features and semantic features. For an unknown web page, the question is by what means to identify whether sensing data information exists in the page and to accurately acquire the related conditions of the sensing data information it contains; doing so guarantees the accuracy of automatic identification of sensing data information, greatly reduces the workload of manual judgment, makes the calculation of the amount of sensing data information in the web page more effective, and further improves the accuracy of the sensing information.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a multi-mode machine translation method based on pre-training.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a multi-modal machine translation method based on pre-training, comprising:
step S10: preprocessing web pages, and collecting a web page sample library based on visual characteristics through the network;
step S20: extracting the characteristics of the web page sample library, and forming a key content feature database through web page segmentation with adaptive sensing granularity;
step S30: performing data preprocessing on the key content feature database;
step S40: judging the web page data information with a classifier according to the result of key content feature extraction, thereby determining the sensing data information and its type.
Preferably, in step S10, collecting a web page sample library based on visual features includes dynamically rendering the HTML file with headless Chrome, manually labeling the key content in the web page, parsing out all visible HTML tags and storing them in a tag set, extracting the DOM tree structure information and visual information corresponding to the tags, including node depth information and node visual position and size information, and collecting the DOM components labeled as key content and those labeled as non-key content as the initial sample set.
Preferably, in step S20, the features of the initial sample set include visual features such as DOM component position, size and color, and basic features such as the tag and sub-components.
Preferably, in step S20, the web page segmentation with adaptive sensing granularity includes an accept-or-reject judgment on DOM component samples; the judgment first classifies according to the number of DOM components, and then takes, in order, the overlapping areas of the DOM components and the size of their visual area as the judgment basis; for node pairs containing multiple DOM components, the pair is kept as a key content DOM component sample as long as the DOM components show sensing characteristics; otherwise the similarity among the DOM components is judged, and if their information content or visual style is inconsistent, the components are discarded as non-key content DOM component samples.
Preferably, in step S30, according to the dynamic rendering result of headless Chrome, the DOM components in the current page are traversed, visual features such as the position, size and color of the target DOM component and basic features such as the tag and sub-components are extracted, the above information is regularized to form a feature matrix of the DOM component, and the matrix is stored in a database; the feature matrix acquired by the feature extraction module is then processed, features that have no influence are removed, and the feature group with the greatest correlation with the key content is preliminarily obtained.
Preferably, in step S40, the key content feature extraction means that the feature vectors in the feature database are input into the trained classifier for accuracy testing by a machine learning algorithm, the accuracy of the training samples under different proportions is obtained, and the model with the highest accuracy is selected as the final model, thereby obtaining the decision result.
Compared with the prior art, the invention has the following beneficial effects: by combining the structural information of the reverse-mapped DOM tree with the visual information of the web page through HTML (hypertext markup language) tag pairs, collecting 1600 web page samples and rendering them with headless Chrome, the characteristics of key content DOM components can be collected more comprehensively and the segmentation granularity of the web page segmentation algorithm can be controlled adaptively, so that the relative depth information and the visual mapping between different tag pairs effectively adapt the segmentation granularity to the sensing information blocks during web page segmentation, the segmentation precision and effectiveness are improved, and the segmented data information is closer to the real situation of the sensing information; by comparing several general-purpose algorithms, the suitability of each algorithm for key content detection can be described more accurately, the changes between web page screenshots collected at different points on a preset time axis are analyzed, the visual information of the pixels where changes occur is extracted, and the judgment of sensing information blocks is effectively realized with the classifier.
Drawings
FIG. 1 is a block flow diagram of a multi-modal machine translation method based on pre-training in accordance with the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a multi-modal machine translation method based on pre-training includes:
step S10: preprocessing web pages, and collecting a web page sample library based on visual characteristics through the network;
step S20: extracting the characteristics of the web page sample library, and forming a key content feature database through web page segmentation with adaptive sensing granularity;
step S30: performing data preprocessing on the key content feature database;
step S40: judging the web page data information with a classifier according to the result of key content feature extraction, thereby determining the sensing data information and its type.
Specifically, in step S10, collecting a web page sample library based on visual features includes dynamically rendering the HTML file with headless Chrome, manually labeling the key content in the web page, parsing out all visible HTML tags and storing them in a tag set, extracting the DOM tree structure information and visual information corresponding to the tags, including node depth information and node visual position and size information, and collecting the DOM components labeled as key content and those labeled as non-key content as the initial sample set; the set contains 800 key content DOM component samples and 800 non-key content DOM component samples.
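To make this collection step concrete, the following is a minimal Python sketch of one possible implementation, assuming Selenium is used to drive headless Chrome; the placeholder URL, the depth script and the field names in the sample records are illustrative assumptions rather than details fixed by the invention, and the manual key content label is left to be filled in afterwards.

    # Illustrative sketch only: render a page with headless Chrome and record
    # per-node tag, depth, position and size for later manual labelling.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    driver.get("https://example.com")              # placeholder URL

    samples = []
    for el in driver.find_elements(By.CSS_SELECTOR, "*"):
        rect = el.rect                             # rendered position and size
        if rect["width"] == 0 or rect["height"] == 0:
            continue                               # skip nodes with no visual footprint
        depth = driver.execute_script(
            "let d = 0, n = arguments[0];"
            "while (n.parentElement) { d++; n = n.parentElement; }"
            "return d;", el)
        samples.append({
            "tag": el.tag_name,
            "depth": depth,
            "x": rect["x"], "y": rect["y"],
            "width": rect["width"], "height": rect["height"],
            "is_key_content": None,                # filled in by manual labelling
        })
    driver.quit()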
Specifically, in step S20, the features of the initial sample set include visual features such as DOM component position, size and color and basic features such as the tag and sub-components; the numerical features are normalized to eliminate dimensional differences, one-hot vector representations are constructed for the categorical features, and the result forms a feature matrix that is stored in a database.
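As a rough illustration of how such a feature matrix might be assembled, the sketch below scales the numerical columns and one-hot encodes the tag name with scikit-learn; the column names come from the collection sketch above and are assumptions, and colour or style features would be added the same way if they were extracted.

    # Sketch: build a feature matrix from the collected DOM component records.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    df = pd.DataFrame(samples)                     # records from the collection sketch

    numeric_cols = ["depth", "x", "y", "width", "height"]
    categorical_cols = ["tag"]

    pre = ColumnTransformer([
        ("num", MinMaxScaler(), numeric_cols),     # normalization removes the unit/dimension
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one-hot tag names
    ])
    feature_matrix = pre.fit_transform(df)         # one row per DOM component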
Specifically, in step S20, the web page segmentation with adaptive sensing granularity includes an accept-or-reject judgment on DOM component samples. The judgment first classifies according to the number of DOM components, and then takes, in order, the overlapping areas of the DOM components and the size of their visual area as the judgment basis. For node pairs containing multiple DOM components, the pair is kept as a key content DOM component sample as long as the DOM components show sensing characteristics; otherwise the similarity among the DOM components is judged, and if their information content or visual style is inconsistent, the components are discarded as non-key content DOM component samples.
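The decision rule above can be summarized in a short sketch. Everything below is a simplified reading: the overlap computation is a plain rectangle intersection, the area and similarity thresholds are placeholder values, and shows_sensing_feature and similarity stand in for predicates the invention does not spell out.

    # Simplified accept-or-reject sketch for a group of rendered DOM components.
    def overlap_area(a, b):
        """Rectangle intersection area of two rendered DOM components."""
        w = max(0.0, min(a["x"] + a["width"],  b["x"] + b["width"])  - max(a["x"], b["x"]))
        h = max(0.0, min(a["y"] + a["height"], b["y"] + b["height"]) - max(a["y"], b["y"]))
        return w * h

    def keep_as_key_content(components, shows_sensing_feature, similarity,
                            min_area=1000.0, sim_threshold=0.8):
        """Return True to keep the node pair as a key content sample, False to discard it."""
        if any(shows_sensing_feature(c) for c in components):
            return True                            # sensing characteristics win immediately
        # otherwise judge overlap, visual area and mutual similarity of content/style
        pairs = [(a, b) for i, a in enumerate(components) for b in components[i + 1:]]
        overlapping = all(overlap_area(a, b) > 0 for a, b in pairs)
        big_enough  = all(c["width"] * c["height"] >= min_area for c in components)
        similar     = all(similarity(a, b) >= sim_threshold for a, b in pairs)
        return overlapping and big_enough and similar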
Specifically, in step S30, according to the dynamic rendering result of headless Chrome, the DOM components in the current page are traversed, visual features such as the position, size and color of the target DOM component and basic features such as the tag and sub-components are extracted, the above information is regularized to form a feature matrix of the DOM component, and the matrix is stored in a database; the feature matrix acquired by the feature extraction module is then processed, features that have no influence are removed, and the feature group with the greatest correlation with the key content is preliminarily obtained.
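One plausible reading of this filtering step is variance filtering followed by univariate selection against the key content label; the scikit-learn selectors, the variance threshold and the value of k below are illustrative assumptions, and the labels are assumed to have been filled in manually after the collection step.

    # Sketch: drop near-constant ("no influence") features, then keep the group
    # most correlated with the key content label.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

    X = feature_matrix                             # from the feature-matrix sketch
    y = np.asarray([s["is_key_content"] for s in samples], dtype=int)  # 1 = key content (manual labels)

    X_var = VarianceThreshold(threshold=1e-4).fit_transform(X)   # remove near-constant columns
    selector = SelectKBest(score_func=f_classif, k=min(10, X_var.shape[1]))
    X_selected = selector.fit_transform(X_var, y)  # columns most correlated with the label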
Specifically, in step S40, the key content feature extraction means that the feature vectors in the feature database are input into the trained classifier for accuracy testing by a machine learning algorithm, the accuracy of the training samples under different proportions is obtained, and the model with the highest accuracy is selected as the final model, thereby obtaining the decision result.
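A compact sketch of this model selection loop is given below; the random forest classifier, the candidate training proportions and the use of plain accuracy are assumptions standing in for the unspecified machine learning algorithm.

    # Sketch: train at several train/test proportions and keep the most accurate model.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    best_model, best_acc = None, 0.0
    for train_ratio in (0.6, 0.7, 0.8, 0.9):       # illustrative proportions
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_selected, y, train_size=train_ratio, random_state=0, stratify=y)
        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        acc = accuracy_score(y_te, clf.predict(X_te))
        if acc > best_acc:
            best_model, best_acc = clf, acc        # keep the most accurate model
    # best_model is then used to judge whether a web page block is sensing information.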
To sum up: by combining the structural information of the reverse-mapped DOM tree with the visual information of the web page through HTML tag pairs, collecting 1600 web page samples and rendering them with headless Chrome, the invention can collect the characteristics of key content DOM components more comprehensively and adaptively control the segmentation granularity of the web page segmentation algorithm, so that the relative depth information and the visual mapping between different tag pairs effectively adapt the segmentation granularity to the sensing information blocks during web page segmentation, the segmentation precision and effectiveness are improved, and the segmented data information is closer to the real situation of the sensing information; by comparing several general-purpose algorithms, the suitability of each algorithm for key content detection can be described more accurately, the changes between web page screenshots collected at different points on a preset time axis are analyzed, the visual information of the pixels where changes occur is extracted, and the judgment of sensing information blocks is effectively realized with the classifier.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (4)

1. A multi-modal machine translation method based on pre-training, comprising:
step S10: preprocessing web pages, and collecting a web page sample library based on visual characteristics through the network;
step S20: extracting the characteristics of the web page sample library, and forming a key content feature database through web page segmentation with adaptive sensing granularity;
step S30: performing data preprocessing on the key content feature database;
step S40: judging the web page data information with a classifier according to the result of key content feature extraction, realizing the judgment of the sensing data information and of its type,
preferably, in step S10, collecting a web page sample library based on visual features includes dynamically rendering the HTML file with headless Chrome, manually labeling the key content in the web page, parsing out all visible HTML tags and storing them in a tag set, extracting the DOM tree structure information and visual information corresponding to the tags, including node depth information and node visual position and size information, and collecting the DOM components labeled as key content and those labeled as non-key content as the initial sample set.
2. The multi-modal machine translation method according to claim 1, wherein in step S20, the web page segmentation with adaptive sensing granularity includes an accept-or-reject judgment on DOM component samples; the judgment first classifies according to the number of DOM components, and then takes, in order, the overlapping areas of the DOM components and the size of their visual area as the judgment basis; for node pairs containing multiple DOM components, the pair is kept as a key content DOM component sample as long as the DOM components show sensing characteristics; otherwise the similarity among the DOM components is judged, and if their information content or visual style is inconsistent, the components are discarded as non-key content DOM component samples.
3. The multi-modal machine translation method according to claim 1, wherein in step S30, according to the dynamic rendering result of headless Chrome, the DOM components in the current page are traversed, visual features such as the position, size and color of the target DOM component and basic features such as the tag and sub-components are extracted, the above information is regularized to form a feature matrix of the DOM component, and the matrix is stored in a database; the feature matrix acquired by the feature extraction module is then processed, features that have no influence are removed, and the feature group with the greatest correlation with the key content is preliminarily obtained.
4. The multi-modal machine translation method based on pre-training according to claim 1, wherein in step S40, the key content feature extraction means that the feature vectors in the feature database are input into the trained classifier for accuracy testing by a machine learning algorithm, the accuracy of the training samples under different proportions is obtained, and the model with the highest accuracy is selected as the final model, thereby obtaining the decision result.
Application CN202310079488.0A, filed 2023-01-29 (priority date 2023-01-29): Multi-mode machine translation method based on pre-training. Status: Pending. Publication: CN116306694A (en).

Priority Applications (1)

Application Number: CN202310079488.0A (publication CN116306694A, en) · Priority Date: 2023-01-29 · Filing Date: 2023-01-29 · Title: Multi-mode machine translation method based on pre-training

Applications Claiming Priority (1)

Application Number: CN202310079488.0A (publication CN116306694A, en) · Priority Date: 2023-01-29 · Filing Date: 2023-01-29 · Title: Multi-mode machine translation method based on pre-training

Publications (1)

Publication Number: CN116306694A (en) · Publication Date: 2023-06-23

Family

ID=86795031

Family Applications (1)

Application Number: CN202310079488.0A · Status: Pending · Publication: CN116306694A (en) · Title: Multi-mode machine translation method based on pre-training

Country Status (1)

Country: CN · Publication: CN116306694A (en)

Similar Documents

Publication · Publication Date · Title
CN112347244B (en) Yellow-based and gambling-based website detection method based on mixed feature analysis
RU2666277C1 (en) Text segmentation
CN112541476B (en) Malicious webpage identification method based on semantic feature extraction
CN112800848A (en) Structured extraction method, device and equipment of information after bill identification
CN104915420B (en) Knowledge base data processing method and system
CN111753120A (en) Method and device for searching questions, electronic equipment and storage medium
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN103605690A (en) Device and method for recognizing advertising messages in instant messaging
CN111078979A (en) Method and system for identifying network credit website based on OCR and text processing technology
CN113326413A (en) Webpage information extraction method, system, server and storage medium
CN111680669A (en) Test question segmentation method and system and readable storage medium
CN114881043A (en) Deep learning model-based legal document semantic similarity evaluation method and system
CN113205046A (en) Method, system, device and medium for identifying question book
Meetei et al. Extraction and identification of manipuri and mizo texts from scene and document images
CN116306694A (en) Multi-mode machine translation method based on pre-training
CN114579796A (en) Machine reading understanding method and device
Madan et al. Parsing and summarizing infographics with synthetically trained icon detection
CN113936186A (en) Content identification method and device, electronic equipment and readable storage medium
CN113468889A (en) Method and device for extracting model information based on BERT pre-training
AU2018100324B4 (en) Image Analysis
Bhowmik Document Region Classification
CN115277211B (en) Text and image-based multi-mode pornography and gambling domain name automatic detection method
CN112200184B (en) Calligraphy area detection and author identification method in natural scene
Umatia et al. Text Recognition from Images
Westphal Efficient Document Image Binarization Using Heterogeneous Computing and Interactive Machine Learning

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination