CN117648410A

CN117648410A - Multi-language text data analysis system and method

Info

Publication number: CN117648410A
Application number: CN202410123385.4A
Authority: CN
Inventors: 孙兆洋; 隋媛
Original assignee: China National Institute of Standardization
Current assignee: China National Institute of Standardization
Priority date: 2024-01-30
Filing date: 2024-01-30
Publication date: 2024-03-05
Anticipated expiration: 2044-01-30
Also published as: CN117648410B

Abstract

The invention discloses a multilingual text data analysis system and method in the technical field of smart city information text processing. When a translation is unqualified, the region containing the translation error is located precisely, which supports subsequent correction and avoids the coarse whole-text replacement used by traditional methods. Through the analysis framework modeling module and the automatic classification module, the system automatically identifies key topics and entities in smart city text, classifies multilingual text and extracts its topics, and gives city managers and decision makers a more intuitive and efficient information reference. Through the application word quantization and matching module, the system associates text data with the smart city field and automatically identifies and classifies field-specific information, making the analysis results more specialized and targeted.

Description

Multi-language text data analysis system and method

Technical Field

The invention relates to the technical field of smart city information text processing, in particular to a multilingual text data analysis system and method.

Background

The development of smart cities generates large volumes of information and data, including multilingual text data such as social media text, event reports, news stories, and promotional text. These data carry key information about smart city operation, management and decision making, and are important for realizing city intelligence, improving residents' quality of life and promoting sustainable development. Fully mining and using the value of these multilingual text data is therefore an urgent need.

However, conventional text processing has a series of problems in multilingual scenarios. First, unqualified translation quality may distort information, mislead decisions, and even affect the overall direction of urban development. When conventional text processing finds that a translation is unqualified, it generally replaces the whole text; this approach does not locate the position of the error, which makes targeted repair difficult and reduces the accuracy and reliability of data processing.

Second, conventional text processing methods are inefficient on large-scale multilingual text data. The volume of smart city data is huge, and conventional methods cannot classify and process it quickly enough, which affects the real-time availability of information and the timeliness of decision making. Improving the efficiency of text processing is therefore an important task.

Disclosure of Invention

(I) Technical problem to be solved

Aiming at the defects of the prior art, the invention provides a multi-language text data analysis system and a multi-language text data analysis method, which are used for solving the problems in the background art.

(II) Technical solution

To achieve the above purpose, the invention is realized by the following technical scheme: a multi-language text data analysis method comprising the following steps:

step one, collecting multilingual smart city social media text data, multilingual event reports, news reports and promotional texts, and establishing a text data set;

step two, training a multi-language mapping model: collecting large-scale text corpora in different languages, extracting training word embeddings for deep training of the multi-language mapping model, and optimizing the model by adopting a shared word embedding space in which the different languages share the same embedding matrix;

step three, after performing first text processing on the text data set, extracting the source language and the target language of the text data set, carrying out semantic space modeling of the source language and the target language on the basis of the multi-language mapping model, translating to obtain a first target language translation result, collecting difference information between the source language text and the first target language translation result, and calculating a difference coefficient Cy from the difference information;

step four, comparing the obtained difference coefficient Cy with a standard similarity threshold R; when Cy is less than or equal to R, the first target language translation is qualified, and the qualified text data is sorted by structure size to establish a first correction data set; when Cy is greater than R, the first target language translation is unqualified, and the regions containing translation errors are matched and located to form error text regions;

step five, establishing an analysis framework model, mapping the first correction data set into the analysis framework model, extracting the framework structure of the data set, carrying out second indentation processing on the framework structures, and then classifying them according to smart city keywords.
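The routing decision in step four can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `TranslatedText` record and the `split_by_threshold` function are hypothetical names, and "structure size" is assumed here to mean the length of the translated text, which the source does not define.

```python
from dataclasses import dataclass


@dataclass
class TranslatedText:
    source: str
    translation: str
    cy: float  # difference coefficient computed in step three


def split_by_threshold(items, r):
    """Split items into (first correction data set, error text regions).

    Items with Cy <= R are qualified; items with Cy > R are unqualified.
    """
    qualified = [t for t in items if t.cy <= r]
    unqualified = [t for t in items if t.cy > r]
    # Assumption: "structure size" is approximated by translation length.
    qualified.sort(key=lambda t: len(t.translation))
    return qualified, unqualified
```

Only the unqualified items are passed on for correction, which is what distinguishes this flow from whole-text replacement.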

Preferably, the third step includes:

S31, performing first text processing on the text data set, which includes segmenting the source-language text of each text data item into words, removing stop words and punctuation marks, and filtering noise;

S32, stemming or lemmatizing the segmented words to reduce inflectional variation;

S33, translating the source-language text into the target language in the shared semantic space through the multi-language mapping model, obtaining a first target language translation result by sentence vector translation;

S34, collecting difference information between the source-language text and the first target language translation result, and calculating a difference coefficient Cy from the difference information.
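The first text processing (word segmentation, stop-word and punctuation removal, then stemming) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the stop-word list is a toy subset, the tokenizer only handles ASCII text, and `crude_stem` is a rough suffix stripper standing in for a real stemmer or lemmatizer. A multilingual deployment would need language-specific tools for each of these.

```python
import re

# Illustrative stop-word list only; real lists are language-specific.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize (dropping punctuation), remove stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```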

Preferably, the difference coefficient Cy is calculated by the following three calculation methods:

(1) Let $\mathbf{x} = (x_1, \dots, x_n)$ denote the sentence vector in the source language and $\mathbf{y} = (y_1, \dots, y_n)$ the sentence vector after translation into the target language; the difference coefficient Cy is obtained as the Euclidean distance:

$$Cy = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where n represents the dimension of the vectors, i.e., the length of the word embedding, and i represents the index of the vector dimension;

(2) Calculating the degree of dispersion between the source-language sentence vector $\mathbf{x}$ and the translated target-language sentence vector $\mathbf{y}$ to obtain the difference coefficient Cy: [formula not reproduced]

(3) From the source-language sentence vector $\mathbf{x}$ and the translated target-language sentence vector $\mathbf{y}$, the difference coefficient Cy is calculated by the following formula: [formula not reproduced]

When the difference coefficient Cy is less than or equal to the standard similarity threshold R, the first target language translation is qualified; when Cy is greater than R, the first target language translation is unqualified, and the larger the difference, the lower the translation quality.
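As a worked illustration of calculation method (1), the Euclidean distance between two sentence vectors and the comparison against R can be written as follows. The function names are hypothetical, and the vectors are assumed to be plain sequences of floats of equal dimension n.

```python
import math


def difference_coefficient(src_vec, tgt_vec):
    """Euclidean distance between source and translated sentence vectors."""
    if len(src_vec) != len(tgt_vec):
        raise ValueError("both vectors must have the same dimension n")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(src_vec, tgt_vec)))


def is_qualified(cy, r):
    """A translation is qualified when Cy <= R."""
    return cy <= r
```

For example, the vectors (0, 0) and (3, 4) give Cy = 5, which is qualified for any threshold R of at least 5.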

Preferably, when the difference coefficient Cy is larger than the standard similarity threshold value R, the first target language translation is unqualified and has abnormal difference; and the text which is not qualified in the translation of the first target language is matched and positioned according to the difference value of the difference coefficient Cy and the standard similarity threshold value R, and an error text area is formed.
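One way to read "matched and positioned according to the difference value" is to score each sentence separately and rank the failing ones by how far Cy exceeds R. The sketch below assumes sentence-level granularity and a precomputed Cy per sentence; neither detail is stated in the source, and `locate_error_regions` is a hypothetical name.

```python
def locate_error_regions(sentence_cy, r):
    """Return (sentence_index, excess) pairs for sentences with Cy > R.

    sentence_cy is a list of (sentence_index, cy) pairs. The result is
    ordered with the largest excess (Cy - R) first.
    """
    flagged = [(idx, cy - r) for idx, cy in sentence_cy if cy > r]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```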

Preferably, the error text region is acquired and second text supplementing processing is performed on it, including: re-translation, manual review, term revision, context adjustment, and fine-tuning of the multilingual mapping model; a second improved text is obtained after the second text supplementing processing; steps two to four are repeated until the difference coefficient Cy is less than or equal to the standard similarity threshold R, and the second improved text iteratively replaces the unqualified translation in the error text region and is incorporated into the first correction data set.
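The repeat-until-qualified loop can be sketched as below. The `improve` and `score` callables are placeholders for the second text supplementing processing and the Cy calculation respectively; the `max_rounds` cap is an added safeguard that the source does not mention.

```python
def correct_until_qualified(text, improve, score, r, max_rounds=5):
    """Revise text until score(text) <= r or max_rounds is reached.

    Returns (final_text, final_cy, qualified).
    """
    cy = score(text)
    rounds = 0
    while cy > r and rounds < max_rounds:
        text = improve(text)  # e.g. re-translation or term revision
        cy = score(text)
        rounds += 1
    return text, cy, cy <= r
```

A text that ends the loop qualified would then replace the unqualified translation and join the first correction data set; one that does not would remain flagged.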

Preferably, the re-translation re-translates the text regions marked as errors using different translation models; a plurality of machine translation engines are used so that the translation results of the different engines can be compared and the best translation selected;

the manual review has a translation expert review the error region manually, marking the specific problem type and the review and repair result of the error region during the review;

the term revision updates the database according to big data and revises the smart city terms in the error region;

the context adjustment analyzes the context of the error region and reconstructs it accordingly, so that it conforms to the culture and expression habits of the target-language locale.
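Selecting the best result among several translation engines can be sketched as picking the candidate with the smallest difference coefficient. This assumes Cy is the Euclidean distance of method (1) and that an `embed` callable maps a translation to a sentence vector; the engine names and the `pick_best_translation` function are hypothetical.

```python
import math


def pick_best_translation(source_vec, candidates, embed):
    """Return (engine, text, cy) for the candidate closest to source_vec.

    candidates maps an engine name to its translated text.
    """
    if not candidates:
        raise ValueError("at least one candidate translation is required")
    scored = [
        (math.dist(source_vec, embed(text)), engine, text)
        for engine, text in candidates.items()
    ]
    cy, engine, text = min(scored)  # lowest Cy wins
    return engine, text, cy
```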

Preferably, the fifth step includes:

S51, using natural language processing technology, defining an analysis framework comprising topics of interest, keywords and entities according to the common keywords required by smart city tasks;

S52, extracting from the data of the first correction data set the structure defined in the analysis framework model, and carrying out second indentation processing on each text data item in the first correction data set;

the second indentation processing divides each text data item into 1-3 paragraphs and extracts from the 1-3 paragraphs the keyword information, namely the noun, the verb and the adjective that occur the greatest number of times, obtaining a first matched noun, a first matched verb and a first matched adjective, which are combined to form a first compound sentence XJ1;

S53, extracting from the 1-3 paragraphs of each text data item the application words of smart city scenes, including intelligent transportation, internet of things, energy management, public service, smart medical treatment, smart security, environment detection, smart education, smart retail and smart communities, quantifying the number of occurrences of each extracted application word, and fusing the application words as labels of the form Yyc/x (application word Yyc, extracted x times) at the head and tail of the first compound sentence XJ1 to form a second compound sentence XJ2, so that each text data item corresponds to one text sentence carrying application word information.
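The construction of XJ1 and XJ2 can be illustrated with a short Python sketch. Several details are assumptions: the input is taken to be already part-of-speech tagged as (word, tag) pairs, the application word list is a three-item subset, and the bracketed `word/count` label format at head and tail is just one plausible rendering of the Yyc/x labels.

```python
from collections import Counter

# Illustrative subset of the application words listed in S53.
APPLICATION_WORDS = [
    "intelligent transportation",
    "internet of things",
    "energy management",
]


def build_xj1(tagged_tokens):
    """Join the most frequent noun, verb and adjective into one clause."""
    parts = []
    for pos in ("NOUN", "VERB", "ADJ"):
        counts = Counter(word for word, tag in tagged_tokens if tag == pos)
        if counts:
            parts.append(counts.most_common(1)[0][0])
    return " ".join(parts)


def build_xj2(xj1, text):
    """Attach application-word labels (word/count) at head and tail of XJ1."""
    lowered = text.lower()
    labels = [f"{w}/{lowered.count(w)}" for w in APPLICATION_WORDS if w in lowered]
    if not labels:
        return xj1
    tag = "[" + ", ".join(labels) + "]"
    return f"{tag} {xj1} {tag}"
```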

Preferably, the fifth step further includes:

S54, defining keywords of the smart city field according to the subject information of the smart city;

S55, matching the keywords from S54 against the second compound sentence XJ2; for each successfully matched keyword, adding the corresponding classification label to the text data; if no keyword matches, adding a default classification label indicating that the text data does not belong to the smart city field;

S56, manually reviewing the text data that was not matched automatically and, if the text data is found to be relevant, adding the classification labels automatically.
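Steps S54 to S56 amount to keyword lookup with a default label and a manual review queue. The sketch below uses a made-up keyword table and label names; the real keyword sets would come from the smart city subject information described in S54.

```python
import re

# Hypothetical keyword table: classification label -> domain keywords.
DOMAIN_KEYWORDS = {
    "transport": ["bus", "traffic"],
    "energy": ["grid", "power"],
}
DEFAULT_LABEL = "non-smart-city"


def classify(xj2):
    """Return every label whose keywords appear in XJ2, else the default."""
    words = set(re.findall(r"[a-z]+", xj2.lower()))
    labels = [
        label for label, keywords in DOMAIN_KEYWORDS.items()
        if words & set(keywords)
    ]
    return labels or [DEFAULT_LABEL]


def needs_manual_review(labels):
    """Texts that only received the default label go to manual review (S56)."""
    return labels == [DEFAULT_LABEL]
```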

A multi-language text data analysis system comprises a data acquisition module, a training model module, a text preprocessing module, a translation module, a difference evaluation module, an analysis frame modeling module, an application word quantization and matching module, a re-translation and manual auditing module and an automatic classification module;

the data acquisition module is used for acquiring text data, multilingual event reports, news reports and propaganda texts related to multilingual smart city social media and establishing a text data set;

the training model module is used for collecting large-scale text corpus from different languages, extracting training words, embedding the training words into the multi-language mapping model for deep training, and optimizing the multi-language mapping model;

the text preprocessing module is used for performing first text processing on the text data set, including word segmentation, removal of stop words and punctuation marks, noise filtering, and stemming or lemmatization;

the translation module is used for translating the source language text into the target language with the multilingual mapping model, and then calculating the difference information between the source language text and the target language translation result to obtain a difference coefficient Cy;

the difference evaluation module is used for comparing the obtained difference coefficient Cy with a standard similarity threshold R, determining whether translation is qualified or not, and generating a first correction data set when the translation is qualified; generating an error text region when the text region is unqualified;

the analysis framework modeling module is used for establishing an analysis framework model, defining an analysis framework according to the common keywords required by smart city tasks, extracting the framework structure of the text data of the first correction data set, carrying out second indentation processing to generate a first compound sentence XJ1, and extracting application words in the form Yyc/x, where Yyc represents an application word and x represents the number of times it is extracted; the application words are fused as labels at the head and tail of the first compound sentence XJ1 to form a second compound sentence XJ2;

the application word quantization and matching module is used for extracting and quantizing application words, matching keywords and adding classification labels; its sub-module matches keywords in the first compound sentence XJ1 and adds the corresponding classification labels;

the re-translation and manual auditing module is used for performing second text supplementing processing such as re-translation, manual auditing, term revising, context adjusting and the like on the error text region to generate a second improved text;

the automatic classification module is used for matching the second complex small sentence XJ2 with the keywords, adding classification labels, performing association analysis and topic extraction, and generating a first analysis result.
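The data flow between the modules can be sketched by wiring them together as plain callables. This is only an outline under simplifying assumptions: each module is reduced to a single function passed in by the caller, the function and parameter names are hypothetical, and the correction branch runs once instead of looping until Cy is at most R.

```python
def run_pipeline(texts, preprocess, translate, score, correct, analyze, classify, r):
    """Pass each raw text through the module chain and collect its labels."""
    results = []
    for raw in texts:
        tokens = preprocess(raw)            # text preprocessing module
        translation = translate(tokens)     # translation module
        cy = score(tokens, translation)     # difference coefficient Cy
        if cy > r:                          # difference evaluation module
            translation = correct(translation)  # re-translation and review
        framed = analyze(translation)       # analysis framework modeling
        results.append(classify(framed))    # automatic classification
    return results
```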

(III) Beneficial effects

The invention provides a multilingual text data analysis system and a multilingual text data analysis method. The beneficial effects are as follows:

(1) The method improves the translation accuracy by adopting an advanced multilingual mapping model and a semantic space modeling technology, and accurately evaluates the translation quality by a difference coefficient calculation mode. According to the method, when the translation is unqualified, the error position can be accurately positioned, an error text region is formed, and the repair is carried out in a targeted manner. In addition, the analysis framework model is established, the second indentation process is adopted, and classification is carried out through smart city keywords, so that the processing efficiency and the classification accuracy of the large-scale multilingual text data are improved. The technical scheme can better meet the requirements of processing the multi-language text data of the smart city, makes up the defects of the traditional text processing method in accuracy and efficiency, and provides reliable data support for the development of the smart city.

(2) Whether the first target language translation is qualified is judged by comparing the difference coefficient Cy with the standard similarity threshold R. This processing flow helps improve translation accuracy and provides a quantitative evaluation mode suited to multilingual text data processing. Calculating the difference information allows translation quality to be evaluated effectively and guides subsequent correction and improvement.

(3) Using natural language processing technology, an analysis framework comprising topics of interest, keywords and entities is defined according to the common keywords required by smart city tasks. This improves sensitivity to content specific to the smart city field, so that text data is analyzed in a more targeted and accurate way. Paragraph division and keyword extraction capture the subjects and key points of the text better and facilitate subsequent classification and topic label generation. Application words related to smart city scenes, such as intelligent transportation, internet of things, energy management, public service and smart medical treatment, are extracted from the 1-3 paragraphs of each text data item, their numbers of occurrences are quantified, and they are fused at the head and tail of the first compound sentence XJ1 to form a second compound sentence XJ2. Merging the application word information into the text sentence further strengthens the association of the text with the smart city field.

Drawings

FIG. 1 is a schematic diagram showing steps of a method for analyzing multilingual text data according to the present invention;

FIG. 2 is a block diagram of a multi-language text data analysis system according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

Referring to FIG. 1, the invention provides a multi-language text data analysis method comprising the following steps:

comparing the obtained difference coefficient Cy with a standard similarity threshold R; when Cy is less than or equal to R, the first target language translation is qualified, and the qualified text data is sorted by structure size to establish a first correction data set; when Cy is greater than R, the first target language translation is unqualified, and the regions containing translation errors are matched and located to form error text regions. This locates the specific position of each translation error and gives subsequent correction a precise target.

The error text region is acquired and second text supplementing processing is performed on it, including: re-translation, manual review, term revision, context adjustment, and fine-tuning of the multilingual mapping model; a second improved text is obtained after the second text supplementing processing; steps two to four are repeated until the difference coefficient Cy is less than or equal to the standard similarity threshold R, and the second improved text iteratively replaces the unqualified translation in the error text region and is incorporated into the first correction data set.

The re-translation re-translates the text regions marked as errors using different translation models; a plurality of machine translation engines are used so that the translation results of the different engines can be compared and the best translation selected;

the manual review has a translation expert review the error region manually, marking the specific problem type and the review and repair result of the error region during the review; combining automation with manual intervention raises the overall translation quality and makes the translation result better match the actual context and the use of technical terms.

The context adjustment analyzes the context of the error region and reconstructs it accordingly, so that it conforms to the culture and expression habits of the target-language locale. Repeatedly correcting text whose translation is unqualified brings the translation result step by step toward the standard and improves the robustness of the correction.

In this embodiment, the method improves translation accuracy by adopting a multilingual mapping model and semantic space modeling, and evaluates translation quality quantitatively through the difference coefficient. When a translation is unqualified, the error position is located, an error text region is formed, and the repair is targeted at that region. In addition, establishing the analysis framework model, applying the second indentation processing and classifying by smart city keywords improve the processing efficiency and classification accuracy for large-scale multilingual text data. The technical scheme thus better meets the requirements of processing smart city multilingual text data, makes up for the shortcomings of conventional text processing in accuracy and efficiency, and provides reliable data support for the development of smart cities.

Example 2: this example further explains Example 1. Specifically, step three includes:

The difference coefficient Cy is calculated by the following three calculation methods:

(3) From the source-language sentence vector and the translated target-language sentence vector, the difference coefficient Cy is calculated by the following formula: [formula not reproduced]

In this embodiment, whether the first target language translation is qualified can be determined according to the comparison between the difference coefficient Cy and the standard similarity threshold R. The processing flow is beneficial to improving the accuracy of translation, and simultaneously provides a quantitative evaluation mode to better meet the requirement of multilingual text data processing. By the aid of the method for calculating the difference information, translation quality can be effectively estimated, and guidance is provided for subsequent correction and improvement.

Example 3: this example further explains Example 1. Specifically, step five includes:

S53, extracting from the 1-3 paragraphs of each text data item the application words of smart city scenes, including intelligent transportation, internet of things, energy management, public service, smart medical treatment, smart security, environment detection, smart education, smart retail and smart communities, quantifying the number of occurrences of each extracted application word, and fusing the application words as labels of the form Yyc/x at the head and tail of the first compound sentence XJ1 to form a second compound sentence XJ2, so that each text data item corresponds to one text sentence carrying application word information;

S56, manually reviewing the text data that was not matched automatically and, if the text data is found to be relevant, adding the classification labels automatically. Introducing a manual review step improves the handling of special cases and ensures the accuracy and completeness of the text data.

In this embodiment, natural language processing technology is used to define an analysis framework comprising topics of interest, keywords and entities according to the common keywords required by smart city tasks. This improves sensitivity to content specific to the smart city field, so that text data is analyzed in a more targeted and accurate way. Paragraph division and keyword extraction capture the subjects and key points of the text better and facilitate subsequent classification and topic label generation. Application words related to smart city scenes, such as intelligent transportation, internet of things, energy management, public service and smart medical treatment, are extracted from the 1-3 paragraphs of each text data item, their numbers of occurrences are quantified, and they are fused at the head and tail of the first compound sentence XJ1 to form a second compound sentence XJ2. Merging the application word information into the text sentence further strengthens the association of the text with the smart city field.

Example 4: referring to FIG. 2, a multilingual text data analysis system comprises a data acquisition module, a training model module, a text preprocessing module, a translation module, a difference evaluation module, an analysis framework modeling module, an application word quantization and matching module, a re-translation and manual review module, and an automatic classification module;

The data acquisition module is used for acquiring multilingual smart city social media text data, multilingual event reports, news reports and promotional texts, and establishing a text data set. It provides the system with a source of raw data and ensures sufficient multilingual text data for subsequent analysis.

The training model module is used for collecting large-scale text corpora in different languages, extracting training word embeddings for deep training of the multi-language mapping model, and optimizing the multi-language mapping model. This improves the accuracy and adaptability of multi-language text processing, so that the system better understands the associations between different languages.

The text preprocessing module is used for performing first text processing on the text data set, including word segmentation, removal of stop words and punctuation marks, noise filtering, and stemming or lemmatization. This improves the quality of the text data and provides a better basis for subsequent translation and analysis.

The translation module is used for translating the source language text into the target language with the multilingual mapping model, and then calculating the difference information between the source language text and the target language translation result to obtain a difference coefficient Cy. This evaluates the translation quality and provides an important basis for subsequent text processing.

The analysis framework modeling module is used for establishing an analysis framework model, defining an analysis framework according to the common keywords required by smart city tasks, extracting the framework structure of the text data of the first correction data set, carrying out second indentation processing to generate a first compound sentence XJ1, and extracting application words in the form Yyc/x, where Yyc represents an application word and x represents the number of times it is extracted; the application words are fused as labels at the head and tail of the first compound sentence XJ1 to form a second compound sentence XJ2. This provides a structural understanding of smart city text and lays a foundation for subsequent classification and analysis.

The application word quantization and matching module is used for extracting and quantizing application words, matching keywords and adding classification labels; its sub-module matches keywords in the first compound sentence XJ1 and adds the corresponding classification labels. This automatically associates the text with the smart city field and realizes automatic classification of the text data.

The re-translation and manual review module is used for performing second text supplementing processing, such as re-translation, manual review, term revision and context adjustment, on the error text region to generate a second improved text. This improves the quality and accuracy of the text data and handles erroneous text effectively.

The automatic classification module is used for matching the second compound sentence XJ2 with the keywords, adding classification labels, performing association analysis and topic extraction, and generating a first analysis result. It completes the classification of the text automatically and provides an automatic analysis result for smart city topics.

In the embodiment, the system realizes the whole-flow processing of the multilingual smart city text through the synergistic effect of the modules, and effectively improves the intelligent level of text processing.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A multi-language text data analysis method is characterized in that: comprises the steps of,

2. A method of multi-lingual text data analysis according to claim 1, wherein: the third step comprises the following steps:

3. A method of multi-lingual text data analysis according to claim 2, wherein: the difference coefficient Cy is calculated by the following three calculation methods:

when the difference coefficient Cy is less than or equal to the standard similarity threshold value R, the translation of the first target language is qualified; and if the difference coefficient Cy is larger than the standard similarity threshold R, the translation of the first target language is unqualified, and the translation quality is lower as the difference is larger.

4. A method of multi-lingual text data analysis according to claim 1, wherein: when the difference coefficient Cy is larger than the standard similarity threshold value R, the first target language translation is unqualified and has abnormal difference; and the text which is not qualified in the translation of the first target language is matched and positioned according to the difference value of the difference coefficient Cy and the standard similarity threshold value R, and an error text area is formed.

5. The method for multi-lingual text data analysis of claim 4 wherein: acquiring a text error area and performing second text supplement processing, wherein the second text supplement processing comprises the following steps: re-translation, manual review, term revision, context adjustment, and fine tuning of the multilingual mapping model; acquiring a second improved text after the second text is processed; and repeating the second to fourth steps until the difference coefficient Cy is less than or equal to the standard similarity threshold R, and iteratively replacing the second improved text with the text with unqualified translation in the error text region and incorporating the text into the first correction data set.

6. The method for multi-lingual text data analysis of claim 5 wherein: re-translating text regions marked as errors, re-translating the text regions by using different translation models, and using a plurality of machine translation engines so as to compare translation results of the different engines and select the best translation;

7. A method of multi-lingual text data analysis according to claim 1, wherein: the fifth step comprises the following steps:

8. The method for multi-lingual text data analysis of claim 7 wherein: the fifth step further comprises:

9. A multi-lingual text data analysis system comprising a multi-lingual text data analysis method according to any of the claims 1-8, characterized in that: the system comprises a data acquisition module, a training model module, a text preprocessing module, a translation module, a difference evaluation module, an analysis framework modeling module, an application word quantization and matching module, a re-translation and manual auditing module and an automatic classification module;