CN106844648A - Method and system for building a low-resource language comparable corpus based on pictures - Google Patents

Method and system for building a low-resource language comparable corpus based on pictures

Info

Publication number
CN106844648A
Authority
CN
China
Prior art keywords
language
text
scarcity
resources
aboundresources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710047514.6A
Other languages
Chinese (zh)
Other versions
CN106844648B (en
Inventor
王志娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Minzu University of China
Original Assignee
Minzu University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minzu University of China
Priority to CN201710047514.6A
Publication of CN106844648A
Application granted
Publication of CN106844648B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a method and system for building a comparable corpus for a low-resource language based on pictures. The method includes: S110, downloading web pages in the low-resource language as low-resource language texts, each web page including a picture in its text; S120, searching for web pages in a high-resource language that contain a picture identical or similar to the picture in the low-resource language text, as high-resource language texts; S130, extracting features from the low-resource language and high-resource language web pages; S140, calculating, based on the features, the similarity value between the low-resource language web page and each high-resource language web page having the same or a similar picture; S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text; S160, repeating S120 to S150 until a high-resource language comparable text has been found for every low-resource language web page. The invention is not limited by the information processing technology and resources available for the low-resource language, and can build a cross-language comparable corpus for the low-resource language quickly and at low cost.

Description

Method and system for building a low-resource language comparable corpus based on pictures
Technical field
The present invention relates to the technical field of language information processing, and more particularly to a method and system for building a comparable corpus for a low-resource language based on pictures.
Background art
Cross-language corpora are an important means of conducting cross-language natural language processing research. According to the degree of mutual translation, cross-language corpora can be divided into parallel corpora and comparable corpora. A parallel corpus is a collection of text pairs, each consisting of a source-language text and its target-language translation. The bilingual texts are in a strict translation relationship and the corpus quality is high, which makes a parallel corpus a valuable resource for cross-language information processing research, but parallel corpora are difficult and costly to build. A comparable corpus is a collection of texts in different languages whose content is similar but which are not translations of each other: the words, sentences and paragraphs of texts in different languages on the same topic do not necessarily stand in a one-to-one translation relationship. Compared with parallel corpora, comparable corpora are more abundant and are an important supplement in building cross-language corpora.
As natural language processing research advances, its objects of study have expanded from high-resource languages (such as English, Chinese, Japanese and Spanish) to low-resource languages (such as Hausa, Bengali, Tibetan and Uyghur). Low-resource languages not only have few speakers, but also have few resources and high corpus acquisition costs. Building parallel corpora for low-resource languages is extremely difficult under these conditions, so comparable corpora are a valuable resource for cross-language natural language processing research on low-resource languages.
For high-resource languages, there are currently three main ways of building comparable corpora: content feature matching, cross-language information retrieval, and Wikipedia. Building comparable corpora by content features requires text feature extraction and the support of a bilingual dictionary. Because text feature extraction technology for low-resource languages is limited, and bilingual dictionaries for low-resource languages mainly cover common words and cannot meet the need to translate text features, content-feature-based methods currently cannot build large-scale, high-quality comparable corpora for low-resource languages. Building comparable corpora by cross-language information retrieval greatly speeds up the large-scale collection of comparable texts; the key issue is the selection of query words, which directly determines the degree of relatedness between source and target documents. For low-resource languages, however, some have no search engine system, and the translation quality of query words is a further important bottleneck restricting the use of this method. Wikipedia currently holds few resources in low-resource languages and its content is unevenly distributed, so it is difficult to build large-scale, high-quality comparable corpora for low-resource languages from Wikipedia.
Current methods for building comparable corpora require not only the support of technologies such as text feature extraction, keyword extraction, cross-language information retrieval and machine translation, but also resources such as dictionaries, Wikipedia, WordNet or knowledge bases. For low-resource languages, on the one hand, resources such as dictionaries, knowledge bases and Wikipedia are relatively scarce; on the other hand, information processing technologies such as keyword extraction, cross-language information retrieval and machine translation are relatively underdeveloped and insufficient to support building cross-language comparable corpora. That is, low-resource languages not only have few resources; the state of their information processing technology (such as keyword extraction, machine translation and information retrieval) also means that the methods used to build comparable corpora for high-resource languages are generally unsuitable for low-resource languages.
Summary of the invention
To overcome the deficiencies of existing information processing technology for low-resource languages, the present invention proposes a method and system for building a comparable corpus for a low-resource language based on pictures.
In one aspect, an embodiment of the present invention provides a method for building a comparable corpus for a low-resource language based on pictures, including:
S110, downloading web pages in the low-resource language as low-resource language texts, each web page including a picture in its text;
S120, searching for web pages in a high-resource language that contain a picture identical or similar to the picture in the low-resource language text, as high-resource language texts;
S130, extracting features from the low-resource language and high-resource language web pages, the features including: the picture in the text, the publication time of the text, and the numbers, times and named entities in the text;
S140, calculating, based on the features, the similarity value between the low-resource language web page and each high-resource language web page having the same or a similar picture;
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text;
S160, repeating S120 to S150 until a high-resource language comparable text has been found for every low-resource language web page that contains a picture.
Preferably, a picture search method is used to search for the high-resource language web pages that contain a picture identical or similar to the picture in the low-resource language text.
Preferably, the method further includes the following step before S140: translating the numbers, times and named entities in the text based on transliteration and simple free translation.
Preferably, calculating, based on the features, the similarity between the low-resource language web page and a high-resource language web page having the same or a similar picture specifically means calculating the similarity from the features according to a radial basis function:
$$w_{ij} = \exp\left( -\frac{\sum_{d=1}^{n} \beta_d \left( x_{id} - y_{jd} \right)^2}{\sigma^2} \right)$$
where $x_{id}$ and $y_{jd}$ are the d-th feature values of the low-resource language text i and the high-resource language text j respectively, $\beta_d$ is the weight of the d-th text similarity feature, and $\sigma$ is the width parameter of the function, which controls its radial range of influence.
It is further preferred that the weights of the text similarity features are obtained in the following manner:
During experiments, different weight values are assigned to the picture in the text, the publication time of the text, and the numbers, times and named entities in the text, according to how similar each of these features is between low-resource language web pages and high-resource language web pages.
In another aspect, an embodiment of the present invention provides a system for building a comparable corpus for a low-resource language based on pictures, including:
a download module, configured to download web pages in the low-resource language as low-resource language texts, each web page including a picture in its text;
a search module, configured to search for web pages in a high-resource language that contain a picture identical or similar to the picture in the low-resource language text, as high-resource language texts;
an extraction module, configured to extract features from the low-resource language and high-resource language web pages, the features including: the picture in the text, the publication time of the text, and the numbers, times and named entities in the text;
a calculation module, configured to calculate, based on the features, the similarity value between the low-resource language web page and each high-resource language web page having the same or a similar picture;
a selection module, configured to select the high-resource language text with the highest similarity value as the comparable text of the low-resource language text.
Preferably, the search module is specifically configured to use a picture search method to search for the high-resource language web pages that contain a picture identical or similar to the picture in the low-resource language text.
Preferably, the system further includes a translation module, configured to translate the numbers, times and named entities in the text based on transliteration and simple free translation.
Preferably, the calculation module is specifically configured to calculate, from the features and according to a radial basis function, the similarity between the low-resource language web page and a high-resource language web page having the same or a similar picture:
$$w_{ij} = \exp\left( -\frac{\sum_{d=1}^{n} \beta_d \left( x_{id} - y_{jd} \right)^2}{\sigma^2} \right)$$
where $x_{id}$ and $y_{jd}$ are the d-th feature values of the low-resource language text i and the high-resource language text j respectively, $\beta_d$ is the weight of the d-th text similarity feature, and $\sigma$ is the width parameter of the function, which controls its radial range of influence.
It is further preferred that the weights of the text similarity features are obtained in the following manner:
During experiments, different weight values are assigned to the picture in the text, the publication time of the text, and the numbers, times and named entities in the text, according to how similar each of these features is between low-resource language web pages and high-resource language web pages.
The method and system for building a low-resource language comparable corpus based on pictures provided by the embodiments of the present invention are not limited by certain information processing technologies (such as keyword extraction, machine translation and information retrieval) or resources (bilingual dictionaries, Wikipedia and the like). They can build a high-quality cross-language comparable corpus for a low-resource language quickly and at low cost, and thereby provide resources for natural language processing of the low-resource language.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention.
Fig. 1 is a schematic flowchart of the method for building a low-resource language comparable corpus based on pictures provided by an embodiment of the present invention;
Fig. 2 (a) is an example of a Chinese text on the same topic found by picture search;
Fig. 2 (b) is an example of an English text on the same topic found by picture search;
Fig. 2 (c) is an example of an Arabic text on the same topic found by picture search;
Fig. 2 (d) is an example of a Spanish text on the same topic found by picture search;
Fig. 2 (e) is an example of a traditional Mongolian text on the same topic found by picture search;
Fig. 2 (f) is an example of a Tibetan text on the same topic found by picture search;
Fig. 3 is a flowchart of a specific embodiment provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a system for building a low-resource language comparable corpus based on pictures provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings and embodiments.
Fig. 1 is a schematic flowchart of a method for building a low-resource language comparable corpus based on pictures provided by an embodiment of the present invention. As shown in Fig. 1, the method includes:
S110, downloading web pages in the low-resource language as low-resource language texts, each web page including a picture in its text.
S120, searching for web pages in a high-resource language that contain a picture identical or similar to the picture in the low-resource language text, as high-resource language texts.
Fig. 2 shows examples of texts in several languages on the same topic found by picture search: (a) is a Chinese text, (b) an English text, (c) an Arabic text, (d) a Spanish text, (e) a traditional Mongolian text, and (f) a Tibetan text. As Fig. 2 shows, a picture is not limited by language and directly reflects the topic of a text; texts in different languages that share the same or a similar picture are usually on the same topic. Web pages in other languages that are the same as or similar to the low-resource language text can therefore be collected by picture search and used as high-resource language texts.
S130, extracting features from the low-resource language and high-resource language web pages, the features including: the picture in the text, the publication time of the text, and the numbers, times and named entities in the text.
S140, calculating, based on the features, the similarity value between the low-resource language web page and each high-resource language web page having the same or a similar picture.
It should be noted that the method also includes, before S140: translating the numbers, times and named entities in the text based on transliteration and simple free translation.
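As one narrow illustration of this translation step, numbers written with script-specific digits can be mapped to a common form before features are compared. The sketch below is an assumption rather than the patent's translation procedure: it relies on Python's standard `unicodedata.decimal` to convert any Unicode decimal digit (for example Tibetan or Bengali digits) to ASCII, and the helper name `normalize_digits` is hypothetical. Transliteration of named entities and conversion between calendar systems are outside its scope.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent.

    Hypothetical helper: it handles digits only and leaves every other
    character (letters, punctuation, named entities) unchanged.
    """
    out = []
    for ch in text:
        # unicodedata.decimal returns the default (None) for non-digits.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


if __name__ == "__main__":
    # Tibetan digits for 2017, written with escapes to avoid font issues.
    print(normalize_digits("\u0f22\u0f20\u0f21\u0f27"))
```

After this normalization, a year or quantity that appears in both the low-resource and the high-resource page can be compared as the same string.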
Specifically, the similarity between the low-resource language web page and a high-resource language web page having the same or a similar picture is calculated from the features according to a radial basis function (RBF):
$$w_{ij} = \exp\left( -\frac{\sum_{d=1}^{n} \beta_d \left( x_{id} - y_{jd} \right)^2}{\sigma^2} \right)$$
where $x_{id}$ and $y_{jd}$ are the d-th feature values of the low-resource language text i and the high-resource language text j respectively, $\beta_d$ is the weight of the d-th text similarity feature, and $\sigma$ is the width parameter of the function, which controls its radial range of influence.
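To make the similarity calculation concrete, the sketch below evaluates the weighted RBF form given in claim 4, w_ij = exp(-Σ_d β_d (x_id - y_jd)² / σ²). It assumes each feature has already been encoded as a number; the patent does not specify that encoding, and the function name `rbf_similarity` is hypothetical.

```python
import math
from typing import Sequence


def rbf_similarity(x: Sequence[float], y: Sequence[float],
                   beta: Sequence[float], sigma: float) -> float:
    """Weighted RBF similarity between two feature vectors.

    x, y  : feature values of the low-resource text i and the
            high-resource text j (numeric encoding is an assumption).
    beta  : one weight per feature.
    sigma : width parameter; larger values flatten the similarity.
    """
    if not (len(x) == len(y) == len(beta)):
        raise ValueError("x, y and beta must have the same length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    weighted_sq_dist = sum(b * (xi - yi) ** 2
                           for b, xi, yi in zip(beta, x, y))
    return math.exp(-weighted_sq_dist / sigma ** 2)
```

Identical feature vectors give a similarity of 1, and the value decays towards 0 as the weighted squared distance grows relative to σ².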
The weights of the text similarity features are obtained in the following manner:
During experiments, different weight values are assigned to the picture in the text, the publication time of the text, and the numbers, times and named entities in the text, according to how similar each of these features is between low-resource language web pages and high-resource language web pages.
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text.
S160, repeating S120 to S150 until a high-resource language comparable text has been found for every low-resource language web page that contains a picture.
The process by which an embodiment of the present invention builds a low-resource language comparable corpus based on pictures is illustrated below with a specific example.
Fig. 3 is a flowchart of a specific embodiment provided by an embodiment of the present invention. As shown in Fig. 3, the process of building a low-resource language comparable corpus based on pictures is as follows:
S110, downloading web pages containing the low-resource language as low-resource language texts, each web page including a picture in its text.
Web pages containing the low-resource language are downloaded from the Internet; their total number is m.
First, it is determined whether the k-th low-resource language web page contains picture information. If it contains no picture, k is decremented (k--, that is, the next web page is web page k-1) and that page is checked for a picture; if the k-th web page contains a picture, the page is taken as a low-resource language text.
It is then determined whether all low-resource language web pages containing pictures have been processed. If they have, construction of the low-resource language comparable corpus ends; if not, comparable corpus construction is carried out for the low-resource language text i.
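The picture check in this step could be sketched as follows. This is a minimal illustration under stated assumptions: it uses Python's standard `html.parser`, treats any `<img>` tag with a `src` attribute as a picture (the source does not say how pictures are detected), and filters a list of pages instead of stepping an index k. The names `picture_urls` and `pages_with_pictures` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _ImgFinder(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)


def picture_urls(html: str) -> List[str]:
    """Return the picture URLs found in one downloaded page."""
    finder = _ImgFinder()
    finder.feed(html)
    finder.close()
    return finder.image_urls


def pages_with_pictures(pages: List[str]) -> List[str]:
    """Keep only the pages that contain at least one picture."""
    return [page for page in pages if picture_urls(page)]
```

Pages without a picture are simply dropped, which mirrors the flow above: only pages containing a picture go on to be treated as low-resource language texts.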
The low-resource language text i is processed as follows:
S120, using a picture search tool to search for web pages in a high-resource language that contain a picture identical or similar to the picture in the low-resource language text i, as high-resource language texts j.
S130, extracting features from the low-resource language text i and the high-resource language text j, the features including: the picture in the text, the publication time of the text, and the numbers, times and named entities in the text.
The picture in the text is a feature external to the text, whereas the publication time of the text and the numbers, times and named entities in the text are features internal to the text.
S140, first, translating the numbers, times and named entities in the text based on transliteration and simple free translation; then, calculating from the features, according to the RBF function, the similarity between the low-resource language web page and a high-resource language web page having the same or a similar picture:
$$w_{ij} = \exp\left( -\frac{\sum_{d=1}^{n} \beta_d \left( x_{id} - y_{jd} \right)^2}{\sigma^2} \right)$$
where $x_{id}$ and $y_{jd}$ are the d-th feature values of the low-resource language text i and the high-resource language text j respectively, $\beta_d$ is the weight of the d-th text similarity feature, and $\sigma$ is the width parameter of the function, which controls its radial range of influence.
The weights of the text similarity features are obtained in the following manner:
During experiments, different weight values are assigned to the picture in the text, the publication time of the text, and the numbers, times and named entities in the text, according to how similar each of these features is between low-resource language web pages and high-resource language web pages. For example, at the start of the experiments, the picture in the text, the publication time of the text, and the numbers, times and named entities in the text can each be given a weight of 1/5; later, the weight of each feature is adjusted according to the observed similarity until it is suitable.
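The weighting scheme in this example can be sketched as below: five features start at 1/5 each, and one weight is then changed. The rule of rescaling the remaining weights so the total stays at 1 is an assumption added for illustration; the source only says the weights are adjusted until suitable. The feature labels and function names are hypothetical.

```python
# Hypothetical labels for the five features named in the text.
FEATURES = ("picture", "publication_time", "numbers",
            "times", "named_entities")


def initial_weights(features=FEATURES):
    """Give every feature the same starting weight (1/5 for five features)."""
    share = 1.0 / len(features)
    return {name: share for name in features}


def adjust_weight(weights, name, new_value):
    """Set one weight and rescale the others so the total stays 1.

    Keeping the total at 1 is an assumption of this sketch.
    """
    if name not in weights:
        raise KeyError(name)
    if not 0.0 <= new_value <= 1.0:
        raise ValueError("new_value must be between 0 and 1")
    others_total = sum(v for k, v in weights.items() if k != name)
    remaining = 1.0 - new_value
    adjusted = {}
    for key, value in weights.items():
        if key == name:
            adjusted[key] = new_value
        elif others_total > 0:
            # Preserve the relative proportions of the other weights.
            adjusted[key] = value * remaining / others_total
        else:
            adjusted[key] = remaining / (len(weights) - 1)
    return adjusted
```

For instance, raising the picture weight from 0.2 to 0.4 leaves 0.6 to be shared equally by the other four features in this sketch.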
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text, and adding it to the comparable corpus of the low-resource language.
S160, repeating S120 to S150 until a high-resource language comparable text has been found for every low-resource language web page that contains a picture.
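The loop over S120 to S160 can be summarized in a short sketch. The picture search and the similarity calculation are passed in as callables because the source does not name a specific search tool or feature encoding; `build_comparable_corpus`, `search_by_picture` and `similarity` are hypothetical names, and skipping pages for which the search returns nothing is an assumption.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Page = TypeVar("Page")


def build_comparable_corpus(
    low_resource_pages: Iterable[Page],
    search_by_picture: Callable[[Page], List[Page]],
    similarity: Callable[[Page, Page], float],
) -> List[Tuple[Page, Page]]:
    """Pair each low-resource page with its most similar candidate."""
    corpus: List[Tuple[Page, Page]] = []
    for page in low_resource_pages:            # S160: repeat for every page
        candidates = search_by_picture(page)   # S120: pages sharing a picture
        if not candidates:
            continue                           # assumption: nothing to pair
        # S130-S150: score every candidate and keep the highest-scoring one.
        best = max(candidates, key=lambda cand: similarity(page, cand))
        corpus.append((page, best))
    return corpus
```

Each returned pair is one entry of the comparable corpus: a low-resource language text together with the high-resource language text chosen as its comparable text.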
Corresponding to the above method embodiment, an embodiment of the present invention also provides a system for building a low-resource language comparable corpus based on pictures. As shown in Fig. 4, the system 400 includes: a download module 401, a search module 402, an extraction module 403, a calculation module 404 and a selection module 405.
The download module 401 is configured to download web pages in the low-resource language as low-resource language texts, each web page including a picture in its text.
The search module 402 is configured to search for web pages in a high-resource language that contain a picture identical or similar to the picture in the low-resource language text, as high-resource language texts.
The extraction module 403 is configured to extract features from the low-resource language and high-resource language web pages, the features including: the picture in the text, the publication time of the text, and the numbers, times and named entities in the text.
The calculation module 404 is configured to calculate, based on the features, the similarity value between the low-resource language web page and each high-resource language web page having the same or a similar picture.
The selection module 405 is configured to select the high-resource language text with the highest similarity value as the comparable text of the low-resource language text.
The search module 402 is specifically configured to use a picture search method to search for the high-resource language web pages that contain a picture identical or similar to the picture in the low-resource language text.
The system also includes a translation module 406, configured to translate the numbers, times and named entities in the text based on transliteration and simple free translation.
The calculation module 404 is specifically configured to calculate, from the features and according to a radial basis function, the similarity between the low-resource language web page and a high-resource language web page having the same or a similar picture:
$$w_{ij} = \exp\left( -\frac{\sum_{d=1}^{n} \beta_d \left( x_{id} - y_{jd} \right)^2}{\sigma^2} \right)$$
where $x_{id}$ and $y_{jd}$ are the d-th feature values of the low-resource language text i and the high-resource language text j respectively, $\beta_d$ is the weight of the d-th text similarity feature, and $\sigma$ is the width parameter of the function, which controls its radial range of influence. The weights of the text similarity features are obtained in the following manner:
During experiments, different weight values are assigned to the picture in the text, the publication time of the text, and the numbers, times and named entities in the text, according to how similar each of these features is between low-resource language web pages and high-resource language web pages.
The functions performed by each component of the system for building a low-resource language comparable corpus based on pictures provided by the embodiment of the present invention have been described in detail in the method for building a low-resource language comparable corpus based on pictures provided in the above embodiment, and are not repeated here.
The system for building a low-resource language comparable corpus based on pictures provided by the embodiment of the present invention is not limited by certain information processing technologies (such as keyword extraction, machine translation and information retrieval) or resources (bilingual dictionaries, Wikipedia). It can build a high-quality cross-language comparable corpus for a low-resource language quickly and at low cost, and thereby provide resources for natural language processing of the low-resource language.
Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by a program instructing a processor. The program can be stored in a computer-readable storage medium, which is a non-transitory medium such as random access memory, read-only memory, flash memory, a hard disk, a solid-state drive, magnetic tape, a floppy disk, an optical disc, or any combination thereof. The above are only preferred specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto.

Claims (10)

1. A method for building a low-resource language comparable corpus based on pictures, characterized by comprising:
S110, downloading web pages in the low-resource language as low-resource language texts, each web page including a picture in its text;
S120, searching for web pages in a high-resource language that contain a picture identical or similar to the picture in the low-resource language text, as high-resource language texts;
S130, extracting features from the low-resource language and high-resource language web pages, the features including: the picture in the text, the publication time of the text, and the numbers, times and named entities in the text;
S140, calculating, based on the features, the similarity value between the low-resource language web page and each high-resource language web page having the same or a similar picture;
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text;
S160, repeating S120 to S150 until a high-resource language comparable text has been found for every low-resource language web page that contains a picture.
2. The method according to claim 1, characterized in that a picture search method is used to search for the high-resource language web pages that contain a picture identical or similar to the picture in the low-resource language text.
3. The method according to claim 1, characterized by further comprising the following step before S140:
translating the numbers, times and named entities in the text based on transliteration and simple free translation.
4. The method according to claim 1, characterized in that calculating, based on the features, the similarity between the low-resource language web page and a high-resource language web page having the same or a similar picture specifically means calculating the similarity from the features according to a radial basis function:
$$w_{ij} = \exp\left( -\frac{\sum_{d=1}^{n} \beta_d \left( x_{id} - y_{jd} \right)^2}{\sigma^2} \right)$$
where $x_{id}$ and $y_{jd}$ are the d-th feature values of the low-resource language text i and the high-resource language text j respectively, $\beta_d$ is the weight of the d-th text similarity feature, and $\sigma$ is the width parameter of the function, which controls its radial range of influence.
5. The method according to claim 4, characterized in that the weights of the text similarity features are obtained in the following manner:
during experiments, different weight values are assigned to the picture in the text, the publication time of the text, and the numbers, times and named entities in the text, according to how similar each of these features is between low-resource language web pages and high-resource language web pages.
6. A system for building a comparable corpus for a resource-scarce language based on pictures, characterized by comprising:
a download module for downloading web pages in the resource-scarce language as resource-scarce-language texts, the web pages including pictures in the text;
a search module for searching for web pages in the resource-rich language that contain pictures identical or similar to those of the resource-scarce-language text, as resource-rich-language texts;
an extraction module for performing feature extraction on the web pages in the resource-scarce language and the resource-rich language, the features including: the pictures, the text publication time, and the numbers, times, and named entities in the text;
a computing module for calculating, based on the features, the similarity value between resource-scarce-language and resource-rich-language web pages having identical or similar pictures;
a selection module for selecting the resource-rich-language text with the highest similarity value as the comparable text of the resource-scarce-language text.
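The way the computing and selection modules of claim 6 fit together can be sketched in Python as follows. This is a hedged illustration, not the patented implementation: the names `choose_comparable_text` and `build_corpus`, the `find_candidates` callback standing in for the search and extraction modules, and the numeric feature vectors are all hypothetical. The similarity function is passed in as a parameter because claim 6 does not fix a particular one.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Features = Sequence[float]
Similarity = Callable[[Features, Features], float]


def choose_comparable_text(scarce: Features,
                           candidates: Dict[str, Features],
                           similarity: Similarity) -> Optional[Tuple[str, float]]:
    """Selection module: return (page id, score) of the best candidate, or None."""
    best: Optional[Tuple[str, float]] = None
    for page_id, feats in candidates.items():
        score = similarity(scarce, feats)  # computing module
        if best is None or score > best[1]:
            best = (page_id, score)
    return best


def build_corpus(scarce_pages: Dict[str, Features],
                 find_candidates: Callable[[str], Dict[str, Features]],
                 similarity: Similarity) -> Dict[str, str]:
    """Pair every resource-scarce page with its most similar resource-rich page."""
    corpus: Dict[str, str] = {}
    for page_id, feats in scarce_pages.items():
        best = choose_comparable_text(feats, find_candidates(page_id), similarity)
        if best is not None:  # pages with no candidate stay unpaired
            corpus[page_id] = best[0]
    return corpus
```

A caller would supply its own `find_candidates` (for example, a wrapper around an image search) and a similarity function; pages for which no candidate is found are simply left out of the resulting mapping.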
7. The system according to claim 6, characterized in that the search module is specifically configured to apply an image search method to search for web pages in the resource-rich language that contain pictures identical or similar to those of the resource-scarce-language text.
8. The system according to claim 6, characterized in that the system further comprises:
a translation module for translating the numbers, times, and named entities in the text based on transliteration and simple free translation.
9. The system according to claim 6, characterized in that the computing module is specifically configured to calculate, from the features and according to a radial basis function, the similarity between resource-scarce-language and resource-rich-language web pages having identical or similar pictures:
$$w_{ij} = \exp\left(-\frac{\sum_{d=1}^{n} \beta_d \,(x_{id} - y_{jd})^2}{\sigma^2}\right)$$
where $x_{id}$ and $y_{jd}$ are the d-th feature values of resource-scarce-language text i and resource-rich-language text j, respectively, $\beta_d$ is the weight of the text similarity feature, and $\sigma$ is the width parameter of the function, which controls the radial range of influence of the function.
10. The system according to claim 9, characterized in that the weights of the text similarity features are obtained in the following manner:
during experiments, different weight values are assigned to the pictures, the text publication time, and the times, numbers, and named entities in the text, according to how similar each of these features is between resource-scarce-language web pages and resource-rich-language web pages.
CN201710047514.6A 2017-01-22 2017-01-22 A kind of method and system based on picture building scarcity of resources language comparable corpora Expired - Fee Related CN106844648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710047514.6A CN106844648B (en) 2017-01-22 2017-01-22 A kind of method and system based on picture building scarcity of resources language comparable corpora

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710047514.6A CN106844648B (en) 2017-01-22 2017-01-22 A kind of method and system based on picture building scarcity of resources language comparable corpora

Publications (2)

Publication Number Publication Date
CN106844648A CN106844648A (en) 2017-06-13
CN106844648B CN106844648B (en) 2019-07-26

Family

ID=59119432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710047514.6A Expired - Fee Related CN106844648B (en) 2017-01-22 2017-01-22 A kind of method and system based on picture building scarcity of resources language comparable corpora

Country Status (1)

Country Link
CN (1) CN106844648B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815390A (en) * 2018-11-08 2019-05-28 平安科技(深圳)有限公司 Search method, device, computer equipment and the computer storage medium of multilingual information
CN110147817A (en) * 2019-04-11 2019-08-20 北京搜狗科技发展有限公司 Training data set creation method and device
CN111881900A (en) * 2020-07-01 2020-11-03 腾讯科技(深圳)有限公司 Corpus generation, translation model training and translation method, apparatus, device and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473280A (en) * 2013-08-28 2013-12-25 中国科学院合肥物质科学研究院 Method and device for mining comparable network language materials
US20150278197A1 (en) * 2014-03-31 2015-10-01 Abbyy Infopoisk Llc Constructing Comparable Corpora with Universal Similarity Measure
CN106202065A (en) * 2016-06-30 2016-12-07 中央民族大学 A kind of across language topic detecting method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815390A (en) * 2018-11-08 2019-05-28 平安科技(深圳)有限公司 Search method, device, computer equipment and the computer storage medium of multilingual information
CN109815390B (en) * 2018-11-08 2023-08-08 平安科技(深圳)有限公司 Method, device, computer equipment and computer storage medium for retrieving multilingual information
CN110147817A (en) * 2019-04-11 2019-08-20 北京搜狗科技发展有限公司 Training data set creation method and device
CN110147817B (en) * 2019-04-11 2021-08-27 北京搜狗科技发展有限公司 Training data set generation method and device
CN111881900A (en) * 2020-07-01 2020-11-03 腾讯科技(深圳)有限公司 Corpus generation, translation model training and translation method, apparatus, device and medium
CN111881900B (en) * 2020-07-01 2022-08-23 腾讯科技(深圳)有限公司 Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium

Also Published As

Publication number Publication date
CN106844648B (en) 2019-07-26

Similar Documents

Publication Publication Date Title
Raulji et al. Stop-word removal algorithm and its implementation for Sanskrit language
US20040167770A1 (en) Methods and systems for language translation
US20070005649A1 (en) Contextual title extraction
JP2003532194A (en) Computer assisted reading system and method using interlanguage reading wizard
Nguyen-Hoang et al. TSGVi: a graph-based summarization system for Vietnamese documents
Vimal Kumar et al. An improvised extractive approach to hindi text summarization
CN106844648A (en) A kind of method and system that scarcity of resources language comparable corpora is built based on picture
Cai et al. Wikification via link co-occurrence
CN106294473B (en) Entity word mining method, information recommendation method and device
Batsuren et al. A large and evolving cognate database
Islam et al. Towards achieving a delicate blending between rule-based translator and neural machine translator
Yeom et al. Unsupervised-learning-based keyphrase extraction from a single document by the effective combination of the graph-based model and the modified C-value method
Awajan Semantic similarity based approach for reducing Arabic texts dimensionality
Görgün et al. A novel approach to morphological disambiguation for turkish
Wu et al. Learning multilingual topics with neural variational inference
Pęzik et al. Keyword extraction from short texts with a text-to-text transfer transformer
JPH05158401A (en) Document fast reading support/display system and document processor and document retrieving device
Rakholia et al. The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format
Mohd et al. Sumdoc: a unified approach for automatic text summarization
Jain et al. Retrieving web search results using Max–Max soft clustering for Hindi query
Atwan et al. Impact of stemmer on arabic text retrieval
Eghbalzadeh et al. Persica: A Persian corpus for multi-purpose text mining and Natural language processing
Zhou et al. Cross-lingual embeddings with auxiliary topic models
Andersson et al. Exploring patent passage retrieval using nouns phrases
Kumar et al. Design and implementation of rule-based hindi stemmer for hindi information retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190726

Termination date: 20210122