CN106844648A - Method and system for building a comparable corpus for a low-resource language based on pictures - Google Patents
- Publication number
- CN106844648A (application CN201710047514.6A)
- Authority
- CN
- China
- Prior art keywords
- language
- text
- scarcity
- resources
- aboundresources
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/5846—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a method and system for building a comparable corpus for a low-resource language based on pictures. The method includes: S110, downloading web pages in the low-resource language as low-resource language texts, the web pages including pictures in the text; S120, searching for web pages in a high-resource language that contain the same or similar pictures as a low-resource language text, as high-resource language texts; S130, extracting features from the low-resource language and high-resource language web pages; S140, calculating, based on the features, the similarity value between the low-resource language and high-resource language web pages that have the same or similar pictures; S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text; S160, repeating S120 to S150 until a comparable high-resource language text has been found for every low-resource language web page. The invention is not limited by the information processing technology and resources available for the low-resource language, and can build a cross-language comparable corpus for a low-resource language quickly and at low cost.
Description
Technical field
The present invention relates to the technical field of language information processing, and in particular to a method and system for building a comparable corpus for a low-resource language based on pictures.
Background
Cross-language corpora are an important resource for cross-language natural language processing research. According to the degree of mutual translation, cross-language corpora can be divided into parallel corpora and comparable corpora. A parallel corpus is a collection of text pairs, each made up of a source-language text and its target-language translation; the two texts in a pair are strict translations of each other. Parallel corpora are of high quality and are a valuable resource for cross-language information processing research, but they are difficult and costly to build. A comparable corpus is a collection of texts in different languages whose content is similar but which are not translations of each other: the texts deal with the same topic, but their words, sentences, and paragraphs do not necessarily stand in a one-to-one translation relationship. Compared with parallel corpora, comparable corpora are more plentiful and are an important supplement when building cross-language corpora.
As natural language processing research advances, its objects of study have expanded from high-resource languages (such as English, Chinese, Japanese, and Spanish) to low-resource languages (such as Hausa, Bengali, Tibetan, and Uyghur). Low-resource languages not only have few speakers but also few resources, and obtaining corpus material for them is costly. Building a parallel corpus for a low-resource language is therefore extremely difficult, which makes comparable corpora a valuable resource for cross-language natural language processing research on low-resource languages.
For high-resource languages, there are currently three main methods of building comparable corpora: content feature matching, cross-language information retrieval, and Wikipedia. Methods based on content features require text feature extraction and the support of a bilingual dictionary. Because text feature extraction technology for low-resource languages is limited, and bilingual dictionaries for low-resource languages mainly cover common words and cannot meet the demands of translating text features, content-feature methods currently cannot build large-scale, high-quality comparable corpora for low-resource languages. Building comparable corpora through cross-language information retrieval greatly speeds up the large-scale collection of comparable texts; the key issue is the choice of query words, which directly determines how closely the source and target documents are related. For low-resource languages, however, some languages have no search engine system, and the translation quality of the query words is a further important bottleneck when this method is used to build comparable corpora. Wikipedia currently contains few resources in low-resource languages, and their content is unevenly distributed, so it is difficult to build large-scale, high-quality comparable corpora for low-resource languages from Wikipedia.
Current methods of building comparable corpora need not only technologies such as text feature extraction, keyword extraction, cross-language information retrieval, and machine translation, but also resources such as dictionaries, Wikipedia, WordNet, or knowledge bases. For low-resource languages, on the one hand, resources such as dictionaries, knowledge bases, and Wikipedia are scarce; on the other hand, information processing technologies for these languages, such as keyword extraction, cross-language information retrieval, and machine translation, lag behind and are insufficient to support building cross-language comparable corpora. In short, low-resource languages lack both resources and mature information processing technology (such as keyword extraction, machine translation, and information retrieval), so the methods used to build comparable corpora for high-resource languages are generally unsuitable for low-resource languages.
Summary of the invention
To overcome the shortcomings of existing information processing technology for low-resource languages, the present invention proposes a method and system for building a comparable corpus for a low-resource language based on pictures.
In one aspect, an embodiment of the present invention provides a method for building a comparable corpus for a low-resource language based on pictures, including:
S110, downloading web pages in the low-resource language as low-resource language texts, the web pages including pictures in the text;
S120, searching for web pages in a high-resource language that contain the same or similar pictures as the low-resource language text, as high-resource language texts;
S130, extracting features from the low-resource language and high-resource language web pages, the features including: the pictures in the text, the publication time of the text, and the numbers, times, and named entities in the text;
S140, calculating, based on the features, the similarity value between the low-resource language and high-resource language web pages that have the same or similar pictures;
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text;
S160, repeating S120 to S150 until a comparable high-resource language text has been found for every low-resource language web page that contains pictures.
Preferably, an image search method is used to search for high-resource language web pages that contain the same or similar pictures as the low-resource language text.
Preferably, the method further includes the following step before S140: translating the numbers, times, and named entities in the text based on transliteration and simple free translation.
Preferably, calculating, based on the features, the similarity between the low-resource language and high-resource language web pages that have the same or similar pictures specifically means calculating that similarity from the features according to a radial basis function, in which x_id and y_jd are the d-th feature values of low-resource language text i and high-resource language text j respectively, β_d is the weight of the d-th text similarity feature, and σ is the width parameter of the function, which controls its radial range of influence.
It is further preferred that the weights of the text similarity features are obtained as follows:
During experiments, the pictures in the text, the publication time of the text, and the times, numbers, and named entities in the text are each assigned a different weight value according to how similar they are between the low-resource language and high-resource language web pages.
In another aspect, an embodiment of the present invention provides a system for building a comparable corpus for a low-resource language based on pictures, including:
a download module, for downloading web pages in the low-resource language as low-resource language texts, the web pages including pictures in the text;
a search module, for searching for web pages in a high-resource language that contain the same or similar pictures as the low-resource language text, as high-resource language texts;
an extraction module, for extracting features from the low-resource language and high-resource language web pages, the features including: the pictures in the text, the publication time of the text, and the numbers, times, and named entities in the text;
a computing module, for calculating, based on the features, the similarity value between the low-resource language and high-resource language web pages that have the same or similar pictures;
a selection module, for selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text.
Preferably, the search module is specifically used to apply an image search method to search for high-resource language web pages that contain the same or similar pictures as the low-resource language text.
Preferably, the system further includes: a translation module, for translating the numbers, times, and named entities in the text based on transliteration and simple free translation.
Preferably, the computing module is specifically used to calculate, from the features and according to a radial basis function, the similarity between the low-resource language and high-resource language web pages that have the same or similar pictures, where x_id and y_jd are the d-th feature values of low-resource language text i and high-resource language text j respectively, β_d is the weight of the d-th text similarity feature, and σ is the width parameter of the function, which controls its radial range of influence.
It is further preferred that the weights of the text similarity features are obtained as follows:
During experiments, the pictures in the text, the publication time of the text, and the times, numbers, and named entities in the text are each assigned a different weight value according to how similar they are between the low-resource language and high-resource language web pages.
The method and system for building a comparable corpus for a low-resource language based on pictures provided by the embodiments of the present invention are not limited by certain information processing technologies (such as keyword extraction, machine translation, and information retrieval) or resources (bilingual dictionaries, Wikipedia, and so on). They can build a high-quality cross-language comparable corpus for a low-resource language quickly and at low cost, and thereby provide resources for natural language processing of low-resource languages.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention.
Fig. 1 is a schematic flowchart of the method for building a comparable corpus for a low-resource language based on pictures provided by an embodiment of the present invention;
Fig. 2(a) is an example Chinese text on the same topic, found by picture search;
Fig. 2(b) is an example English text on the same topic, found by picture search;
Fig. 2(c) is an example Arabic text on the same topic, found by picture search;
Fig. 2(d) is an example Spanish text on the same topic, found by picture search;
Fig. 2(e) is an example traditional Mongolian text on the same topic, found by picture search;
Fig. 2(f) is an example Tibetan text on the same topic, found by picture search;
Fig. 3 is a flowchart of a specific embodiment provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a system for building a comparable corpus for a low-resource language based on pictures provided by an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings and embodiments.
Fig. 1 is a schematic flowchart of a method for building a comparable corpus for a low-resource language based on pictures, provided by an embodiment of the present invention. As shown in Fig. 1, the method includes:
S110, downloading web pages in the low-resource language as low-resource language texts, the web pages including pictures in the text.
S120, searching for web pages in a high-resource language that contain the same or similar pictures as the low-resource language text, as high-resource language texts.
Fig. 2 shows example texts in several languages on the same topic, found by picture search: (a) is a Chinese text, (b) an English text, (c) an Arabic text, (d) a Spanish text, (e) a traditional Mongolian text, and (f) a Tibetan text. As Fig. 2 shows, pictures are not constrained by language and directly reflect the topic of a text; texts in different languages that contain the same or similar pictures are usually on the same topic. Picture search can therefore be used to collect web pages in other languages that are the same as or similar to the low-resource language text, as high-resource language texts.
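The text relies on an existing picture search tool and does not say how "same or similar" pictures are detected. Purely to illustrate the idea, the sketch below uses an average hash, a technique that is not taken from the source. Pictures are assumed to be already decoded and downscaled to small grayscale grids, and the names `average_hash`, `hamming_distance`, `looks_similar`, and the threshold `max_distance` are hypothetical.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # small grayscale picture, values 0-255 (assumed input)


def average_hash(pixels: Grid) -> List[int]:
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def looks_similar(p1: Grid, p2: Grid, max_distance: int = 1) -> bool:
    """Treat two pictures as similar when their hashes differ in few bits.

    max_distance is an illustrative threshold, not a value from the source.
    """
    return hamming_distance(average_hash(p1), average_hash(p2)) <= max_distance
```

A uniformly brighter copy of a picture keeps the same hash, while a picture with the light and dark regions swapped does not; a real system would more likely delegate this step to a reverse image search service, as the text suggests.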
S130, extracting features from the low-resource language and high-resource language web pages, the features including: the pictures in the text, the publication time of the text, and the numbers, times, and named entities in the text.
S140, calculating, based on the features, the similarity value between the low-resource language and high-resource language web pages that have the same or similar pictures.
It should be noted that, before S140, the method also includes: translating the numbers, times, and named entities in the text based on transliteration and simple free translation.
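The text does not specify how numbers are translated. One small piece of that step can be sketched under the assumption that numbers in the low-resource text are written with the script's own decimal digits (for example Tibetan or Bengali digits): they can be mapped to ASCII digits so that they compare equal to numbers in the high-resource text. The function name `normalize_digits` is hypothetical; `unicodedata.decimal` is part of the Python standard library.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent.

    Characters that are not decimal digits are copied unchanged.
    """
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        out.append(str(value) if value is not None else ch)
    return "".join(out)


# Tibetan digits for 2017, written with escapes to keep the source ASCII-only.
tibetan_year = "\u0f22\u0f20\u0f21\u0f27"
normalized = normalize_digits(tibetan_year)
```

Times and named entities need language-specific handling (transliteration tables, calendar conversion) that is outside the scope of this sketch.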
Specifically, the similarity between the low-resource language and high-resource language web pages that have the same or similar pictures is calculated from the features according to a radial basis function (RBF), in which x_id and y_jd are the d-th feature values of low-resource language text i and high-resource language text j respectively, β_d is the weight of the d-th text similarity feature, and σ is the width parameter of the function, which controls its radial range of influence.
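The formula itself is not reproduced in this text. A minimal sketch of one plausible form that is consistent with the symbol definitions is a weighted Gaussian kernel, sim(i, j) = exp(-Σ_d β_d (x_id - y_jd)² / (2σ²)). This exact form is an assumption and may differ from the patent's formula; the function name `rbf_similarity` is hypothetical.

```python
import math
from typing import Sequence


def rbf_similarity(x: Sequence[float], y: Sequence[float],
                   beta: Sequence[float], sigma: float = 1.0) -> float:
    """Weighted RBF similarity between two feature vectors (assumed form).

    x, y  : feature values x_id and y_jd of the two texts
    beta  : per-feature weights beta_d
    sigma : width parameter controlling how fast similarity decays
    """
    if not (len(x) == len(y) == len(beta)):
        raise ValueError("x, y and beta must have the same length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    weighted_sq_dist = sum(b * (xi - yi) ** 2 for xi, yi, b in zip(x, y, beta))
    return math.exp(-weighted_sq_dist / (2.0 * sigma ** 2))
```

Under this form, identical feature vectors give a similarity of 1, and the value decays towards 0 as the weighted distance grows; a larger σ makes the decay slower.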
The weights of the text similarity features are obtained as follows:
During experiments, the pictures in the text, the publication time of the text, and the times, numbers, and named entities in the text are each assigned a different weight value according to how similar they are between the low-resource language and high-resource language web pages.
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text.
S160, repeating S120 to S150 until a comparable high-resource language text has been found for every low-resource language web page that contains pictures.
The process by which an embodiment of the present invention builds a comparable corpus for a low-resource language based on pictures is illustrated below with a specific example.
Fig. 3 is a flowchart of a specific embodiment provided by an embodiment of the present invention. As shown in Fig. 3, the process of building a comparable corpus for a low-resource language based on pictures is as follows:
S110, downloading web pages containing the low-resource language as low-resource language texts, the web pages including pictures in the text.
Web pages containing the low-resource language are downloaded from the Internet; their total number is m.
First, it is determined whether the k-th low-resource language web page contains picture information. If it does not, the next web page (the (k+1)-th) is checked; if it does, the web page is taken as a low-resource language text.
It is then determined whether all low-resource language web pages that contain pictures have been processed. If they have, construction of the comparable corpus for the low-resource language ends; if not, the comparable corpus is built for low-resource language text i.
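The picture check described above can be sketched with Python's standard `html.parser` module. The sketch assumes that a "picture" means an `<img>` element with a `src` attribute and that pages are available as HTML strings; the names `extract_image_urls` and `pages_with_pictures` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class _ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)


def extract_image_urls(html: str) -> List[str]:
    """Return the picture URLs found in one web page."""
    parser = _ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.image_urls


def pages_with_pictures(pages: Iterable[str]) -> List[str]:
    """Keep only the pages that contain at least one picture."""
    return [html for html in pages if extract_image_urls(html)]
```

Pages without pictures are simply skipped, which mirrors moving on to the next web page in the loop described above.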
Low-resource language text i is processed as follows:
S120, using a picture search tool to search for web pages in a high-resource language that contain the same or similar pictures as low-resource language text i, as high-resource language texts j.
S130, extracting features from low-resource language text i and high-resource language text j, the features including: the pictures in the text, the publication time of the text, and the numbers, times, and named entities in the text.
Here the pictures in the text are features outside the text, while the publication time and the numbers, times, and named entities in the text are features within the text.
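A minimal sketch of extracting two of the in-text features, numbers and dates, with regular expressions. It assumes dates are written as year, month, day separated by `-`, `/`, or `.`; other date formats and named entities would need language-specific tools. The Jaccard overlap used to compare the extracted sets is an illustrative choice, not something the source prescribes, and all function names are hypothetical.

```python
import re
from typing import Set

# Assumed date format: YYYY-MM-DD, YYYY/MM/DD or YYYY.MM.DD
DATE_RE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_dates(text: str) -> Set[str]:
    """Return the dates found in the text, normalized to YYYY-MM-DD."""
    return {f"{int(y):04d}-{int(m):02d}-{int(d):02d}"
            for y, m, d in DATE_RE.findall(text)}


def extract_numbers(text: str) -> Set[str]:
    """Return the numbers found in the text, excluding parts of dates."""
    without_dates = DATE_RE.sub(" ", text)
    return set(NUMBER_RE.findall(without_dates))


def overlap(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two feature sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, the numbers extracted from a low-resource page and from a candidate high-resource page can be compared with `overlap` to obtain one per-feature score.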
S140, first, the numbers, times, and named entities in the text are translated based on transliteration and simple free translation; then, the similarity between the low-resource language and high-resource language web pages that have the same or similar pictures is calculated from the features according to the RBF function, in which x_id and y_jd are the d-th feature values of low-resource language text i and high-resource language text j respectively, β_d is the weight of the d-th text similarity feature, and σ is the width parameter of the function, which controls its radial range of influence.
The weights of the text similarity features are obtained as follows:
During experiments, the pictures in the text, the publication time of the text, and the times, numbers, and named entities in the text are each assigned a different weight value according to how similar they are between the low-resource language and high-resource language web pages. For example, at the start of the experiments, the pictures in the text, the publication time of the text, and the times, numbers, and named entities in the text can each be given a weight of 1/5; later, the weight of each feature is adjusted according to the observed similarity until suitable values are reached.
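The weighting scheme can be sketched as follows: each of the five features starts at 1/5, and weights are later overridden by hand. Renormalizing the weights so that they still sum to 1 after an adjustment is an assumption of this sketch; the source only says the weights are adjusted until suitable. The feature names and function names are hypothetical.

```python
from typing import Dict

FEATURES = ("picture", "publication_time", "numbers", "times", "named_entities")


def initial_weights() -> Dict[str, float]:
    """Give every feature the same starting weight (1/5 each)."""
    return {name: 1.0 / len(FEATURES) for name in FEATURES}


def adjust_weights(weights: Dict[str, float],
                   updates: Dict[str, float]) -> Dict[str, float]:
    """Override some weights, then rescale so the total is 1 again."""
    unknown = set(updates) - set(weights)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    merged = {**weights, **updates}
    if any(value < 0 for value in merged.values()):
        raise ValueError("weights must be non-negative")
    total = sum(merged.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in merged.items()}
```

Raising the raw picture weight to 0.4 while the other four stay at 0.2 gives the picture feature one third of the total after rescaling.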
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text, and putting it into the comparable corpus of the low-resource language.
S160, repeating S120 to S150 until a comparable high-resource language text has been found for every low-resource language web page that contains pictures.
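The loop over S120 to S150 can be sketched as follows. The picture search and the similarity calculation are passed in as functions, since the source leaves their implementation to external tools; `search_by_picture`, `similarity`, and `build_comparable_corpus` are hypothetical names, and skipping pages for which the search returns no candidates is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, List


def build_comparable_corpus(
    low_resource_pages: Iterable[str],
    search_by_picture: Callable[[str], List[str]],
    similarity: Callable[[str, str], float],
) -> Dict[str, str]:
    """Map each low-resource text to its most similar high-resource text."""
    corpus: Dict[str, str] = {}
    for page in low_resource_pages:
        # S120: candidate high-resource pages sharing a picture with this page
        candidates = search_by_picture(page)
        if not candidates:
            continue  # assumed behaviour: no candidate, no corpus entry
        # S130-S150: score every candidate and keep the highest-scoring one
        best = max(candidates, key=lambda cand: similarity(page, cand))
        corpus[page] = best
    return corpus
```

In a full system, `similarity` would combine feature extraction, translation of numbers, times, and named entities, and the weighted RBF calculation described above.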
Corresponding to the above method embodiment, an embodiment of the present invention also provides a system for building a comparable corpus for a low-resource language based on pictures. As shown in Fig. 4, the system 400 includes: a download module 401, a search module 402, an extraction module 403, a computing module 404, and a selection module 405.
The download module 401 is used to download web pages in the low-resource language as low-resource language texts, the web pages including pictures in the text.
The search module 402 is used to search for web pages in a high-resource language that contain the same or similar pictures as the low-resource language text, as high-resource language texts.
The extraction module 403 is used to extract features from the low-resource language and high-resource language web pages, the features including: the pictures in the text, the publication time of the text, and the numbers, times, and named entities in the text.
The computing module 404 is used to calculate, based on the features, the similarity value between the low-resource language and high-resource language web pages that have the same or similar pictures.
The selection module 405 is used to select the high-resource language text with the highest similarity value as the comparable text of the low-resource language text.
The search module 402 is specifically used to apply an image search method to search for high-resource language web pages that contain the same or similar pictures as the low-resource language text.
The system also includes: a translation module 406, for translating the numbers, times, and named entities in the text based on transliteration and simple free translation.
The computing module 404 is specifically used to calculate, from the features and according to a radial basis function, the similarity between the low-resource language and high-resource language web pages that have the same or similar pictures, where x_id and y_jd are the d-th feature values of low-resource language text i and high-resource language text j respectively, β_d is the weight of the d-th text similarity feature, and σ is the width parameter of the function, which controls its radial range of influence. The weights of the text similarity features are obtained as follows:
During experiments, the pictures in the text, the publication time of the text, and the times, numbers, and named entities in the text are each assigned a different weight value according to how similar they are between the low-resource language and high-resource language web pages.
The functions performed by each component of the above system for building a comparable corpus for a low-resource language based on pictures have been described in detail in the method for building a comparable corpus for a low-resource language based on pictures provided in the above embodiment, and are not repeated here.
The system for building a comparable corpus for a low-resource language based on pictures provided by the embodiment of the present invention is not limited by certain information processing technologies (such as keyword extraction, machine translation, and information retrieval) or resources (bilingual dictionaries, Wikipedia). It can build a high-quality cross-language comparable corpus for a low-resource language quickly and at low cost, and thereby provide resources for natural language processing of low-resource languages.
Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by a program instructing a processor. The program can be stored in a computer-readable storage medium, which is a non-transitory medium such as random access memory, read-only memory, flash memory, a hard disk, a solid-state drive, magnetic tape, a floppy disk, an optical disc, or any combination thereof.
The above are only preferred specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto.
Claims (10)
1. A method for building a comparable corpus for a low-resource language based on pictures, characterized by including:
S110, downloading web pages in the low-resource language as low-resource language texts, the web pages including pictures in the text;
S120, searching for web pages in a high-resource language that contain the same or similar pictures as the low-resource language text, as high-resource language texts;
S130, extracting features from the low-resource language and high-resource language web pages, the features including: the pictures in the text, the publication time of the text, and the numbers, times, and named entities in the text;
S140, calculating, based on the features, the similarity value between the low-resource language and high-resource language web pages that have the same or similar pictures;
S150, selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text;
S160, repeating S120 to S150 until a comparable high-resource language text has been found for every low-resource language web page that contains pictures.
2. The method according to claim 1, characterized in that an image search method is applied to search for high-resource language web pages that contain the same or similar pictures as the low-resource language text.
3. The method according to claim 1, characterized by further including the following step before S140:
translating the numbers, times, and named entities in the text based on transliteration and simple free translation.
4. The method according to claim 1, characterized in that calculating, based on the features, the similarity between the low-resource language and high-resource language web pages that have the same or similar pictures specifically means calculating that similarity from the features according to a radial basis function, in which x_id and y_jd are the d-th feature values of low-resource language text i and high-resource language text j respectively, β_d is the weight of the d-th text similarity feature, and σ is the width parameter of the function, which controls its radial range of influence.
5. The method according to claim 4, characterized in that the weights of the text similarity features are obtained as follows:
during experiments, the pictures in the text, the publication time of the text, and the times, numbers, and named entities in the text are each assigned a different weight value according to how similar they are between the low-resource language and high-resource language web pages.
6. A system for building a comparable corpus for a low-resource language based on pictures, characterized by including:
a download module, for downloading web pages in the low-resource language as low-resource language texts, the web pages including pictures in the text;
a search module, for searching for web pages in a high-resource language that contain the same or similar pictures as the low-resource language text, as high-resource language texts;
an extraction module, for extracting features from the low-resource language and high-resource language web pages, the features including: the pictures in the text, the publication time of the text, and the numbers, times, and named entities in the text;
a computing module, for calculating, based on the features, the similarity value between the low-resource language and high-resource language web pages that have the same or similar pictures;
a selection module, for selecting the high-resource language text with the highest similarity value as the comparable text of the low-resource language text.
7. The system according to claim 6, characterized in that the search module is specifically used to apply a picture search method to search for high-resource language web pages that contain the same or similar pictures as the low-resource language text.
8. The system according to claim 6, characterized in that the system further includes:
a translation module, for translating the numbers, times, and named entities in the text based on transliteration and simple free translation.
9. The system according to claim 6, characterized in that the computing module is specifically configured to calculate, from the
features and using a radial basis function (RBF), the similarity between resource-scarce-language and resource-rich-language web pages that have identical or similar pictures:
where x_id and y_jd are the d-th feature values of resource-scarce-language text i and resource-rich-language text j, respectively; β_d is the
weight of the text similarity feature; and σ is the width parameter of the function, which controls its radial range of influence.
10. The system according to claim 9, characterized in that the weights of the text similarity features are obtained
as follows:
during experiments, different weight values are assigned to the picture, the text publication time, and the times,
numbers, and named entities in the text, according to how similar each feature is between the
resource-scarce-language web pages and the resource-rich-language web pages.
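The RBF formula referred to in the claims is not reproduced in this text. As a rough illustration only, the sketch below assumes a common weighted Gaussian form, exp(-Σ_d β_d (x_id - y_jd)² / (2σ²)); the patent's exact expression may differ. The function names, the ordering of the features, and all numeric values are hypothetical and chosen purely for the example.

```python
import math
from typing import Sequence, Tuple


def rbf_similarity(x: Sequence[float], y: Sequence[float],
                   beta: Sequence[float], sigma: float) -> float:
    """Weighted Gaussian RBF similarity between two feature vectors.

    Assumed form (an assumption, not taken from the patent text):
        exp(-sum_d beta[d] * (x[d] - y[d])**2 / (2 * sigma**2))
    """
    if not (len(x) == len(y) == len(beta)):
        raise ValueError("x, y and beta must have the same length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Weighted squared distance over the D features.
    dist = sum(b * (xd - yd) ** 2 for b, xd, yd in zip(beta, x, y))
    return math.exp(-dist / (2.0 * sigma ** 2))


def select_comparable_text(source: Sequence[float],
                           candidates: Sequence[Sequence[float]],
                           beta: Sequence[float],
                           sigma: float) -> Tuple[int, float]:
    """Return (index, score) of the candidate most similar to `source`.

    Mirrors the selection step: among the resource-rich-language pages
    found for one resource-scarce-language page, keep the highest scorer.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = [rbf_similarity(source, c, beta, sigma) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical feature order: picture, publication time, times in text,
# numbers in text, named entities in text. Values are made up.
source_page = [1.0, 0.8, 0.5, 0.6, 0.7]
candidate_pages = [
    [0.9, 0.2, 0.1, 0.3, 0.2],
    [1.0, 0.7, 0.5, 0.5, 0.6],
]
weights = [0.3, 0.1, 0.2, 0.2, 0.2]  # illustrative weights, one per feature

index, score = select_comparable_text(source_page, candidate_pages, weights, sigma=1.0)
print(index, round(score, 4))
```

With this form, identical feature vectors score exactly 1, and a smaller σ makes the score fall off faster as the feature values diverge, which is one way to read the "radial range of influence" wording in the claims.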
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710047514.6A CN106844648B (en) | 2017-01-22 | 2017-01-22 | Method and system for building a comparable corpus for a resource-scarce language based on pictures |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710047514.6A CN106844648B (en) | 2017-01-22 | 2017-01-22 | Method and system for building a comparable corpus for a resource-scarce language based on pictures |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106844648A (en) | 2017-06-13 |
CN106844648B (en) | 2019-07-26 |
Family
ID=59119432
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710047514.6A (Expired - Fee Related) CN106844648B (en) | 2017-01-22 | 2017-01-22 | Method and system for building a comparable corpus for a resource-scarce language based on pictures |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106844648B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103473280A (en) * | 2013-08-28 | 2013-12-25 | 中国科学院合肥物质科学研究院 | Method and device for mining comparable network language materials |
US20150278197A1 (en) * | 2014-03-31 | 2015-10-01 | Abbyy Infopoisk Llc | Constructing Comparable Corpora with Universal Similarity Measure |
CN106202065A (en) * | 2016-06-30 | 2016-12-07 | 中央民族大学 | A kind of across language topic detecting method and system |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815390A (en) * | 2018-11-08 | 2019-05-28 | 平安科技(深圳)有限公司 | Search method, device, computer equipment and the computer storage medium of multilingual information |
CN109815390B (en) * | 2018-11-08 | 2023-08-08 | 平安科技(深圳)有限公司 | Method, device, computer equipment and computer storage medium for retrieving multilingual information |
CN110147817A (en) * | 2019-04-11 | 2019-08-20 | 北京搜狗科技发展有限公司 | Training data set creation method and device |
CN110147817B (en) * | 2019-04-11 | 2021-08-27 | 北京搜狗科技发展有限公司 | Training data set generation method and device |
CN111881900A (en) * | 2020-07-01 | 2020-11-03 | 腾讯科技(深圳)有限公司 | Corpus generation, translation model training and translation method, apparatus, device and medium |
CN111881900B (en) * | 2020-07-01 | 2022-08-23 | 腾讯科技(深圳)有限公司 | Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium |
Also Published As
Publication number | Publication date |
---|---|
CN106844648B (en) | 2019-07-26 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Raulji et al. | Stop-word removal algorithm and its implementation for Sanskrit language | |
US20040167770A1 (en) | Methods and systems for language translation | |
US20070005649A1 (en) | Contextual title extraction | |
JP2003532194A (en) | Computer assisted reading system and method using interlanguage reading wizard | |
Nguyen-Hoang et al. | TSGVi: a graph-based summarization system for Vietnamese documents | |
Vimal Kumar et al. | An improvised extractive approach to hindi text summarization | |
CN106844648A (en) | A kind of method and system that scarcity of resources language comparable corpora is built based on picture | |
Cai et al. | Wikification via link co-occurrence | |
CN106294473B (en) | Entity word mining method, information recommendation method and device | |
Batsuren et al. | A large and evolving cognate database | |
Islam et al. | Towards achieving a delicate blending between rule-based translator and neural machine translator | |
Yeom et al. | Unsupervised-learning-based keyphrase extraction from a single document by the effective combination of the graph-based model and the modified C-value method | |
Awajan | Semantic similarity based approach for reducing Arabic texts dimensionality | |
Görgün et al. | A novel approach to morphological disambiguation for turkish | |
Wu et al. | Learning multilingual topics with neural variational inference | |
Pęzik et al. | Keyword extraction from short texts with a text-to-text transfer transformer | |
JPH05158401A (en) | Document fast reading support/display system and document processor and document retrieving device | |
Rakholia et al. | The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format | |
Mohd et al. | Sumdoc: a unified approach for automatic text summarization | |
Jain et al. | Retrieving web search results using Max–Max soft clustering for Hindi query | |
Atwan et al. | Impact of stemmer on arabic text retrieval | |
Eghbalzadeh et al. | Persica: A Persian corpus for multi-purpose text mining and Natural language processing | |
Zhou et al. | Cross-lingual embeddings with auxiliary topic models | |
Andersson et al. | Exploring patent passage retrieval using nouns phrases | |
Kumar et al. | Design and implementation of rule-based hindi stemmer for hindi information retrieval |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190726; Termination date: 20210122 |