CN108197120A - A similar sentence processing system based on a bilingual teaching mode
- Publication number
- CN108197120A (application number CN201711460777.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1748—De-duplication implemented within the file system, e.g. based on file segments
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
A similar sentence processing system based on a bilingual teaching mode, comprising the following steps: connecting to a network and obtaining parallel corpus material over it; preprocessing the corpus material; randomly selecting a source text from the corpus material together with its corresponding translation; calculating a hash value with the system's computing system; and comparing that hash value with the hash values held in the corpus storage device, so that entries failing the check are discarded and entries passing it are stored in the corresponding storage unit. The method greatly reduces the storage required by the corpus storage device and improves the working efficiency and translation quality of the translation system.
Description
Technical field
The present invention relates to a similar sentence processing system based on a bilingual teaching mode.
Background
At present, one of the main ways of collecting material for a corpus is to obtain it automatically over a network, but the network contains a large number of repeated or similar sentences. A machine translation system built on a parallel corpus generally needs bilingual sentence pairs on the order of millions or more. If these redundant sentences are placed in the parallel corpus, they not only waste storage resources but also reduce the working efficiency and translation quality of the translation system. Removing the many repeated or similar sentence pairs, sentence by sentence, in the early stage of building a corpus is therefore a task of practical significance.
Summary of the invention
To address the above deficiency, the present invention provides a similar sentence processing system based on a bilingual teaching mode. The technical solution adopted is as follows.
A similar sentence processing system based on a bilingual teaching mode, comprising the following steps:
(1) Connect to a network and obtain parallel corpus material over the network;
(2) The system preprocesses the corpus material;
(3) Randomly select a source text from the corpus material and its corresponding translation;
(4) Remove special characters, operators, and digits from the text;
(5) Calculate a hash value using the system's computing system;
(6) Compare the hash value with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the corpus material is not stored; if it differs from the hash values in the corpus storage device, the corpus material is stored in the corpus storage device together with its hash value.
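Steps (4) to (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not name a hash function, so SHA-256 is used here as a stand-in; "special characters, operators, and digits" is approximated as every character that is neither a letter nor whitespace; and `CorpusStore`, `normalize`, and `sentence_hash` are hypothetical names.

```python
import hashlib
import re

# Assumption: anything that is not a letter or whitespace counts as a
# special character, operator, or digit and is dropped.
_NON_LETTER = re.compile(r"[^\w\s]|[\d_]")


def normalize(text: str) -> str:
    """Step (4): drop special characters, operators, and digits."""
    cleaned = _NON_LETTER.sub(" ", text)
    return " ".join(cleaned.split())  # collapse the leftover whitespace


def sentence_hash(text: str) -> str:
    """Step (5): hash the normalized text (SHA-256 is an assumption)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


class CorpusStore:
    """Hypothetical in-memory stand-in for the corpus storage device."""

    def __init__(self):
        self.entries = {}  # hash value -> stored sentence

    def add_if_new(self, sentence: str) -> bool:
        """Step (6): store the sentence with its hash only if the hash is new."""
        digest = sentence_hash(sentence)
        if digest in self.entries:
            return False  # identical hash value: not stored
        self.entries[digest] = sentence
        return True
```

Because the comparison is an exact match on hash values, two sentences are treated as duplicates only when they are identical after step (4); in this sketch, sentences that differ only in punctuation or digits therefore collapse to one stored entry.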
While adopting the above technical solution, the present invention may also adopt the following further technical solutions.
Step (2) removes labels from the corpus material. In some preferred modes, the labels of other systems are first removed from the corpus material obtained over the network, and then the corpus material's own labels are removed.
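The label removal in step (2) might look like the following sketch. It assumes, as the text does not say, that the labels are HTML-style markup tags left over from web pages, and it strips all tags in a single pass rather than in the two passes described above. `strip_tags` and `_TextOnly` are hypothetical names; `HTMLParser` is from the Python standard library.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(raw: str) -> str:
    """Return the text of `raw` with markup tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

Running the result through the later steps keeps the hash independent of whatever markup the source page happened to use.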
The corpus storage device comprises a storage unit and a comparator. In some preferred modes, the storage unit can be partitioned by language.
The comparator comprises an input terminal, a comparison terminal, and an output terminal; the input terminal is connected to the computing system.
The method further comprises a step of marking the corpus material in the final database: corpus material with the same meaning is marked by language, and corpus material in the same language is marked by meaning. In some preferred modes, material in the same language is assigned to that language's storage unit, and material in different languages that shares a meaning is marked and stored in parallel storage units.
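One way to picture the marking step is a set of per-language stores in which translations of the same sentence share a mark. The sketch below is an assumption about the data layout, not something the text specifies: the mark is a simple running integer, and `ParallelStores`, `add_translations`, and `translations_of` are hypothetical names.

```python
from collections import defaultdict
from itertools import count


class ParallelStores:
    """Hypothetical per-language storage units linked by a shared meaning mark."""

    def __init__(self):
        self._next_mark = count(1)
        # language code -> {meaning mark -> sentence}
        self.by_language = defaultdict(dict)

    def add_translations(self, sentences):
        """Store sentences that share one meaning; `sentences` maps language to text."""
        mark = next(self._next_mark)
        for language, sentence in sentences.items():
            self.by_language[language][mark] = sentence
        return mark

    def translations_of(self, mark):
        """Look up every stored language version of one meaning mark."""
        return {
            language: store[mark]
            for language, store in self.by_language.items()
            if mark in store
        }
```

Within one language store, different marks distinguish different meanings; across stores, an equal mark identifies sentences that are translations of each other.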
The advantage of the invention is that similar sentence pairs are not all stored in the corpus. The system reduces the redundant material in the corpus, greatly reduces the storage required by the corpus storage device, and improves the working efficiency and translation quality of the translation system.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Detailed description
The present invention is further described below with reference to the accompanying drawing.
A similar sentence processing system based on a bilingual teaching mode, comprising the following steps:
(1) Connect to a network and obtain parallel corpus material over the network;
(2) The system preprocesses the corpus material: the labels of other systems are first removed from the corpus material obtained over the network, and then the corpus material's own labels are removed;
(3) Randomly select a source text from the corpus material and its corresponding translation;
(4) Remove special characters, operators, and digits from the text;
(5) Calculate the hash value of the corpus material using the system's computing system;
(6) Compare the hash value with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the corpus material is not stored; if it differs from the hash values in the corpus storage device, the corpus material is stored in the corpus storage device together with its hash value.
The corpus storage device comprises a storage unit and a comparator. In some preferred modes, the storage unit can be partitioned by language.
The comparator comprises an input terminal, a comparison terminal, and an output terminal; the input terminal is connected to the computing system.
The method further comprises a step of marking the corpus material in the final database: corpus material with the same meaning is marked by language, and corpus material in the same language is marked by meaning. Material in the same language is assigned to that language's storage unit, and material in different languages that shares a meaning is marked and stored in parallel storage units.
Taking English as an example: English text and its Chinese translation are obtained over the network, and the original labels and marks are removed. An algorithm is used to extract the English portion of the English corpus material and the corresponding portion of the Chinese translation. Special symbols, operators, digits, and the like are removed from the English corpus material to obtain pure English data. The system's computing system calculates the hash value of this material, and the hash value is compared with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the material is not stored; if it differs from the hash values in the corpus storage device, the material is stored in the corpus storage device together with its hash value. On storage, the two items are given the same mark; the English corpus material is then stored in the corresponding English storage unit, and the corresponding Chinese data is stored in the corresponding Chinese storage unit.
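The English example can be traced end to end with a short, self-contained sketch. Several choices here are assumptions rather than details from the text: SHA-256 stands in for the unspecified hash function, the hash is computed on the cleaned English side only, and the hash value doubles as the shared mark linking the English and Chinese stores. The function names and sample sentences are made up for illustration.

```python
import hashlib
import re

# Assumption: anything that is not a letter or whitespace is removed.
_NON_LETTER = re.compile(r"[^\w\s]|[\d_]")


def english_key(english: str) -> str:
    """Hash of the English text after dropping symbols, operators, and digits."""
    pure = " ".join(_NON_LETTER.sub(" ", english).split())
    return hashlib.sha256(pure.encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Split (english, chinese) pairs into two stores that share hash keys."""
    english_store, chinese_store = {}, {}
    for english, chinese in pairs:
        key = english_key(english)
        if key in english_store:  # identical hash value: do not store again
            continue
        english_store[key] = english  # the hash value serves as the shared mark
        chinese_store[key] = chinese
    return english_store, chinese_store


pairs = [
    ("I have 2 apples.", "我有2个苹果。"),
    ("I have 3 apples!", "我有3个苹果！"),
    ("Good morning.", "早上好。"),
]
english_store, chinese_store = deduplicate(pairs)
print(len(english_store))  # 2: the second pair hashes like the first
```

The second pair is dropped because, once digits and punctuation are removed, its English side is identical to the first, so both produce the same hash value.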
Claims (5)
1. A similar sentence processing system based on a bilingual teaching mode, characterized in that it comprises the following steps:
(1) Connect to a network and obtain parallel corpus material over the network;
(2) The system preprocesses the corpus material;
(3) Randomly select a source text from the corpus material and its corresponding translation;
(4) Remove special characters, operators, and digits from the text;
(5) Calculate a hash value using the system's computing system;
(6) Compare the hash value with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the corpus material is not stored; if it differs from the hash values in the corpus storage device, the corpus material is stored in the corpus storage device together with its hash value.
2. The similar sentence processing system based on a bilingual teaching mode according to claim 1, characterized in that step (2) removes labels from the corpus material.
3. The similar sentence processing system based on a bilingual teaching mode according to claim 1, characterized in that the corpus storage device comprises a storage unit and a comparator.
4. The similar sentence processing system based on a bilingual teaching mode according to claim 3, characterized in that the comparator comprises an input terminal, a comparison terminal, and an output terminal, and the input terminal is connected to the computing system.
5. The similar sentence processing system based on a bilingual teaching mode according to claim 1, characterized in that the method further comprises a step of marking the corpus material in the final database, in which corpus material with the same meaning is marked by language and corpus material in the same language is marked by meaning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711460777.6A CN108197120A (en) | 2017-12-28 | 2017-12-28 | A kind of similar sentence machining system based on bilingual teaching mode |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108197120A (en) | 2018-06-22 |
Family
ID=62585206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711460777.6A Pending CN108197120A (en) | 2017-12-28 | 2017-12-28 | A kind of similar sentence machining system based on bilingual teaching mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108197120A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6081805A (en) * | 1997-09-10 | 2000-06-27 | Netscape Communications Corporation | Pass-through architecture via hash techniques to remove duplicate query results |
CN1916889A (en) * | 2005-08-19 | 2007-02-21 | 株式会社日立制作所 | Language material storage preparation device and its method |
US8099415B2 (en) * | 2006-09-08 | 2012-01-17 | Simply Hired, Inc. | Method and apparatus for assessing similarity between online job listings |
CN102799647A (en) * | 2012-06-30 | 2012-11-28 | 华为技术有限公司 | Method and device for webpage reduplication deletion |
US20130232160A1 (en) * | 2012-03-02 | 2013-09-05 | Semmle Limited | Finding duplicate passages of text in a collection of text |
CN103970722A (en) * | 2014-05-07 | 2014-08-06 | 江苏金智教育信息技术有限公司 | Text content duplicate removal method |
CN104978311A (en) * | 2015-07-15 | 2015-10-14 | 昆明理工大学 | Vietnamese word segmentation method based on conditional random fields |
Non-Patent Citations (1)
Title |
---|
王东波等: "英汉双语句子级平行语料库自动构建", 《现代图书情报技术》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111414648A (en) * | 2020-03-04 | 2020-07-14 | 传神语联网网络科技股份有限公司 | Corpus authentication method and apparatus |
CN111414648B (en) * | 2020-03-04 | 2023-05-12 | 传神语联网网络科技股份有限公司 | Corpus authentication method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180622 |