CN108197120A - A kind of similar sentence machining system based on bilingual teaching mode - Google Patents

Info

Publication number
CN108197120A
Authority
CN
China
Prior art keywords
language material
hash value
teaching mode
similar sentence
system based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711460777.6A
Other languages
Chinese (zh)
Inventor
张宏磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Qingdao Co Ltd
Original Assignee
Global Tone Communication Technology Qingdao Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-12-28
Filing date
2017-12-28
Publication date
2018-06-22

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A similar sentence processing system based on bilingual parallel corpora, comprising the following steps: connecting to a network and obtaining a parallel corpus through the network; preprocessing the corpus; randomly selecting a source text from the corpus together with its corresponding translation; calculating a hash value with the system's computing system; and comparing that hash value with the hash values in the corpus storage device to screen the corpus, discarding entries that do not meet the requirement and storing those that do in the corresponding storage unit. The method of the invention greatly reduces the storage resources used by the corpus storage device and improves the working efficiency and translation quality of the translation system.

Description

A similar sentence processing system based on bilingual parallel corpora
Technical field
The present invention relates to a similar sentence processing system based on bilingual parallel corpora.
Background
At present, one important way of collecting material for a corpus is to obtain it automatically from the network, but the network contains a large number of repeated or similar sentences. A machine translation system based on a parallel corpus generally requires bilingual sentence pairs on the order of a million or more. If these redundant sentences are put into the parallel corpus, they not only waste storage resources but also affect the working efficiency and translation quality of the translation system. Therefore, in the preparatory work of building a corpus, removing the large number of repeated or similar sentence pairs at the sentence level is a task of practical significance.
Summary of the invention
To address the above deficiencies, the present invention provides a similar sentence processing system based on bilingual parallel corpora. The technical solution adopted is as follows:
A similar sentence processing system based on bilingual parallel corpora, comprising the following steps:
(1) Connect to a network and obtain a parallel corpus through the network;
(2) The system preprocesses the corpus;
(3) Randomly select a source text from the corpus together with its corresponding translation;
(4) Remove special characters, operators and digits from the text;
(5) Calculate a hash value with the system's computing system;
(6) Compare the hash value with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the corpus entry is not stored; if it differs from the hash values in the corpus storage device, the corpus entry is stored in the corpus storage device together with its hash value.
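Steps (4) to (6) can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not name a hash function or define which characters count as special, so SHA-256, the regular expression, and the function names `normalize`, `sentence_hash` and `deduplicate` are all assumptions.

```python
import hashlib
import re

# Assumption: "special characters, operators and digits" is read here as
# every character that is neither a letter nor whitespace.
_NON_LETTER = re.compile(r"[^\w\s]|[\d_]")


def normalize(text: str) -> str:
    """Step (4): strip symbols, operators and digits, then tidy whitespace."""
    stripped = _NON_LETTER.sub("", text)
    return " ".join(stripped.split())


def sentence_hash(text: str) -> str:
    """Step (5): hash the normalized text (SHA-256 is an assumption)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Step (6): keep a pair only if its source-side hash is not stored yet."""
    store = {}  # hash value -> (source, translation)
    for source, translation in pairs:
        h = sentence_hash(source)
        if h not in store:
            store[h] = (source, translation)
    return store
```

Because the hash is taken after normalization, sentences that differ only in digits or punctuation map to the same hash value, which is one plausible reading of how an exact hash comparison can also filter similar sentences.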
In addition to the above technical solution, the present invention may adopt the following further technical solutions.
Step (2) removes labels from the corpus. In some preferred embodiments, the labels added by other systems are first removed from the corpus obtained through the network, and then the corpus's own labels are removed.
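The two-pass label removal in step (2) might look like the sketch below. The patent does not say what form the labels take, so two assumptions are made here: labels from other systems are modeled as a hypothetical `[[...]]` marker, and the corpus's own labels are modeled as HTML-style tags. The function name `strip_labels` is illustrative only.

```python
import re
from html.parser import HTMLParser

# Hypothetical marker format left by an upstream system, e.g. "[[id=42]]".
_FOREIGN_MARK = re.compile(r"\[\[[^\]]*\]\]")


class _TextOnly(HTMLParser):
    """Collects text content and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_labels(raw: str) -> str:
    # Pass 1: remove markers added by other systems.
    without_marks = _FOREIGN_MARK.sub("", raw)
    # Pass 2: remove the corpus's own markup tags.
    parser = _TextOnly()
    parser.feed(without_marks)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())
```

For example, `strip_labels('[[id=42]]<p>Hello <b>world</b></p>')` leaves only the sentence text.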
The corpus storage device comprises a storage unit and a comparator. In some preferred embodiments, the storage unit can be partitioned by language.
The comparator comprises an input end, a comparison end and an output end, and the input end is connected to the computing system.
The method further includes a step of marking the corpus entries in the final database: entries with the same meaning are marked by language, and entries in the same language are marked by meaning. In some preferred embodiments, entries in the same language are assigned to the storage unit for that language, and entries in different languages that have the same meaning are marked and stored in parallel storage units.
The advantage of the present invention is that similar sentence pairs are not all stored in the corpus. This reduces redundant entries in the corpus, greatly reduces the storage resources used by the corpus storage device, and improves the working efficiency and translation quality of the translation system.
Description of the drawings
Fig. 1 is the flow chart of the present invention.
Detailed description
The present invention is further described below with reference to the accompanying drawing.
A similar sentence processing system based on bilingual parallel corpora, comprising the following steps:
(1) Connect to a network and obtain a parallel corpus through the network;
(2) The system preprocesses the corpus: the labels added by other systems are first removed from the corpus obtained through the network, and then the corpus's own labels are removed;
(3) Randomly select a source text from the corpus together with its corresponding translation;
(4) Remove special characters, operators and digits from the text;
(5) Calculate the hash value of the corpus entry with the system's computing system;
(6) Compare the hash value with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the corpus entry is not stored; if it differs from the hash values in the corpus storage device, the corpus entry is stored in the corpus storage device together with its hash value.
The corpus storage device comprises a storage unit and a comparator. In some preferred embodiments, the storage unit can be partitioned by language.
The comparator comprises an input end, a comparison end and an output end, and the input end is connected to the computing system.
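One way to picture the storage device is as a small class that combines a language-partitioned store with a comparison step. The class name `CorpusStorageDevice` and its methods are hypothetical; the patent describes hardware-style ends (input, comparison, output) and does not prescribe a software interface.

```python
class CorpusStorageDevice:
    """Sketch of a storage unit plus comparator, partitioned by language."""

    def __init__(self):
        # Storage unit: language code -> {hash value: sentence}
        self._storage = {}

    def _compare(self, language, hash_value):
        """Comparison end: is this hash already stored for the language?"""
        return hash_value in self._storage.get(language, {})

    def submit(self, language, hash_value, sentence):
        """Input end: receives a hash value from the computing system.

        Output end: returns True if the sentence was stored and False if
        it was rejected as a duplicate.
        """
        if self._compare(language, hash_value):
            return False
        self._storage.setdefault(language, {})[hash_value] = sentence
        return True

    def count(self, language):
        """Number of sentences currently stored for a language."""
        return len(self._storage.get(language, {}))
```

Keeping one dictionary per language means a hash is only compared against entries of the same language, which matches the idea of a storage unit partitioned by language.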
The method further includes a step of marking the corpus entries in the final database: entries with the same meaning are marked by language, and entries in the same language are marked by meaning. Entries in the same language are assigned to the storage unit for that language, and entries in different languages that have the same meaning are marked and stored in parallel storage units.
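The marking step can be illustrated with a shared identifier per sentence pair. This is only a sketch under assumptions: the patent does not define the mark format, so an incrementing integer `meaning_id` is used, and the function name `label_and_store` and the language codes are illustrative.

```python
from collections import defaultdict
from itertools import count


def label_and_store(pairs, src_lang="en", tgt_lang="zh"):
    """Give each sentence pair one meaning id and file it under both languages."""
    stores = defaultdict(dict)  # language -> {meaning_id: sentence}
    ids = count(1)
    for source, translation in pairs:
        meaning_id = next(ids)  # same id marks the same meaning in both languages
        stores[src_lang][meaning_id] = source
        stores[tgt_lang][meaning_id] = translation
    return dict(stores)
```

Within one language, different ids distinguish different meanings; across languages, equal ids link a sentence to its translation, so the per-language stores stay parallel.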
Taking English as an example: English text and its Chinese translation are obtained through the network, and the original labels and marks are removed. A suitable algorithm is used to extract the English portion of the English corpus entry and the corresponding portion of the Chinese translation. Special characters, operators, digits and the like are removed from the English entry to obtain pure English data. The system's computing system calculates the hash value of the entry, and this hash value is compared with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the entry is not stored; if it differs, the entry is stored in the corpus storage device together with its hash value. On storage, the two texts are given the same mark; the English entry is then stored in the corresponding English storage unit, and the corresponding Chinese data is stored in the corresponding Chinese storage unit.
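The English example can be traced end to end with a short sketch. Several details are assumptions rather than part of the patent: labels are treated as HTML-like tags, "pure English data" is taken to mean ASCII letters and spaces only, SHA-256 stands in for the unspecified hash, and the hash value doubles as the shared mark for the English and Chinese entries.

```python
import hashlib
import re

TAG = re.compile(r"<[^>]+>")        # assumption: labels are HTML-like tags
NOISE = re.compile(r"[^A-Za-z\s]")  # keep only English letters and whitespace


def pure_english(raw: str) -> str:
    """Remove tags, symbols, operators and digits from an English entry."""
    text = TAG.sub(" ", raw)
    text = NOISE.sub("", text)
    return " ".join(text.split())


def process(pairs):
    """Return (english_store, chinese_store), both keyed by the shared hash."""
    english_store, chinese_store = {}, {}
    for raw_en, raw_zh in pairs:
        h = hashlib.sha256(pure_english(raw_en).encode("utf-8")).hexdigest()
        if h in english_store:
            continue  # identical hash value already stored: skip this pair
        # Store the tag-free texts; the hash is the common mark for both.
        english_store[h] = TAG.sub("", raw_en).strip()
        chinese_store[h] = TAG.sub("", raw_zh).strip()
    return english_store, chinese_store
```

With the input pairs `("<p>Room 101 is open.</p>", "<p>101房间开放。</p>")` and `("Room 202 is open!", "202房间开放！")`, both English sides reduce to the same pure English data, so only the first pair is stored.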

Claims (5)

1. A similar sentence processing system based on bilingual parallel corpora, characterized in that it comprises the following steps:
(1) Connect to a network and obtain a parallel corpus through the network;
(2) The system preprocesses the corpus;
(3) Randomly select a source text from the corpus together with its corresponding translation;
(4) Remove special characters, operators and digits from the text;
(5) Calculate a hash value with the system's computing system;
(6) Compare the hash value with the hash values in the corpus storage device. If it is identical to a hash value already in the corpus storage device, the corpus entry is not stored; if it differs from the hash values in the corpus storage device, the corpus entry is stored in the corpus storage device together with its hash value.
2. The similar sentence processing system based on bilingual parallel corpora according to claim 1, characterized in that step (2) removes labels from the corpus.
3. The similar sentence processing system based on bilingual parallel corpora according to claim 1, characterized in that the corpus storage device comprises a storage unit and a comparator.
4. The similar sentence processing system based on bilingual parallel corpora according to claim 3, characterized in that the comparator comprises an input end, a comparison end and an output end, and the input end is connected to the computing system.
5. The similar sentence processing system based on bilingual parallel corpora according to claim 1, characterized in that the method further includes a step of marking the corpus entries in the final database, in which entries with the same meaning are marked by language and entries in the same language are marked by meaning.
CN201711460777.6A 2017-12-28 2017-12-28 A kind of similar sentence machining system based on bilingual teaching mode Pending CN108197120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711460777.6A CN108197120A (en) 2017-12-28 2017-12-28 A kind of similar sentence machining system based on bilingual teaching mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711460777.6A CN108197120A (en) 2017-12-28 2017-12-28 A kind of similar sentence machining system based on bilingual teaching mode

Publications (1)

Publication Number Publication Date
CN108197120A true CN108197120A (en) 2018-06-22

Family

ID=62585206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711460777.6A Pending CN108197120A (en) 2017-12-28 2017-12-28 A kind of similar sentence machining system based on bilingual teaching mode

Country Status (1)

Country Link
CN (1) CN108197120A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6081805A (en) * 1997-09-10 2000-06-27 Netscape Communications Corporation Pass-through architecture via hash techniques to remove duplicate query results
CN1916889A (en) * 2005-08-19 2007-02-21 株式会社日立制作所 Language material storage preparation device and its method
US8099415B2 (en) * 2006-09-08 2012-01-17 Simply Hired, Inc. Method and apparatus for assessing similarity between online job listings
CN102799647A (en) * 2012-06-30 2012-11-28 华为技术有限公司 Method and device for webpage reduplication deletion
US20130232160A1 (en) * 2012-03-02 2013-09-05 Semmle Limited Finding duplicate passages of text in a collection of text
CN103970722A (en) * 2014-05-07 2014-08-06 江苏金智教育信息技术有限公司 Text content duplicate removal method
CN104978311A (en) * 2015-07-15 2015-10-14 昆明理工大学 Vietnamese word segmentation method based on conditional random fields

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王东波等: "英汉双语句子级平行语料库自动构建", 《现代图书情报技术》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414648A (en) * 2020-03-04 2020-07-14 传神语联网网络科技股份有限公司 Corpus authentication method and apparatus
CN111414648B (en) * 2020-03-04 2023-05-12 传神语联网网络科技股份有限公司 Corpus authentication method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180622
