CN110321532A - Language preprocessing sentence segmentation method, computer device and computer-readable storage medium - Google Patents

Language preprocessing sentence segmentation method, computer device and computer-readable storage medium

Publication number
CN110321532A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910493707.3A
Other languages
Chinese (zh)
Inventor
王怡景
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Translation (chengdu) Information Technology Co Ltd
Original Assignee
Digital Translation (chengdu) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-06-06
Filing date
2019-06-06
Publication date
2019-10-11
Application filed by Digital Translation (chengdu) Information Technology Co Ltd
Priority to CN201910493707.3A
Publication of CN110321532A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/131 - Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities

Abstract

A language preprocessing sentence segmentation method, a computer device and a computer-readable storage medium. The method comprises: parsing a document to be processed to obtain at least one sub-document; combining the sub-documents to obtain a segmentation set, where the segmentation set contains at least one segment and each segment contains at least one sentence-break symbol; obtaining the sentence-break symbols contained in the segment and arranging them in a predetermined order; looping over the segments in the segmentation set according to the predetermined order and obtaining the position of the current sentence-break symbol; re-segmenting according to the characters before and after the symbol position and preset rules to obtain a processing set; and looping over the processing set to obtain a result set. The method optimizes sentence segmentation with a clear and efficient procedure. It also aims to ensure that, whatever the language of the original text, the parsed text is finally presented as complete, high-quality sentences suitable for reading and translation.

Description

Language preprocessing sentence segmentation method, computer device and computer-readable storage medium
Technical field
This application relates to the field of intelligent language processing, and in particular to a language preprocessing sentence segmentation method, a computer device and a computer-readable storage medium.
Background
Many kinds of WebCAT (web-based computer-aided translation) tools exist today, and they share the problem of CAT conversion, which mainly consists of parsing, sentence segmentation and deduplication. For parsing, third-party vendors provide framework support (such as itext, poi, etc.). Sentence segmentation and deduplication, however, have no standard, which leads to inaccurate language processing.
Summary of the invention
The main purpose of this application is to provide a language preprocessing sentence segmentation method that addresses the inaccurate language processing caused by the lack of a standard for sentence segmentation and deduplication.
To achieve the above purpose, according to one aspect of the application, a language preprocessing sentence segmentation method is provided, comprising: parsing a document to be processed to obtain at least one sub-document; combining the sub-documents to obtain a segmentation set, the segmentation set containing at least one segment and each segment containing at least one sentence-break symbol; obtaining the sentence-break symbols contained in the segment and arranging them in a predetermined order; looping over the segments in the segmentation set according to the predetermined order and obtaining the position of the current sentence-break symbol; re-segmenting according to the characters before and after the symbol position and preset rules, and obtaining a processing set; and looping over the processing set to obtain a result set.
Optionally, re-segmenting according to the characters before and after the symbol position and preset rules, and obtaining the processing set, comprises: if the current character belongs to an abbreviation, not breaking; when the original text is English, if the symbol is not followed by a space and a capital letter, not breaking; when the preceding character is a digit and the sentence-break symbol is a period, breaking if the following character is not a digit; and when the sentence-break symbol is immediately followed by a closing mark such as a closing quotation mark or bracket, not breaking at the symbol and shifting the break position one character later.
Optionally, the method further comprises: when the segmentation set contains whole-unit information, adding the whole-unit information directly to the result set.
Optionally, the whole-unit information is a hyperlink, number, time, time period or e-mail address.
Optionally, the method further comprises: determining the language of the segment.
Optionally, the language of the segment is matched against a preset sentence-break symbol set.
Optionally, looping over the processing set to obtain the result set comprises: adding to the result set each segmented sentence that ends with a sentence-break symbol, or with a sentence-break symbol followed by a closing quotation mark or bracket, and adding the text remaining after this sentence is segmented to the next sentence for the next iteration.
To achieve the above purpose, according to one aspect of the application, a computer device is also provided, comprising a memory, a processor and a computer program stored in the memory and executable by the processor, where the processor implements the method of any of the above when executing the computer program.
To achieve the above purpose, according to one aspect of the application, a computer-readable storage medium is also provided. It is a non-volatile readable storage medium storing a computer program that implements the method of any of the above when executed by a processor.
To achieve the above purpose, according to one aspect of the application, a computer program product is also provided, comprising computer-readable code that, when executed by a computer device, causes the computer device to perform the method of any of the above.
The language preprocessing sentence segmentation method provided by this application optimizes sentence segmentation with a clear and efficient procedure. It also aims to ensure that, whatever the language of the original text, the parsed text is finally presented as complete, high-quality sentences suitable for reading and translation.
Brief description of the drawings
The accompanying drawings, which form part of this application, provide a further understanding of the application and make its other features, objects and advantages more apparent. The drawings of the illustrative embodiments and their descriptions explain the application and do not unduly limit it. In the drawings:
Fig. 1 is an overall flowchart of the language preprocessing sentence segmentation method according to one embodiment of the application;
Fig. 2 is a flowchart of the language preprocessing sentence segmentation method according to one embodiment of the application;
Fig. 3 is a system schematic of the computer device according to one embodiment of the application; and
Fig. 4 is a schematic diagram of the computer-readable storage medium according to one embodiment of the application.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the application, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of this application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
In addition, the terms "installed", "arranged", "provided with", "connected", "coupled" and "sleeved" should be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or an internal connection between two devices, elements or components. Those of ordinary skill in the art can understand the specific meaning of these terms in this application according to the specific situation.
It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Figs. 1 and 2, this application relates to a language preprocessing sentence segmentation method, comprising:
S102: parsing a document to be processed to obtain at least one sub-document;
S104: combining the sub-documents to obtain a segmentation set, the segmentation set containing at least one segment and each segment containing at least one sentence-break symbol;
S106: obtaining the sentence-break symbols contained in the segment and arranging them in a predetermined order;
S108: looping over the segments in the segmentation set according to the predetermined order and obtaining the position of the current sentence-break symbol;
S110: re-segmenting according to the characters before and after the symbol position and preset rules, and obtaining a processing set;
S112: looping over the processing set to obtain a result set.
In use, a document (such as Word or PPT) is first uploaded to the editor, and the editor parses it into irregular, untidy words or fragments. A single sentence may be parsed into several fragments, or a single English word into several letters. For example, the original text is "I have a book. You also have one. He does too.", but a third-party framework may parse it into four fragments: "I", "have a", "book. You", "also have one. He does too." These fragments then need further processing.
The irregular fragments produced by parsing are then combined so that each combined item contains at least one sentence-break symbol. The combined items are stored in a list for segmentation; this list is called the segmentation set. For example, after the parsed fragments above are merged so that each item has at least one sentence-break symbol, the result is two items: "I have a book. You" and "also have one. He does too." The purpose of this operation is to simplify the subsequent segmentation steps.
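The combining step can be sketched in Python as follows. This is a minimal illustration and not the applicant's implementation: the function name `combine_fragments` and the contents of `BREAK_SYMBOLS` are assumptions, since the source does not list the exact symbol set used at this stage.

```python
# Assumed sentence-break symbols; the source does not fix this set.
BREAK_SYMBOLS = "。！？.!?"


def combine_fragments(fragments, symbols=BREAK_SYMBOLS):
    """Merge parser fragments until each merged item holds a break symbol."""
    merged = []
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        # Emit the buffer as soon as it contains at least one break symbol.
        if any(ch in symbols for ch in buffer):
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Assumption: trailing text without any symbol is kept as its own item.
        merged.append(buffer)
    return merged


fragments = ["I ", "have a ", "book. You ", "also have one. He does too."]
print(combine_fragments(fragments))
```

With the four fragments from the example, the sketch yields two items, matching the two-item segmentation set described above.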
Next, the sentence-break symbols contained in the item are obtained and put in order, and the item is then looped over in symbol order to find the position of the current sentence-break symbol. In this embodiment, the language of the original text is determined before the symbols are obtained, and the symbols contained in the item are then obtained and ordered according to that language. For example, the English original text is "I! Am the king. You". To segment this item, its sentence-break symbols are first sorted. The English sentence-break symbol set is {"!", "!", "?", "?", "...", " ", ".", "···"}; after processing, the symbols for this item are {"!", "."}.
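A hedged sketch of the symbol-collection step is shown below. It assumes that the "predetermined order" means ordering by first occurrence in the text, which is only one possible reading of the source; `ENGLISH_SYMBOLS` is an illustrative subset and `symbols_in_text` is a hypothetical helper name.

```python
# Illustrative subset of single-character break symbols (an assumption).
ENGLISH_SYMBOLS = ["!", "?", "."]


def symbols_in_text(text, symbols=ENGLISH_SYMBOLS):
    """Return the break symbols that occur in text, ordered by first position."""
    found = [(text.find(symbol), symbol) for symbol in symbols if symbol in text]
    return [symbol for _, symbol in sorted(found)]


print(symbols_in_text("I! Am the king. You"))
```

For the example item, only "!" and "." are present, so the later loop needs to consider just those two symbols.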
In one embodiment of the application, re-segmenting according to the characters before and after the symbol position and preset rules, and obtaining the processing set, comprises: if the current character belongs to an abbreviation, not breaking; when the original text is English, if the symbol is not followed by a space and a capital letter, not breaking; when the preceding character is a digit and the sentence-break symbol is a period, breaking if the following character is not a digit; and when the sentence-break symbol is immediately followed by a closing mark such as a closing quotation mark or bracket, not breaking at the symbol and shifting the break position one character later. Formal segmentation now begins, using the symbol set and the original text from the previous step. For example, with the original text "I! Am the king. You" and the symbols {"!", "."}, this step yields two sentences, "I!" and "Am the king."; the remaining word is added to the next item for segmentation.
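The first three rules can be expressed as a single predicate. The sketch below is an assumption-laden illustration: the abbreviation list, the function name `should_break` and the order in which the rules are checked are not given by the source. The closing-mark rule is left out here because it moves the break position rather than deciding whether to break.

```python
# Hypothetical abbreviation list; the source does not enumerate one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "etc."}


def should_break(text, i, english=True):
    """Decide whether the break symbol at text[i] ends a sentence."""
    next_char = text[i + 1] if i + 1 < len(text) else ""
    # Rule 1: the symbol closes a known abbreviation, so do not break.
    token_start = text.rfind(" ", 0, i) + 1
    if text[token_start:i + 1] in ABBREVIATIONS:
        return False
    # Rule 3: digit followed by a period; break only if no digit follows.
    if text[i] == "." and i > 0 and text[i - 1].isdigit():
        return not next_char.isdigit()
    # Rule 2: in English text, require a space plus a capital letter.
    if english and next_char:
        return next_char == " " and text[i + 2:i + 3].isupper()
    # End of text, or a language without the capital-letter rule.
    return True


sample = "Dr. Smith paid 3.5 dollars. He left."
for index, char in enumerate(sample):
    if char == ".":
        print(index, should_break(sample, index))
```

In the sample, the periods in "Dr." and "3.5" are rejected, while the periods after "dollars" and "left" are accepted as sentence ends.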
Loop over the processing set: a sentence that ends with a sentence-break symbol, or with a sentence-break symbol followed by ”, ) or ], is added to the result set, and the text remaining after this sentence is segmented is added to the next sentence for the next iteration.
This step optimizes segmentation for special cases such as "(hello!)": here "!" and ")" must not be separated. In addition, the last few words left over are added to the next sentence for segmentation.
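The loop with the closing-mark adjustment and the carried-over remainder might look like the following sketch. It is deliberately simplified: it breaks at every symbol and ignores the abbreviation, capital-letter and digit rules. The names `split_with_carry`, `SYMBOLS` and `CLOSERS`, and the decision to emit the final leftover text as its own entry, are assumptions.

```python
SYMBOLS = set(".!?。！？")    # assumed break symbols
CLOSERS = set("\"”')]）】")   # assumed closing quotes and brackets


def split_with_carry(items):
    """Split items at break symbols, carrying leftover text into the next item."""
    results = []
    carry = ""
    for item in items:
        text = carry + item
        start = 0
        i = 0
        while i < len(text):
            if text[i] in SYMBOLS:
                end = i + 1
                # Shift the break position past any closing marks.
                while end < len(text) and text[end] in CLOSERS:
                    end += 1
                results.append(text[start:end].strip())
                start = end
                i = end
            else:
                i += 1
        # Text after the last break is prepended to the next item.
        carry = text[start:]
    if carry.strip():
        # Assumption: leftover text at the very end is emitted as is.
        results.append(carry.strip())
    return results


print(split_with_carry(["I! Am the king. You", " said (hello!) Then left."]))
```

Here "You" is carried from the first item into the second, and "(hello!)" stays intact because the break position moves past the closing parenthesis.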
In one embodiment of the application, the method further comprises: when the segmentation set contains whole-unit information, adding the whole-unit information directly to the result set. In this embodiment, whole-unit information is a hyperlink, number, time, time period or e-mail address, but is not limited to these. Each item is checked to see whether it is a hyperlink, number, time period, time, e-mail address or the like; if it is, it is added directly to the result set without further processing. For example, if the parsed item is "2019" or a hyperlink, it is not segmented; such items are not translated, and the original text is copied directly into the translation.
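One way to detect such whole-unit items is with regular expressions, as in the sketch below. The patterns are illustrative assumptions, not taken from the source, and they only cover simple forms of each category.

```python
import re

# Illustrative patterns (assumptions); each must match the whole item.
WHOLE_UNIT_PATTERNS = [
    re.compile(r"https?://\S+"),                        # hyperlink
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),        # e-mail address
    re.compile(r"\d+(?:[.,]\d+)*"),                     # number
    re.compile(r"\d{1,2}:\d{2}(?::\d{2})?"),            # time
    re.compile(r"\d{1,2}:\d{2}\s*-\s*\d{1,2}:\d{2}"),   # time period
]


def is_whole_unit(item):
    """Return True if the item should bypass segmentation and translation."""
    stripped = item.strip()
    return any(p.fullmatch(stripped) for p in WHOLE_UNIT_PATTERNS)


for candidate in ["2019", "https://example.com/a?b=1", "I have a book."]:
    print(candidate, is_whole_unit(candidate))
```

Items for which `is_whole_unit` returns True would be appended to the result set unchanged, and everything else continues through the segmentation rules.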
In one embodiment of the application, the method further comprises: determining the language of the segment.
In one embodiment of the application, the language of the segment is matched against a preset sentence-break symbol set.
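As a rough illustration of matching a language to a preset symbol set, the sketch below uses a crude heuristic: any CJK ideograph marks the text as Chinese, otherwise it is treated as English. The heuristic, the two-language table `SYMBOL_SETS` and the function names are all assumptions; the source does not say how the language is determined.

```python
# Hypothetical preset symbol sets keyed by language code.
SYMBOL_SETS = {
    "zh": ["。", "！", "？"],
    "en": [".", "!", "?"],
}


def detect_language(text):
    """Crude heuristic: any CJK unified ideograph means Chinese."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "en"


def symbols_for(text):
    """Pick the preset break-symbol set that matches the text's language."""
    return SYMBOL_SETS[detect_language(text)]


print(symbols_for("I have a book."))
print(symbols_for("我有一本书。"))
```

A production system would likely need a proper language identifier and more symbol sets; the point here is only that the symbol set is selected per language before segmentation starts.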
In one embodiment of the application, looping over the processing set to obtain the result set comprises: adding to the result set each sentence that ends with a sentence-break symbol, or with a sentence-break symbol followed by a closing quotation mark or bracket, and adding the text remaining after this sentence is segmented to the next sentence for the next iteration.
As shown in Fig. 3, the application also provides a computer device, comprising a memory, a processor and a computer program stored in the memory and executable by the processor, where the processor implements the method of any of the above when executing the computer program.
As shown in Fig. 4, the application also provides a computer-readable storage medium. It is a non-volatile readable storage medium storing a computer program that implements the method of any of the above when executed by a processor.
The application also provides a computer program product, comprising computer-readable code that, when executed by a computer device, causes the computer device to perform the method of any of the above.
The language preprocessing sentence segmentation method provided by this application is suitable for relational databases and for all existing programming languages. It implements the segmentation step of preprocessing, turning parsed text into text that can be read normally and translated with high quality.
The above are only preferred embodiments of this application and do not limit it. Those skilled in the art may make various changes and variations to this application. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims (10)

1. A language preprocessing sentence segmentation method, characterized by comprising:
parsing a document to be processed to obtain at least one sub-document;
combining the sub-documents to obtain a segmentation set, the segmentation set containing at least one segment and each segment containing at least one sentence-break symbol;
obtaining the sentence-break symbols contained in the segment and arranging them in a predetermined order;
looping over the segments in the segmentation set according to the predetermined order and obtaining the position of the current sentence-break symbol;
re-segmenting according to the characters before and after the symbol position and preset rules, and obtaining a processing set;
looping over the processing set to obtain a result set.
2. The language preprocessing sentence segmentation method according to claim 1, characterized in that re-segmenting according to the characters before and after the symbol position and preset rules, and obtaining the processing set, comprises:
if the current character belongs to an abbreviation, not breaking;
when the original text is English, if the symbol is not followed by a space and a capital letter, not breaking;
when the preceding character is a digit and the sentence-break symbol is a period, breaking if the following character is not a digit;
when the sentence-break symbol is immediately followed by a closing mark such as a closing quotation mark or bracket, not breaking at the symbol and shifting the break position one character later.
3. The language preprocessing sentence segmentation method according to claim 1, characterized in that the method further comprises: when the segmentation set contains whole-unit information, adding the whole-unit information directly to the result set.
4. The language preprocessing sentence segmentation method according to claim 3, characterized in that the whole-unit information is a hyperlink, number, time, time period or e-mail address.
5. The language preprocessing sentence segmentation method according to claim 1, characterized in that the method further comprises:
determining the language of the segment.
6. The language preprocessing sentence segmentation method according to claim 5, characterized in that the language of the segment is matched against a preset sentence-break symbol set.
7. The language preprocessing sentence segmentation method according to claim 1, characterized in that looping over the processing set to obtain the result set comprises: adding to the result set each sentence that ends with a sentence-break symbol, or with a sentence-break symbol followed by a closing quotation mark or bracket, and adding the text remaining after this sentence is segmented to the next sentence for the next iteration.
8. A computer device, comprising a memory, a processor and a computer program stored in the memory and executable by the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1-7.
9. A computer-readable storage medium, being a non-volatile readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
10. A computer program product, comprising computer-readable code, characterized in that the computer-readable code, when executed by a computer device, causes the computer device to perform the method according to any one of claims 1-7.
CN201910493707.3A 2019-06-06 2019-06-06 Language preprocessing sentence segmentation method, computer device and computer-readable storage medium Pending CN110321532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910493707.3A CN110321532A (en) 2019-06-06 2019-06-06 Language preprocessing sentence segmentation method, computer device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN110321532A true CN110321532A (en) 2019-10-11

Family

ID=68120864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910493707.3A Pending CN110321532A (en) 2019-06-06 2019-06-06 Language preprocessing sentence segmentation method, computer device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110321532A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347757A (en) * 2020-10-12 2021-02-09 四川语言桥信息技术有限公司 Parallel corpus alignment method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564526A (en) * 2017-07-28 2018-01-09 北京搜狗科技发展有限公司 Processing method, device and machine readable media
CN107832308A (en) * 2017-12-11 2018-03-23 中译语通科技股份有限公司 A kind of punctuate method and system of machine translation, computer program, computer
CN108073572A (en) * 2016-11-16 2018-05-25 北京搜狗科技发展有限公司 Information processing method and its device, simultaneous interpretation system
CN108628819A (en) * 2017-03-16 2018-10-09 北京搜狗科技发展有限公司 Treating method and apparatus, the device for processing
CN109284503A (en) * 2018-10-22 2019-01-29 传神语联网网络科技股份有限公司 Translate Statement Completion judgment method and system
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191011