CN110321532A - Language preprocessing sentence segmentation method, computer device and computer-readable storage medium
- Publication number
- CN110321532A (application number CN201910493707.3A)
- Authority
- CN
- China
- Prior art keywords
- punctuate
- language
- symbol
- processes
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
A language preprocessing sentence segmentation method, a computer device and a computer-readable storage medium. The method includes: parsing a document to be processed to obtain at least one sub-document; combining the sub-documents to obtain a segmentation set, where the segmentation set contains at least one candidate segment and each candidate segment contains at least one break symbol; obtaining the break symbols contained in the candidate segment and arranging them in a predetermined order; looping over the candidate segments in the segmentation set according to the predetermined order to obtain the position of the current break symbol; re-segmenting according to the characters before and after the break symbol position and preset rules to obtain a processing set; and looping over the processing set to obtain a result set. The method optimizes sentence segmentation with a clear and efficient procedure. It is also intended to ensure that, whatever the source language, the parsed text ends up as complete, high-quality sentences suitable for reading and translation.
Description
Technical field
This application relates to the field of intelligent language processing, and in particular to a language preprocessing sentence segmentation method, a computer device and a computer-readable storage medium.
Background technique
Many WebCAT (web-based computer-aided translation) tools are available today, and each performs CAT conversion of documents. The conversion mainly consists of parsing, sentence segmentation and deduplication. Parsing is supported by frameworks from third-party vendors (such as iText and POI). Sentence segmentation and deduplication, however, have no standard, which makes language processing inaccurate.
Summary of the invention
The main purpose of this application is to provide a language preprocessing sentence segmentation method that solves the problem of inaccurate language processing caused by the lack of a standard for sentence segmentation and deduplication.
To achieve the above purpose, according to one aspect of this application, a language preprocessing sentence segmentation method is provided, including: parsing a document to be processed to obtain at least one sub-document; combining the sub-documents to obtain a segmentation set, where the segmentation set contains at least one candidate segment and each candidate segment contains at least one break symbol; obtaining the break symbols contained in the candidate segment and arranging them in a predetermined order; looping over the candidate segments in the segmentation set according to the predetermined order to obtain the position of the current break symbol; re-segmenting according to the characters before and after the break symbol position and preset rules to obtain a processing set; and looping over the processing set to obtain a result set.
Optionally, re-segmenting according to the characters before and after the break symbol position and the preset rules to obtain the processing set includes: when the current character belongs to an abbreviation, not breaking; when the source text is English and the following characters are not a space plus a capital letter, not breaking; when the current character is a digit, the break symbol is a period and the following character is not a digit, breaking; and when the break symbol is followed by a closing quotation mark or closing bracket, not breaking at the symbol but shifting the break position one character later.
Optionally, the method further includes: when the segmentation set contains integral information, adding the integral information directly to the result set.
Optionally, the integral information is a hyperlink, a number, a time, a time period or an e-mail address.
Optionally, the method further includes: determining the language of the candidate segment.
Optionally, the language of the candidate segment is matched against a preset set of break symbols.
Optionally, looping over the processing set to obtain the result set includes: adding to the result set each sentence that ends with a break symbol, or with a break symbol followed by a closing quotation mark or closing bracket, and prepending the text remaining after segmenting the current item to the next item for the next iteration.
To achieve the above purpose, according to one aspect of this application, a computer device is also provided, including a memory, a processor and a computer program stored in the memory and executable by the processor, where the processor implements the method of any of the above items when executing the computer program.
To achieve the above purpose, according to one aspect of this application, a computer-readable storage medium is also provided. It is a non-volatile readable storage medium storing a computer program that implements the method of any of the above items when executed by a processor.
To achieve the above purpose, according to one aspect of this application, a computer program product is also provided, including computer-readable code that, when executed by a computer device, causes the computer device to perform the method of any of the above items.
The language preprocessing sentence segmentation method provided by this application optimizes sentence segmentation with a clear and efficient procedure. It is also intended to ensure that, whatever the source language, the parsed text ends up as complete, high-quality sentences suitable for reading and translation.
Detailed description of the invention
The accompanying drawings, which form part of this application, provide a further understanding of the application and make its other features, objects and advantages more apparent. The drawings of the illustrative embodiments and their descriptions explain the application and do not unduly limit it. In the drawings:
Fig. 1 is an overall flowchart of the language preprocessing sentence segmentation method according to one embodiment of this application;
Fig. 2 is a flowchart of the language preprocessing sentence segmentation method according to one embodiment of this application;
Fig. 3 is a system diagram of the computer device according to one embodiment of this application; and
Fig. 4 is a schematic diagram of the computer-readable storage medium according to one embodiment of this application.
Specific embodiment
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of this application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the drawings are used to distinguish similar objects, not to describe a particular order or sequence. Data so used are interchangeable where appropriate for the embodiments described here. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
In addition, the terms "install", "set", "provided with", "connect", "connected" and "socketed" are to be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or internal between two devices, elements or components. A person of ordinary skill in the art can understand the specific meaning of these terms in this application according to the circumstances.
It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another. The application is described in detail below with reference to the drawings and the embodiments.
As shown in Fig. 1 and Fig. 2, this application relates to a language preprocessing sentence segmentation method, including:
S102: parsing a document to be processed to obtain at least one sub-document;
S104: combining the sub-documents to obtain a segmentation set, where the segmentation set contains at least one candidate segment and each candidate segment contains at least one break symbol;
S106: obtaining the break symbols contained in the candidate segment and arranging them in a predetermined order;
S108: looping over the candidate segments in the segmentation set according to the predetermined order to obtain the position of the current break symbol;
S110: re-segmenting according to the characters before and after the break symbol position and preset rules to obtain a processing set;
S112: looping over the processing set to obtain a result set.
In use, a document (such as Word or PPT) is first uploaded to the editor, which parses it into irregular, untidy words or phrases. What should be a single sentence may be parsed into several fragments, or a single English word may be parsed into several letters. For example, the source text is "I have a book. You have one too. So does he.", but the third-party framework may parse it into four fragments: "I", "have a", "book. You" and "have one too. So does he." These fragments then need further processing.
The irregular fragments produced by parsing are then combined so that each combined item contains at least one break symbol. The combined items are stored in a list for segmentation; this list is called the segmentation set. For example, merging the parsed fragments so that each item has at least one break symbol gives two items, "I have a book. You" and "have one too. So does he." The purpose of this operation is to simplify the subsequent segmentation steps.
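The combining step can be illustrated with a minimal Python sketch. The function name `combine_fragments` and the contents of `BREAK_SYMBOLS` are assumptions made for illustration; the application does not give a definitive symbol list or implementation.

```python
# Hypothetical set of break symbols; the application does not list a definitive set.
BREAK_SYMBOLS = set(".!?。!?")


def combine_fragments(fragments):
    """Merge parser fragments so that each merged item holds at least one break symbol."""
    merged, buffer = [], ""
    for fragment in fragments:
        buffer += fragment
        # Emit the buffer as soon as it contains a break symbol.
        if any(ch in BREAK_SYMBOLS for ch in buffer):
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Trailing text without any break symbol is kept as a final item.
        merged.append(buffer)
    return merged
```

Called on the four fragments of the example, `combine_fragments(["I", " have a", " book. You", " have one too. So does he."])` yields the two items `"I have a book. You"` and `" have one too. So does he."`.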
Next, the break symbols contained in the item are obtained in order, and the item is then looped over in break-symbol order to find the position of the current break symbol. In this embodiment, the source language is determined before the break symbols are obtained, and the break symbols contained in the item are then obtained according to the source language. For example, the English source text is "I! Am king. You". To segment it, the break symbols are first sorted. The break symbol set for English is a set such as {"!", "!", "?", "?", "...", ".", "···"}; after processing, the break symbols for this item are {"!", "."}.
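A minimal Python sketch of this lookup follows. The table `SYMBOLS_BY_LANGUAGE`, its contents and the choice to let list order stand in for the "predetermined order" are all assumptions; the application does not define the order.

```python
# Hypothetical per-language symbol lists; list order stands in for the "predetermined order".
SYMBOLS_BY_LANGUAGE = {
    "en": ["!", "?", "...", "."],
    "zh": ["!", "?", "……", "。"],
}


def symbols_in_segment(segment, language):
    """Return the preset break symbols that occur in the segment, in preset order."""
    return [symbol for symbol in SYMBOLS_BY_LANGUAGE[language] if symbol in segment]
```

For the example item, `symbols_in_segment("I! Am king. You", "en")` returns `["!", "."]`.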
In one embodiment of this application, re-segmenting according to the characters before and after the break symbol position and the preset rules to obtain the processing set includes: when the current character belongs to an abbreviation, not breaking; when the source text is English and the following characters are not a space plus a capital letter, not breaking; when the current character is a digit, the break symbol is a period and the following character is not a digit, breaking; and when the break symbol is followed by a closing quotation mark or closing bracket, not breaking at the symbol but shifting the break position one character later. Formal segmentation now begins, using the break symbols from the previous step and the source text. With the source text "I! Am king. You" and break symbols {"!", "."}, this step yields two sentences, "I!" and "Am king."; the remaining word "You" is prepended to the next item for segmentation.
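The four rules can be sketched in Python as a single decision function. The names `break_position`, `ABBREVIATIONS` and `CLOSERS`, the sample abbreviation list, and the order in which the rules are checked are assumptions; the application states the rules but not their precedence.

```python
import re

ABBREVIATIONS = {"Mr", "Mrs", "Dr", "etc"}   # hypothetical sample list
CLOSERS = {'"', "'", ")", "]", "”", "’"}     # hypothetical closing marks


def break_position(text, i):
    """Return the index after which to split for the break symbol text[i], or None."""
    before, symbol, after = text[:i], text[i], text[i + 1:]
    # A closing mark follows: keep it with the sentence by splitting one character later.
    if after[:1] in CLOSERS:
        return i + 1
    if symbol == ".":
        words = before.split()
        # The period ends an abbreviation: no break.
        if words and words[-1] in ABBREVIATIONS:
            return None
        # Digit before the period: break only when no digit follows (so "3.14" stays whole).
        if before[-1:].isdigit():
            return None if after[:1].isdigit() else i
    # English check: break only at end of text or before a space plus a capital letter.
    if after == "" or re.match(r"\s[A-Z]", after):
        return i
    return None
```

Under these assumptions, both symbols in "I! Am king. You" are accepted as break points, while the period in "Dr. Smith" or "3.14" is not.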
The processing set is then looped over: each sentence that ends with a break symbol, or with a break symbol followed by a closing mark such as ", ) or ], is added to the result set, and the text remaining after segmenting the current item is prepended to the next item for the next iteration.
This step optimizes segmentation for special cases. For example, in "(Hello!)" the "!" and the ")" are not separated. In addition, the trailing words left over after splitting are prepended to the next item for segmentation.
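The carry-over loop can be sketched in Python as below. The regular expression `SENTENCE_END` is a deliberately simplified stand-in for the break rules (it splits after any run of break symbols plus optional closing marks and ignores abbreviations and decimals); it and the function name `segment_with_carry` are assumptions for illustration.

```python
import re

# Simplified stand-in for the break rules: a run of break symbols plus optional closing marks.
SENTENCE_END = re.compile(r'[.!?。!?]+["”’)\]]*')


def segment_with_carry(items):
    """Split each item into sentences; prepend any unfinished tail to the next item."""
    results, carry = [], ""
    for item in items:
        text = carry + item
        last_end = 0
        for match in SENTENCE_END.finditer(text):
            # Everything up to and including the break symbol is one finished sentence.
            results.append(text[last_end:match.end()].strip())
            last_end = match.end()
        # Text after the last break symbol waits for the next item.
        carry = text[last_end:]
    if carry.strip():
        results.append(carry.strip())
    return results
```

With the two combined items from the earlier example, the leftover " You" of the first item joins the second item, giving the sentences "I have a book.", "You have one too." and "So does he."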
In one embodiment of this application, the method further includes: when the segmentation set contains integral information, adding the integral information directly to the result set. In this embodiment, the integral information is a hyperlink, a number, a time, a time period or an e-mail address, but is not limited to these. Each item is checked for whether it is a hyperlink, a number, a time period, a time, an e-mail address or the like; if so, it needs no further processing and is added directly to the result set. For example, if the parsed item is "2019" or a hyperlink, it is not segmented. Such items are not translated; the source text can be copied directly into the translation.
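A minimal Python sketch of the integral-information check follows. The regular expressions are hypothetical: the application names the categories but gives no concrete formats, so these patterns only illustrate the idea of matching the whole item.

```python
import re

# Hypothetical patterns; the application names the categories but not their formats.
WHOLE_UNIT_PATTERNS = [
    re.compile(r"https?://\S+"),                         # hyperlink
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),           # e-mail address
    re.compile(r"\d{1,2}:\d{2}\s*-\s*\d{1,2}:\d{2}"),    # time period
    re.compile(r"\d{1,2}:\d{2}(:\d{2})?"),               # time
    re.compile(r"[+-]?\d+(\.\d+)?"),                     # number
]


def is_whole_unit(text):
    """Return True when the entire item matches one of the integral-information patterns."""
    stripped = text.strip()
    return any(pattern.fullmatch(stripped) for pattern in WHOLE_UNIT_PATTERNS)
```

Using `fullmatch` means an item such as "I have 2019 books." is still segmented normally, while a bare "2019" is passed straight to the result set.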
In one embodiment of this application, the method further includes: determining the language of the candidate segment.
In one embodiment of this application, the language of the candidate segment is matched against a preset set of break symbols.
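The application does not say how the language is determined. As one possible illustration only, a rough Python heuristic can treat any text containing a CJK Unified Ideograph as Chinese and everything else as English; the function name `guess_language` and the two language codes are assumptions.

```python
def guess_language(text):
    """Rough guess: 'zh' if any CJK Unified Ideograph (U+4E00..U+9FFF) is present, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"
```

The returned code could then be used as the key for looking up the preset break symbol set for that language.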
In one embodiment of this application, looping over the processing set to obtain the result set includes: adding to the result set each sentence that ends with a break symbol, or with a break symbol followed by a closing quotation mark or closing bracket, and prepending the text remaining after segmenting the current item to the next item for the next iteration.
As shown in Fig. 3, this application also provides a computer device, including a memory, a processor and a computer program stored in the memory and executable by the processor, where the processor implements the method of any of the above items when executing the computer program.
As shown in Fig. 4, this application also provides a computer-readable storage medium. It is a non-volatile readable storage medium storing a computer program that implements the method of any of the above items when executed by a processor.
This application also provides a computer program product, including computer-readable code that, when executed by a computer device, causes the computer device to perform the method of any of the above items.
The language preprocessing sentence segmentation method provided by this application is compatible with relational databases and can be implemented in existing programming languages. It carries out the segmentation step of preprocessing, turning parsed text into text that can be read normally and translated with high quality.
The above are only preferred embodiments of this application and are not intended to limit it. For those skilled in the art, various modifications and changes to this application are possible. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.
Claims (10)
1. A language preprocessing sentence segmentation method, characterized by comprising:
parsing a document to be processed to obtain at least one sub-document;
combining the sub-documents to obtain a segmentation set, where the segmentation set contains at least one candidate segment and each candidate segment contains at least one break symbol;
obtaining the break symbols contained in the candidate segment and arranging them in a predetermined order;
looping over the candidate segments in the segmentation set according to the predetermined order to obtain the position of the current break symbol;
re-segmenting according to the characters before and after the break symbol position and preset rules to obtain a processing set;
looping over the processing set to obtain a result set.
2. The language preprocessing sentence segmentation method according to claim 1, characterized in that re-segmenting according to the characters before and after the break symbol position and the preset rules to obtain the processing set comprises:
when the current character belongs to an abbreviation, not breaking;
when the source text is English and the following characters are not a space plus a capital letter, not breaking;
when the current character is a digit, the break symbol is a period and the following character is not a digit, breaking;
when the break symbol is followed by a closing quotation mark or closing bracket, not breaking at the symbol but shifting the break position one character later.
3. The language preprocessing sentence segmentation method according to claim 1, characterized in that the method further comprises: when the segmentation set contains integral information, adding the integral information directly to the result set.
4. The language preprocessing sentence segmentation method according to claim 3, characterized in that the integral information is a hyperlink, a number, a time, a time period or an e-mail address.
5. The language preprocessing sentence segmentation method according to claim 1, characterized in that the method further comprises:
determining the language of the candidate segment.
6. The language preprocessing sentence segmentation method according to claim 5, characterized in that the language of the candidate segment is matched against a preset set of break symbols.
7. The language preprocessing sentence segmentation method according to claim 1, characterized in that looping over the processing set to obtain the result set comprises: adding to the result set each sentence that ends with a break symbol, or with a break symbol followed by a closing quotation mark or closing bracket, and prepending the text remaining after segmenting the current item to the next item for the next iteration.
8. A computer device, comprising a memory, a processor and a computer program stored in the memory and executable by the processor, characterized in that the processor implements the method according to any one of claims 1-7 when executing the computer program.
9. A computer-readable storage medium, being a non-volatile readable storage medium storing a computer program, characterized in that the computer program implements the method according to any one of claims 1-7 when executed by a processor.
10. A computer program product, comprising computer-readable code, characterized in that, when the computer-readable code is executed by a computer device, the computer device is caused to perform the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910493707.3A CN110321532A (en) | 2019-06-06 | 2019-06-06 | Language pre-processes punctuate method, computer equipment and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910493707.3A CN110321532A (en) | 2019-06-06 | 2019-06-06 | Language pre-processes punctuate method, computer equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110321532A true CN110321532A (en) | 2019-10-11 |
Family
ID=68120864
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910493707.3A Pending CN110321532A (en) | 2019-06-06 | 2019-06-06 | Language pre-processes punctuate method, computer equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321532A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347757A (en) * | 2020-10-12 | 2021-02-09 | 四川语言桥信息技术有限公司 | Parallel corpus alignment method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107564526A (en) * | 2017-07-28 | 2018-01-09 | 北京搜狗科技发展有限公司 | Processing method, device and machine readable media |
CN107832308A (en) * | 2017-12-11 | 2018-03-23 | 中译语通科技股份有限公司 | A kind of punctuate method and system of machine translation, computer program, computer |
CN108073572A (en) * | 2016-11-16 | 2018-05-25 | 北京搜狗科技发展有限公司 | Information processing method and its device, simultaneous interpretation system |
CN108628819A (en) * | 2017-03-16 | 2018-10-09 | 北京搜狗科技发展有限公司 | Treating method and apparatus, the device for processing |
CN109284503A (en) * | 2018-10-22 | 2019-01-29 | 传神语联网网络科技股份有限公司 | Translate Statement Completion judgment method and system |
CN109408833A (en) * | 2018-10-30 | 2019-03-01 | 科大讯飞股份有限公司 | A kind of interpretation method, device, equipment and readable storage medium storing program for executing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107832229B (en) | NLP-based system test case automatic generation method | |
CN103593352B (en) | A kind of mass data cleaning method and device | |
CN105243055B (en) | Based on multilingual segmenting method and device | |
CN113282955B (en) | Method, system, terminal and medium for extracting privacy information in privacy policy | |
CN107992476B (en) | Corpus generation method and system for sentence-level biological relation network extraction | |
CN109446526B (en) | Method and device for constructing implicit chapter relation corpus and storage medium | |
Evert | A Lightweight and Efficient Tool for Cleaning Web Pages. | |
Vel | Pre-processing techniques of text mining using computational linguistics and python libraries | |
CN112257462A (en) | Hypertext markup language translation method based on neural machine translation technology | |
CN109299470B (en) | Method and system for extracting trigger words in text bulletin | |
CN115794225A (en) | Method for processing business flow based on natural language | |
CN102163189A (en) | Method and device for extracting evaluative information from critical texts | |
CN110321532A (en) | Language pre-processes punctuate method, computer equipment and computer readable storage medium | |
CN109446277A (en) | Relational data intelligent search method and system based on Chinese natural language | |
CN111090755B (en) | Text incidence relation judging method and storage medium | |
CN102629244B (en) | Multi-language work card generating system and method | |
CN109684395B (en) | Visual data interface universal analysis method based on natural language processing | |
CN113052544A (en) | Method and device for intelligently adapting workflow according to user behavior and storage medium | |
CN109558580B (en) | Text analysis method and device | |
CN102982029B (en) | A kind of search need recognition methods and device | |
US11874864B2 (en) | Method and system for creating a domain-specific training corpus from generic domain corpora | |
CN110826330B (en) | Name recognition method and device, computer equipment and readable storage medium | |
CN113128231A (en) | Data quality inspection method and device, storage medium and electronic equipment | |
CN110866394A (en) | Company name identification method and device, computer equipment and readable storage medium | |
CN110610001A (en) | Short text integrity identification method and device, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191011 |