CN108628819B - Processing method and device for processing - Google Patents

Processing method and device for processing

Info

Publication number
CN108628819B
Authority
CN
China
Prior art keywords
sentence
optimal
text
processed
breaking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710157267.5A
Other languages
Chinese (zh)
Other versions
CN108628819A (en)
Inventor
姜里羊
王宇光
陈伟
程善伯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201710157267.5A
Publication of CN108628819A
Application granted
Publication of CN108628819B
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the invention provide a processing method, a processing device, and a device for processing, wherein the method specifically comprises: acquiring a text to be processed; acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed, wherein the comprehensive translation quality of the optimal sentence-breaking result is optimal, the optimal sentence-breaking result comprises at least one sentence, and the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in a sentence-breaking result; and outputting the optimal sentence-breaking result corresponding to the text to be processed. The embodiments of the invention can improve the translation quality of the sentence-breaking result corresponding to the text to be processed.

Description

Processing method and device for processing
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a processing method and apparatus, and an apparatus for processing.
Background
Sentence-breaking technology is an important basic technology in the field of natural language processing. The sentence break is to cut the text into sentences with complete semantics. Since segmenting a text into sentences with complete semantics is the first step in implementing machine recognition of human languages, sentence-breaking techniques are widely applied in application branches of natural language processing such as machine translation, speech recognition, information services, and the like.
Machine translation technology refers to the process of converting one natural language (the source language) into another natural language (the target language) using a computer. Before translating, traditional machine translation technology usually performs sentence-breaking processing on a source text input by a user or obtained by speech recognition, and then performs machine translation on the sentence-breaking result; the accuracy of the sentence-breaking result therefore directly influences the machine translation quality.
Existing schemes usually break sentences by setting thresholds. For example, a sentence break is performed if the number of commas contained in a text exceeds a first threshold or if the number of words contained in the text exceeds a second threshold.
However, the sentence-breaking results obtained by existing schemes easily contain sentences with incomplete semantics, and such sentences lower the translation quality of machine translation.
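For contrast, the threshold-based scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `threshold_break`, the parameter names, and the choice to cut at the comma where a threshold is first exceeded are all hypothetical, since the background gives no further detail.

```python
def threshold_break(text, max_commas=2, max_words=12):
    """Baseline sketch: cut at a comma once either running count exceeds its threshold.

    max_commas plays the role of the "first threshold" and max_words the
    "second threshold"; both default values are arbitrary assumptions.
    """
    sentences, current = [], []
    commas = words = 0
    for token in text.split():
        current.append(token)
        words += 1
        if token.endswith(","):
            commas += 1
            # The cut ignores meaning entirely, which is how semantically
            # incomplete sentences can arise.
            if commas > max_commas or words > max_words:
                sentences.append(" ".join(current))
                current, commas, words = [], 0, 0
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Because the cut position depends only on counts, a sentence can be broken in the middle of a semantic unit, which is the weakness the background points out.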
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a processing method, a processing apparatus, and an apparatus for processing, which overcome or at least partially solve the above problems, and can improve the translation quality of a sentence break result corresponding to a text to be processed.
In order to solve the above problem, the present invention discloses a processing method, comprising:
acquiring a text to be processed;
acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed; the comprehensive translation quality of the optimal sentence-breaking result is optimal, and the optimal sentence-breaking result comprises: at least one sentence, wherein the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in the sentence-breaking result;
and outputting the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, the obtaining an optimal sentence break result corresponding to the text to be processed according to a segmentation point obtained based on a preset punctuation mark included in the text to be processed includes:
and acquiring an optimal sentence-breaking result corresponding to the text to be processed by utilizing a dynamic programming algorithm according to segmentation points obtained based on preset punctuation marks contained in the text to be processed.
Optionally, the obtaining, by using a dynamic programming algorithm, an optimal sentence break result corresponding to the text to be processed according to a segmentation point obtained based on a preset punctuation mark included in the text to be processed includes:
determining a clause sequence set corresponding to the text to be processed according to preset punctuation marks contained in the text to be processed;
determining, in a recursive manner and in order of the subsets of the clause sequence set from small to large, a backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset; the comprehensive translation quality corresponding to the optimal subset sentence-breaking result is optimal;
and obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set.
Optionally, a subset of the clause sequence set comprises: the first i clauses of the text to be processed, the optimal subset comprehensive translation quality score corresponding to the first i clauses is denoted F(i), and i is greater than or equal to 0 and less than or equal to the number M of clauses of the text to be processed; the determining, in a recursive manner and in order of the subsets of the clause sequence set from small to large, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset then comprises:
segmenting the first i clauses at a segmentation point k to obtain, for the first i clauses and the segmentation point k, the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit; wherein the first semantic unit comprises: the clauses among the first i clauses that precede the segmentation point k, the second semantic unit comprises: the clauses among the first i clauses that follow the segmentation point k, and k is greater than or equal to 0 and less than i;
combining F(k) and the translation quality score of the second semantic unit to obtain a comprehensive translation quality score corresponding to the first i clauses and the segmentation point k;
acquiring, according to the comprehensive translation quality scores corresponding to the first i clauses and each segmentation point k, a target segmentation point corresponding to the optimal comprehensive translation quality score from the at least one segmentation point k corresponding to the first i clauses;
and taking the target segmentation point as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and taking the comprehensive translation quality score corresponding to the target segmentation point as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
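The recursion over the first i clauses can be sketched as a small dynamic program. The sketch below is illustrative only: the function name `best_break_points`, the `score` callable (a stand-in for whatever translation quality score is used for a semantic unit), the use of addition to combine F(k) with that score, and the base case F(0) = 0 are all assumptions not fixed by the text.

```python
def best_break_points(clauses, score):
    """Return (F, back) for the recursion over the first i clauses.

    F[i]    : best comprehensive score for the first i clauses
    back[i] : segmentation point k that achieves F[i] (the backtracking point)
    score   : hypothetical callable giving the translation quality score of
              a list of clauses merged into one sentence
    """
    m = len(clauses)
    F = [0.0] * (m + 1)          # F[0] = 0 for the empty prefix (assumed base case)
    back = [0] * (m + 1)
    for i in range(1, m + 1):    # subsets from small to large
        best_k, best_val = 0, float("-inf")
        for k in range(i):       # segmentation point k with 0 <= k and k < i
            # First semantic unit: first k clauses, already scored as F[k].
            # Second semantic unit: clauses k..i-1, scored as one sentence.
            val = F[k] + score(clauses[k:i])   # combination by sum is an assumption
            if val > best_val:
                best_k, best_val = k, val
        F[i], back[i] = best_val, best_k
    return F, back
```

With M clauses this evaluates on the order of M² candidate units instead of enumerating every possible sentence-breaking result.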
Optionally, the obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set includes:
backtracking through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the maximum subset of the clause sequence set;
and segmenting the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the maximum subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, the backtracking through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set includes:
acquiring a first backtracking segmentation point P1 corresponding to the first i clauses;
and acquiring a second backtracking segmentation point P2 corresponding to a clause which is included in the text to be processed and is positioned before the first backtracking segmentation point P1.
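The backtracking step can be illustrated as follows. This is a sketch under assumptions: `back` is a hypothetical list in which `back[i]` holds the backtracking segmentation point recorded for the first i clauses (with `back[i]` strictly smaller than `i`), and the function name `backtrack` is likewise illustrative.

```python
def backtrack(back, clauses):
    """Recover the sentences by following backtracking segmentation points.

    Starting from the maximum subset (all clauses), read P1 = back[i],
    emit clauses P1..i-1 as one sentence, then repeat on the first P1
    clauses to read P2, and so on until the start of the text is reached.
    """
    sentences = []
    i = len(clauses)
    while i > 0:
        k = back[i]                    # backtracking point for the first i clauses
        sentences.append(clauses[k:i]) # sentence formed by clauses after point k
        i = k                          # continue with the clauses before point k
    sentences.reverse()                # collected back to front
    return sentences
```

For example, with four clauses and `back = [0, 0, 0, 0, 2]`, the first lookup gives P1 = 2 and the second gives P2 = 0, so the text is split into two sentences of two clauses each.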
Optionally, the obtaining an optimal sentence-breaking result corresponding to the text to be processed according to a segmentation point obtained based on a preset punctuation mark included in the text to be processed includes:
carrying out sentence-breaking processing on the text to be processed according to segmentation points obtained based on preset punctuations included in the text to be processed so as to obtain a plurality of sentence-breaking results corresponding to the text to be processed;
determining the comprehensive translation quality corresponding to the sentence-breaking result;
and selecting a sentence break result with optimal comprehensive translation quality from the multiple sentence break results corresponding to the text to be processed as the optimal sentence break result corresponding to the text to be processed.
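The exhaustive alternative described in these steps can be sketched as below. The helper names `all_breaks` and `best_break`, the `score` callable, and the use of a sum to obtain the comprehensive translation quality are assumptions made for illustration; the clause list is assumed to be non-empty.

```python
from itertools import product

def all_breaks(clauses):
    """Yield every sentence-breaking result: each gap between clauses is cut or kept."""
    gaps = len(clauses) - 1
    for cuts in product([False, True], repeat=gaps):
        result, current = [], [clauses[0]]
        for clause, cut in zip(clauses[1:], cuts):
            if cut:
                result.append(current)   # close the current sentence at this mark
                current = [clause]
            else:
                current.append(clause)   # keep the clause in the same sentence
        result.append(current)
        yield result

def best_break(clauses, score):
    """Pick the result whose summed per-sentence score is highest (sum is an assumption)."""
    return max(all_breaks(clauses), key=lambda r: sum(score(s) for s in r))
```

The number of candidate results doubles with each additional preset punctuation mark, which is one reason a dynamic programming variant may be preferable for longer texts.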
Optionally, the preset punctuation mark comprises: commas and/or semicolons.
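As a rough illustration of how preset punctuation marks yield a clause sequence, the sketch below splits a text after each comma or semicolon. The function name `split_clauses` and the decision to accept both ASCII and full-width marks and to keep each mark attached to its clause are assumptions.

```python
import re

# Preset marks: ASCII and full-width comma and semicolon (assumed set).
_CLAUSE = re.compile(r"[^,;，；]+[,;，；]?")

def split_clauses(text):
    """Split text into clauses, keeping each preset mark with the clause it ends."""
    return [c.strip() for c in _CLAUSE.findall(text) if c.strip()]
```

Each boundary between two consecutive clauses in the returned list is then a candidate segmentation point.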
In another aspect, the present invention discloses a processing apparatus comprising:
the text to be processed acquisition module is used for acquiring a text to be processed;
the optimal sentence-breaking result acquisition module is used for acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed; the comprehensive translation quality of the optimal sentence-breaking result is optimal, and the optimal sentence-breaking result comprises: at least one sentence, wherein the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in the sentence-breaking result; and
the optimal sentence-breaking result output module is used for outputting the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, the optimal sentence-break result obtaining module includes:
and the dynamic programming obtaining submodule is used for obtaining an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed by using a dynamic programming algorithm.
Optionally, the dynamic programming acquisition sub-module includes:
the clause sequence set determining unit is used for determining a clause sequence set corresponding to the text to be processed according to preset punctuation marks contained in the text to be processed;
a recursion unit, configured to determine, in a recursive manner and in order of the subsets of the clause sequence set from small to large, a backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset; and
the optimal sentence-breaking result acquisition unit is used for acquiring an optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set.
Optionally, a subset of the clause sequence set comprises: the first i clauses of the text to be processed, the optimal subset comprehensive translation quality score corresponding to the first i clauses is denoted F(i), and i is greater than or equal to 0 and less than or equal to the number M of clauses of the text to be processed; the recursion unit includes:
the subset sentence-breaking subunit is used for segmenting the first i clauses at a segmentation point k to obtain, for the first i clauses and the segmentation point k, the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit; wherein the first semantic unit comprises: the clauses among the first i clauses that precede the segmentation point k, the second semantic unit comprises: the clauses among the first i clauses that follow the segmentation point k, and k is greater than or equal to 0 and less than i;
the quality synthesis subunit is used for combining F(k) and the translation quality score of the second semantic unit to obtain a comprehensive translation quality score corresponding to the first i clauses and the segmentation point k;
a target segmentation point obtaining subunit, configured to obtain, according to the comprehensive translation quality scores corresponding to the first i clauses and each segmentation point k, a target segmentation point corresponding to the optimal comprehensive translation quality score from the at least one segmentation point k corresponding to the first i clauses;
and the backtracking segmentation point acquisition subunit is used for taking the target segmentation point as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and taking the comprehensive translation quality score corresponding to the target segmentation point as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
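The work of these subunits can be summarized as a recurrence. The notation below is partly hypothetical: q(k, i) stands for the translation quality score of the second semantic unit (the clauses among the first i clauses that follow segmentation point k), ⊕ stands for the unspecified operation that combines the two scores, P(i) stands for the backtracking segmentation point recorded for the first i clauses, and the base case F(0) = 0 is an assumption.

```latex
\begin{aligned}
F(0) &= 0 \quad \text{(assumed base case)} \\
F(i) &= \max_{0 \le k < i} \bigl[\, F(k) \oplus q(k, i) \,\bigr], \qquad 1 \le i \le M \\
P(i) &= \operatorname*{arg\,max}_{0 \le k < i} \bigl[\, F(k) \oplus q(k, i) \,\bigr]
\end{aligned}
```

Here "max" assumes that a larger score means better translation quality; if the score were a cost, the same recurrence would use a minimum instead.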
Optionally, the optimal sentence-break result obtaining unit includes:
a backtracking subunit, configured to backtrack through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set, so as to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the maximum subset of the clause sequence set;
and the backtracking sentence-breaking subunit is used for segmenting the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the maximum subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, the backtracking subunit includes:
the first backtracking unit is used for acquiring a first backtracking segmentation point P1 corresponding to the first i clauses;
and the second backtracking unit is used for acquiring a second backtracking segmentation point P2 corresponding to a clause which is included in the text to be processed and is positioned before the first backtracking segmentation point P1.
Optionally, the optimal sentence-break result obtaining module includes:
the exhaustion submodule is used for carrying out sentence breaking processing on the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed so as to obtain a plurality of sentence breaking results corresponding to the text to be processed;
the comprehensive quality determining submodule is used for determining the comprehensive translation quality corresponding to the sentence-breaking result;
and the result selection submodule is used for selecting the sentence breaking result with the optimal comprehensive translation quality from the multiple sentence breaking results corresponding to the text to be processed as the optimal sentence breaking result corresponding to the text to be processed.
Optionally, the preset punctuation mark comprises: commas and/or semicolons.
In yet another aspect, the present invention discloses an apparatus for processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring a text to be processed;
acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed; the comprehensive translation quality of the optimal sentence-breaking result is optimal, and the optimal sentence-breaking result comprises: at least one sentence, wherein the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in the sentence breaking result;
and outputting the optimal sentence-breaking result corresponding to the text to be processed.
The embodiment of the invention has the following advantages:
The embodiment of the invention obtains an optimal sentence-breaking result corresponding to a text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed. The comprehensive translation quality of the optimal sentence-breaking result is optimal, where the optimal sentence-breaking result may include at least one sentence and the comprehensive translation quality may be the comprehensive translation quality corresponding to all sentences contained in a sentence-breaking result. The optimal sentence-breaking result of the embodiment of the invention can therefore achieve global optimization of the comprehensive translation quality, and can improve the translation quality of the sentence-breaking result corresponding to the text to be processed.
Drawings
FIG. 1 is a schematic diagram of an exemplary configuration of a processing system in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of one embodiment of a processing method of the present invention;
FIG. 3 is a diagram illustrating a path planning for a pending text according to an embodiment of the present invention;
FIG. 4 is a block diagram of a processing device according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating an apparatus for processing as a terminal in accordance with an example embodiment; and
FIG. 6 is a block diagram illustrating an apparatus for processing as a server in accordance with an example embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The embodiment of the invention provides a processing scheme, which can obtain an optimal sentence-breaking result corresponding to a text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed. The comprehensive translation quality of the optimal sentence-breaking result is optimal, where the optimal sentence-breaking result may include at least one sentence and the comprehensive translation quality may be the comprehensive translation quality corresponding to all sentences contained in a sentence-breaking result. The optimal sentence-breaking result can therefore achieve global optimization of the comprehensive translation quality, where "global" refers to the entirety of the sentence-breaking result corresponding to the text to be processed, and can thus improve the translation quality of the sentence-breaking result corresponding to the text to be processed.
The embodiment of the invention can be applied to any scenario that requires sentence breaking and machine translation, such as machine translation, speech recognition, and information services; it can be understood that the embodiment of the invention does not limit the specific application scenario.
For example, referring to fig. 1, an exemplary structural diagram of a processing system according to an embodiment of the present invention is shown, which may specifically include: processing means 101, machine translation means 102 and translation result output means 103. The processing device 101, the machine translation device 102, and the translation result output device 103 may be independent servers, or may be disposed in the same server together, that is, the specific positions of the processing device 101, the machine translation device 102, and the translation result output device 103 are not limited in the embodiment of the present invention.
The processing device 101 may obtain a text to be processed; perform sentence-breaking processing on the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed, so as to obtain an optimal sentence-breaking result corresponding to the text to be processed; and output the optimal sentence-breaking result corresponding to the text to be processed to the machine translation device 102.
Alternatively, the processing device 101 may obtain the text to be processed according to the voice signal of the speaking user. In this case, the processing device 101 may convert the voice signal of the speaking user into text information and acquire the text to be processed from the text information. In practical applications, the speaking user may include: a user who speaks and sends a voice signal in the simultaneous interpretation scene, and/or a user who generates a voice signal through a terminal, etc. can receive the voice signal of the speaking user through a microphone or other voice acquisition devices.
Alternatively, the processing device 101 may employ speech recognition techniques to convert the speech signal of a speaking user into text information. If the speech signal of the speaking user is denoted as S, a series of processing steps applied to S yields the corresponding speech feature sequence $O = \{O_1, O_2, \ldots, O_i, \ldots, O_T\}$, where $O_i$ is the i-th speech feature and T is the total number of speech features. The sentence corresponding to the speech signal S can be regarded as a word string composed of many words, denoted $W = \{w_1, w_2, \ldots, w_n\}$. The process of speech recognition is to find the most likely word string W based on the known speech feature sequence O.
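Finding the most likely word string given the speech feature sequence is commonly written as a maximum a posteriori problem. The decomposition by Bayes' rule shown below is the conventional statistical formulation and is an assumption here; the text itself only states that the most likely W is sought.

```latex
W^{*} = \operatorname*{arg\,max}_{W} P(W \mid O)
      = \operatorname*{arg\,max}_{W} P(O \mid W)\, P(W)
```

In that conventional reading, P(O | W) is supplied by an acoustic model and P(W) by a language model; P(O) is dropped because it does not depend on W.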
Specifically, the speech recognition is a model matching process, in which a speech model is first established according to the speech characteristics of a person, and a template required for the speech recognition is established by extracting required features through analysis of an input speech signal; the process of recognizing the voice input by the user is a process of comparing the characteristics of the voice input by the user with the template, and finally determining the best template matched with the voice input by the user so as to obtain a voice recognition result. The specific speech recognition algorithm may adopt a training and recognition algorithm based on a statistical hidden markov model, or may adopt other algorithms such as a training and recognition algorithm based on a neural network, a recognition algorithm based on dynamic time warping matching, and the like.
Or, alternatively, the processing device 101 may obtain the text to be processed according to the text input by the user. For example, text input by a user in a scenario of instant messaging, office documents, and the like can be used as a source of the text to be processed.
In practical applications, the processing device 101 may obtain the text to be processed from the text corresponding to the voice signal or the text input by the user according to practical application requirements. Optionally, the text to be processed may be obtained from the text corresponding to the voice signal S according to the interval time of the voice signal S; for example, when the interval time of the voice signal S is greater than the time threshold, a corresponding first demarcation point may be determined according to the time point, a text corresponding to the voice signal S before the first demarcation point is used as a text to be processed, and a text corresponding to the voice signal S after the first demarcation point is processed to continue to obtain the text to be processed therefrom. Optionally, the text to be processed may be obtained from the text corresponding to the voice signal or the text input by the user according to the number of words included in the text corresponding to the voice signal or the text input by the user; for example, when the text corresponding to the voice signal or the text input by the user includes a number of words greater than a word number threshold, the corresponding second demarcation point may be determined according to the word number threshold, the text corresponding to the voice signal S before the second demarcation point may be used as the text to be processed, and the text corresponding to the voice signal S after the second demarcation point may be processed to continue to obtain the text to be processed therefrom.
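The interval-time criterion can be illustrated with a short sketch. The input format (a list of recognized words with start and end times), the function name `segment_by_pause`, and the default threshold value are all hypothetical; the text only says that a demarcation point is placed where the interval exceeds a time threshold.

```python
def segment_by_pause(words, max_gap=0.8):
    """Split recognized words into texts to be processed at long pauses.

    words   : list of (token, start_time, end_time) tuples, times in seconds
              (an assumed format for recognizer output)
    max_gap : time threshold in seconds (arbitrary illustrative value)
    """
    segments, current = [], []
    prev_end = None
    for token, start, end in words:
        # A silence longer than the threshold marks a first demarcation point.
        if prev_end is not None and start - prev_end > max_gap and current:
            segments.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        segments.append(current)
    return segments
```

A word-count criterion could be handled the same way by closing the current segment once its length exceeds the word number threshold.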
In the embodiment of the invention, the sentence is a grammatical unit which is formed by words or phrases according to a certain grammatical rule, expresses relatively complete meaning and has obvious tone and intonation. Alternatively, the sentence may include: single sentences and/or compound sentences. The single sentence is a sentence formed by phrases or single words, which independently expresses a relatively complete meaning and has a certain tone of voice, such as "the students have returned to school", "he is very healthy", and the like. The relatively independent single sentence form in the compound sentence is called as a clause, pauses are generally arranged between the clauses, and commas or semicolons are used for representation in writing; the clauses and the clauses have certain relation in meaning, and are often connected by some related words (conjunctions, adverbs or phrases with related functions), such as 'the wish of billions of Chinese people' and the like.
Alternatively, the processing device 101 may insert corresponding preset punctuation marks into the text information corresponding to the speech signal of the speaking user according to the interval time of the speech signal S and a language model. Optionally, the inserted preset punctuation marks may be used to identify pauses between clauses within a sentence, and may include, but are not limited to: the comma, the pause mark (enumeration comma), the semicolon, and the like.
The processing device 101 obtains an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed. Specifically, in the embodiment of the present invention, each preset punctuation mark contained in the text to be processed may or may not be used as a segmentation point for sentence-breaking processing, so that one text to be processed corresponds to multiple sentence-breaking schemes and to the sentence-breaking results of those schemes.
In an application example of the present invention, assume that each of the 2 commas contained in the text to be processed [A, B, C] may or may not be a segmentation point for sentence-breaking processing, so that the corresponding sentence-breaking results may include { (A, B, C) }, { (A), (B), (C) }, { (A, B), (C) }, and so on; the embodiment of the present invention can then obtain, from these, the sentence-breaking result with the optimal comprehensive translation quality. Here [ ] represents the text to be processed, ( ) represents a sentence obtained by sentence breaking, and { } represents a sentence-breaking result.
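The candidates in this example can be listed mechanically by treating each comma as a cut-or-keep choice. The sketch below is illustrative only; the function name `enumerate_breaks` and the bitmask encoding are assumptions.

```python
def enumerate_breaks(clauses):
    """List every sentence-breaking result for a non-empty clause list.

    Bit j of mask decides whether the mark after clause j is a segmentation point.
    """
    results = []
    gaps = len(clauses) - 1
    for mask in range(1 << gaps):
        sentences, current = [], [clauses[0]]
        for j in range(gaps):
            if (mask >> j) & 1:            # mark j is used as a segmentation point
                sentences.append(tuple(current))
                current = []
            current.append(clauses[j + 1])
        sentences.append(tuple(current))
        results.append(sentences)
    return results
```

For the text [A, B, C] with 2 commas this produces 4 results: the three named in the example plus { (A), (B, C) }.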
The machine translation device 102 may receive the optimal sentence-breaking result corresponding to the text to be processed from the processing device 101 and translate it into text in a target language. The machine translation device 102 may use machine translation technology, which uses a computer to convert a sentence in one natural language (the source language) into text in another natural language (the target language); for example, the source language and the target language may be Chinese and English, respectively, or English and Chinese, respectively. Optionally, the types of the machine translation device 102 may include: statistical types and/or neural network types, etc.; it will be understood that embodiments of the present invention are not limited to a particular type of machine translation device 102.
The translation result output device 103 may receive the text in the target language from the machine translation device 102 and output it, and the corresponding output mode may include: a voice mode and/or an interface mode, etc. For example, in a simultaneous interpretation scenario, the text in the target language may be converted into speech in the target language and output. Optionally, the text in the target language may be converted into speech in the target language by using a text-to-speech technology (e.g., a speech synthesis technology), and the speech in the target language may be output through a speech playing device such as an earphone or a speaker. It can be understood that the embodiment of the present invention does not limit the specific process of converting the text in the target language into speech in the target language and outputting the speech. As another example, in an information service scenario (e.g., a translation website or a translation APP), the text in the target language obtained by the machine translation device 102 may be output directly; for example, the text in the target language is displayed on a display device such as a screen for the user to view.
It can be understood that the processing system shown in fig. 1 is only an example; in fact, the processing device 101 may output the optimal sentence-breaking result corresponding to the text to be processed to devices other than the machine translation device 102, and the embodiment of the present invention is not limited to a specific processing system.
Method embodiment
Referring to fig. 2, a flowchart of an embodiment of a processing method according to the present invention is shown, which may specifically include the following steps:
step 201, acquiring a text to be processed;
step 202, obtaining an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed; the comprehensive translation quality of the optimal sentence-breaking result is optimal, and the optimal sentence-breaking result may include: at least one sentence, wherein the comprehensive translation quality can be the comprehensive translation quality corresponding to all sentences contained in the sentence-breaking result;
step 203, outputting the optimal sentence-breaking result corresponding to the text to be processed.
The processing method provided by the embodiment of the invention can be applied to the application environment of computing equipment such as a terminal or a server. Optionally, the terminal may include, but is not limited to: smart phones, tablets, laptop portable computers, in-vehicle computers, desktop computers, smart televisions, wearable devices, and the like. The server can be a cloud server or a common server and is used for providing a processing service of the text to be processed for the client.
The processing method provided by the embodiment of the invention can be suitable for processing Chinese, Japanese, Korean and other languages, and is used for improving the translation quality of the sentence-breaking result corresponding to the text to be processed. It will be appreciated that any language requiring sentence breaks is within the scope of applicability of the processing method of embodiments of the present invention.
In the embodiment of the present invention, the text to be processed may be used to represent text that needs to be processed, and the text to be processed may be derived from text or voice input by a user through a computing device, or may be derived from other computing devices. It should be noted that the text to be processed may include one language or more than one language; for example, the text to be processed may include Chinese, or may include a mixture of Chinese and other languages such as English, and the embodiment of the present invention does not limit the specific text to be processed.
In practical applications, the computing device according to the embodiment of the present invention may execute the processing method flow according to the embodiment of the present invention through a client APP (Application), and the client APP may run on the computing device, for example, the client APP may be any APP running on a terminal, and the client APP may obtain a text to be processed from other applications of the computing device. Alternatively, the computing device in the embodiment of the present invention may execute the processing method flow in the embodiment of the present invention through a function device of the client application, and then the function device may obtain the text to be processed from another function device. Alternatively, the computing device of the embodiment of the present invention may be used as a server to execute the processing method of the embodiment of the present invention.
In an optional embodiment of the present invention, the method of the embodiment of the present invention may further include: writing the at least one text to be processed acquired in step 201 into a cache region; step 202 may then first read a text to be processed from the cache region, and obtain the optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on the preset punctuation marks contained in the read text to be processed. Optionally, a data structure such as a queue, an array, or a linked list may be established in a memory area of the computing device as the cache region, and the specific cache region is not limited in the embodiment of the present invention. Storing the text to be processed in a cache region can improve the processing efficiency of the text to be processed; it can be understood that storing the text to be processed on a disk is also feasible, and the embodiment of the present invention does not limit the specific storage manner of the text to be processed.
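As a non-limiting illustration of the cache region described above, the following minimal sketch uses an in-memory queue. The class name TextCache and its methods write and read are hypothetical names introduced only for this example and are not part of the embodiment.

```python
from collections import deque


class TextCache:
    """Hypothetical in-memory cache region backed by a FIFO queue."""

    def __init__(self):
        self._queue = deque()

    def write(self, text):
        # Step 201 side: store an acquired text to be processed.
        self._queue.append(text)

    def read(self):
        # Step 202 side: take the oldest text, or None if the cache is empty.
        return self._queue.popleft() if self._queue else None


cache = TextCache()
cache.write("A, B, C")
cache.write("D, E")
first = cache.read()
second = cache.read()
third = cache.read()
print(first, second, third)
```

A queue preserves the order in which the texts were acquired; an array or a linked list, which the description also mentions, would serve equally well.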
In the embodiment of the present invention, each preset punctuation mark contained in the text to be processed may or may not be used as a segmentation point for sentence-breaking processing; that is, the text to be processed may be broken according to each combination of preset punctuation marks that are or are not used as segmentation points, so that one text to be processed corresponds to multiple sentence-breaking schemes and to the sentence-breaking results corresponding to those schemes, from which the embodiment of the present invention finally obtains the sentence-breaking result with the optimal comprehensive translation quality.
The embodiment of the present invention may provide the following optimal result acquisition schemes for obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the segmentation points obtained based on the preset punctuation marks contained in the text to be processed:
Optimal result acquisition scheme 1
The optimal result acquisition scheme 1 may include: carrying out sentence breaking processing on the text to be processed according to segmentation points obtained based on preset punctuations included in the text to be processed so as to obtain a plurality of sentence breaking results corresponding to the text to be processed; determining the comprehensive translation quality corresponding to the sentence-breaking result; and selecting the sentence break result with the optimal comprehensive translation quality from the multiple sentence break results corresponding to the text to be processed as the optimal sentence break result corresponding to the text to be processed.
In practical application, a path planning algorithm may be adopted to perform sentence-breaking processing on the text to be processed according to the segmentation points obtained based on the preset punctuation marks contained in the text to be processed, so as to obtain multiple paths corresponding to the text to be processed and the sentence-breaking result corresponding to each path. The principle of a path planning algorithm is to find, in an environment with obstacles, a collision-free path from an initial state to a target state according to a certain evaluation criterion. Specifically, in the embodiment of the present invention, the obstacles may represent the segmentation points corresponding to the text to be processed, and the initial state and the target state may represent the first clause and the last clause of the text to be processed, respectively.
Referring to fig. 3, a schematic diagram of path planning for a text to be processed according to an embodiment of the present invention is shown, where the text to be processed is [A, B, C] and each of the 2 commas it contains may or may not be a segmentation point for sentence-breaking processing. In fig. 3, clauses A, B, and C are represented by rectangles and the commas by circles; when a comma is used as a segmentation point, a hexagon is drawn around the corresponding circle. The sentence-breaking results of [A, B, C] may then include: {(A, B, C)}, corresponding to 0 segmentation points; {(A), (B, C)}, corresponding to the 1st comma as the segmentation point; {(A), (B), (C)}, corresponding to the 1st and 2nd commas as segmentation points; and {(A, B), (C)}, corresponding to the 2nd comma as the segmentation point.
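As a non-limiting illustration, the enumeration of sentence-breaking results for [A, B, C] may be sketched as follows. The sketch does not implement a path planning algorithm; it simply tries every combination of the preset punctuation marks being or not being segmentation points. The function name enumerate_sentence_breaks is hypothetical, and the clause list is assumed to be non-empty.

```python
from itertools import product


def enumerate_sentence_breaks(clauses):
    """Return every sentence-breaking result for a non-empty list of clauses.

    Each of the len(clauses) - 1 preset punctuation marks either is or is not
    a segmentation point, which gives 2 ** (len(clauses) - 1) results.
    """
    results = []
    for choice in product([False, True], repeat=len(clauses) - 1):
        sentences, current = [], [clauses[0]]
        for clause, is_split in zip(clauses[1:], choice):
            if is_split:
                # The punctuation mark before this clause is a segmentation point.
                sentences.append(tuple(current))
                current = [clause]
            else:
                current.append(clause)
        sentences.append(tuple(current))
        results.append(sentences)
    return results


all_results = enumerate_sentence_breaks(["A", "B", "C"])
for result in all_results:
    print(result)
```

For the 2 commas of [A, B, C] this yields the four sentence-breaking results listed above.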
It can be understood that the path planning algorithm is only an optional embodiment of the present invention, and actually, a person skilled in the art may obtain the multiple sentence-breaking results corresponding to the text to be processed by using other algorithms according to actual application requirements.
In an optional embodiment of the present invention, the determining the comprehensive translation quality corresponding to the sentence break result may include: determining a corresponding translation quality score aiming at sentences contained in each sentence break result; fusing translation quality scores corresponding to all sentences contained in each sentence break result to obtain corresponding comprehensive translation quality scores; the sentence-break result with the highest comprehensive translation quality score can be obtained from all the sentence-break results and used as the optimal sentence-break result corresponding to the text to be processed.
Optionally, a machine translation evaluation method may be employed to determine the translation quality score corresponding to each sentence contained in each sentence-breaking result. The machine translation evaluation method may include: an automatic evaluation method and/or a manual evaluation method. The automatic evaluation method may acquire an evaluation set (including source-language input sentences and reference translations) in advance, and may calculate the translation quality score corresponding to a sentence according to the N-grams shared by the machine translation result of the sentence and the reference translation (an N-gram is a sequence of N consecutive words; for example, a two-word sequence is a bigram and a three-word sequence is a trigram). It can be understood that any machine translation evaluation method is feasible, and the embodiment of the present invention does not limit the specific process of determining the translation quality score corresponding to the sentences contained in each sentence-breaking result.
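As a non-limiting illustration of scoring by N-gram overlap, the following sketch computes a clipped N-gram precision between a machine translation result and a reference translation. It is a simplified stand-in and not a complete evaluation metric; the function names and the two example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped so that a repeated candidate n-gram is credited at
    most as often as it appears in the reference.
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical machine translation result and reference translation.
candidate = "I love red apples".split()
reference = "I love apples".split()
unigram = ngram_precision(candidate, reference, 1)
bigram = ngram_precision(candidate, reference, 2)
print(unigram, bigram)
```

Here 3 of the 4 candidate unigrams and 1 of the 3 candidate bigrams appear in the reference.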
Optionally, the process of fusing the translation quality scores corresponding to all sentences contained in each sentence-breaking result may include: summing, multiplying, or taking a weighted average of the translation quality scores corresponding to all sentences contained in each sentence-breaking result, etc. It can be understood that the embodiment of the present invention does not limit the specific process of fusing the translation quality scores corresponding to all sentences contained in each sentence-breaking result.
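As a non-limiting illustration of optimal result acquisition scheme 1, the following sketch fuses per-sentence scores and selects the sentence-breaking result with the highest comprehensive score. The function names fuse_scores and select_optimal are hypothetical; the scores are the ones assumed for [A, B, C] in the worked example of this description, rewritten here as a lookup keyed by sentence.

```python
import math


def fuse_scores(scores, mode="sum", weights=None):
    """Fuse per-sentence translation quality scores into one comprehensive score."""
    if mode == "sum":
        return sum(scores)
    if mode == "product":
        return math.prod(scores)
    if mode == "weighted_average":
        weights = weights or [1.0] * len(scores)
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    raise ValueError(f"unknown fusion mode: {mode}")


def select_optimal(results, score_sentence, mode="sum"):
    """Return the sentence-breaking result with the highest fused score."""
    return max(
        results,
        key=lambda result: fuse_scores([score_sentence(s) for s in result], mode),
    )


# Per-sentence scores assumed in the worked example for [A, B, C].
example_scores = {
    ("A",): -10, ("B",): -15, ("C",): -20,
    ("A", "B"): -2, ("B", "C"): -5, ("A", "B", "C"): -30,
}
candidates = [
    [("A", "B", "C")],
    [("A",), ("B", "C")],
    [("A", "B"), ("C",)],
    [("A",), ("B",), ("C",)],
]
best = select_optimal(candidates, example_scores.__getitem__)
print(best)
```

With summation the four candidates score -30, -15, -22 and -45, so {(A), (B, C)} is selected. Multiplication is only meaningful for non-negative scores such as probabilities; for negative scores like these, summation is the natural choice.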
Optimal result acquisition scheme 2
The optimal result acquisition scheme 2 may include: obtaining, by using a dynamic programming algorithm, the optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on the preset punctuation marks contained in the text to be processed.
The principle of the dynamic programming algorithm is to split a problem and define the problem states and the relationships between the states, so that the problem can be solved in a recursive (or divide-and-conquer) manner. Specifically, in the embodiment of the present invention, the problem may be to optimize the comprehensive translation quality of the sentence-breaking result corresponding to the text to be processed, and a state may be the optimal comprehensive translation quality of the sentence-breaking result corresponding to each subset of the clause sequence set corresponding to the text to be processed. Whereas optimal result acquisition scheme 1 exhaustively enumerates the sentence-breaking results corresponding to the text to be processed and determines the comprehensive translation quality of each of them, the dynamic programming algorithm adopted by optimal result acquisition scheme 2 can reduce the amount of computation, and the reduction grows as the number of preset punctuation marks contained in the text to be processed increases.
Optionally, obtaining, by using a dynamic programming algorithm, the optimal sentence-breaking result corresponding to the text to be processed according to the segmentation points obtained based on the preset punctuation marks contained in the text to be processed may specifically include: determining a clause sequence set corresponding to the text to be processed according to the preset punctuation marks contained in the text to be processed; determining, in a recursive manner and in order of the subsets of the clause sequence set from small to large, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset; and obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set.
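As a non-limiting illustration of the reduction in computation, the following sketch compares two counts for a text of M clauses, under the assumptions that every preset punctuation mark is a candidate segmentation point and that each candidate sentence is scored once: the number of sentence-breaking results that exhaustive enumeration has to evaluate, and the number of candidate sentences that the recursion over segmentation points k and prefixes i (0 ≤ k < i ≤ M) has to score. Both function names are hypothetical.

```python
def exhaustive_result_count(num_clauses):
    # Each of the M - 1 punctuation marks is or is not a segmentation point.
    return 2 ** (num_clauses - 1)


def dp_span_count(num_clauses):
    # One candidate sentence per pair (k, i) with 0 <= k < i <= M.
    return num_clauses * (num_clauses + 1) // 2


for m in (3, 5, 10, 20):
    print(m, exhaustive_result_count(m), dp_span_count(m))
```

The first count grows exponentially in M while the second grows quadratically, which is why the gap widens as the text contains more preset punctuation marks.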
The clause sequence set may be used to represent a set of sequences formed by consecutive clauses contained in the text to be processed. Optionally, each clause sequence in the clause sequence set may be formed by the first i consecutive clauses of the text to be processed. For example, the clause sequence set corresponding to the text to be processed [C_1 C_2 … C_M] may include: {C_1, C_1 C_2, C_1 C_2 C_3, …, C_1 C_2 … C_M}, and the subsets contained in the clause sequence set may be expressed, in order of sequence length (that is, the number of clauses contained in the sequence) from small to large, as: {C_1}, {C_1 C_2}, {C_1 C_2 C_3}, …, {C_1 C_2 … C_M}. Adjacent clauses in the clause sequence corresponding to a subset may be connected by a preset punctuation mark. Optionally, a subset in the embodiment of the present invention may comprise one clause sequence, where C_i represents the i-th clause contained in the text to be processed, i is a positive integer not greater than M, M represents the number of clauses of the text to be processed, and M is a positive integer.
For each subset of the clause sequence set, each corresponding subset sentence-breaking result also has a comprehensive translation quality, so the embodiment of the present invention can determine the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset. The backtracking segmentation point of the optimal subset sentence-breaking result may be used to represent at which preset punctuation mark the subset is segmented, or broken, when it attains its optimal subset sentence-breaking result. Suppose the optimal subset sentence-breaking result corresponding to the subset {C_1 C_2 C_3} is {(C_1), (C_2 C_3)}; this indicates that the subset {C_1 C_2 C_3} is segmented, or broken, after "C_1", and the corresponding backtracking segmentation point may be represented by the number 1 of "C_1". It can be understood that the embodiment of the present invention does not limit the specific representation of the backtracking segmentation point.
The embodiment of the present invention may determine, in a recursive manner and in order of the subsets of the clause sequence set from small to large, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset. Assume the subsets are expressed as G_1, G_2, G_3, …, G_u, where u is a positive integer; the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to G_1, G_2, G_3, …, G_u can then be obtained in sequence. Moreover, for G_o (1 ≤ o ≤ u), the subsets before G_o (for example, G_{o-1}, G_{o-2}, etc.) are needed to determine the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to G_o.
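As a non-limiting illustration, the clause sequence set may be built by splitting the text at the preset punctuation marks and taking every prefix of the resulting clause list. The particular set of preset punctuation marks below (half-width and full-width commas and semicolons) and the function names are assumptions made only for this sketch.

```python
import re

# Assumed preset punctuation marks; the embodiment does not fix this set.
PRESET_PUNCTUATION = ",，;；"


def split_clauses(text, marks=PRESET_PUNCTUATION):
    """Split the text to be processed into clauses at the preset punctuation marks."""
    parts = re.split("[" + re.escape(marks) + "]", text)
    return [part.strip() for part in parts if part.strip()]


def clause_sequence_set(clauses):
    """Return the prefixes C_1, C_1 C_2, ..., C_1 ... C_M, from small to large."""
    return [clauses[:i] for i in range(1, len(clauses) + 1)]


clauses = split_clauses("A, B, C")
prefixes = clause_sequence_set(clauses)
print(prefixes)
```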
In an optional embodiment of the present invention, a subset of the clause sequence set may include the first i clauses of the text to be processed, and the optimal subset comprehensive translation quality score corresponding to the first i clauses is represented as F(i), where 0 ≤ i ≤ M and M is the number of clauses of the text to be processed. Determining, in a recursive manner and in order of the subsets of the clause sequence set from small to large, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset may then specifically include:
segmenting the first i clauses at a segmentation point k to obtain the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit, both corresponding to the first i clauses and the segmentation point k; wherein the first semantic unit may include the clauses among the first i clauses that are located before the segmentation point k, the second semantic unit may include the clauses among the first i clauses that are located after the segmentation point k, and 0 ≤ k < i;
combining F(k) and the translation quality score of the second semantic unit to obtain the comprehensive translation quality score corresponding to the first i clauses and the segmentation point k;
obtaining, according to the comprehensive translation quality scores corresponding to the first i clauses and each segmentation point k, a target segmentation point k' corresponding to the optimal comprehensive translation quality score from the at least one segmentation point k corresponding to the first i clauses; in practical applications, there may be one or more segmentation points k and one or more target segmentation points k', and the set of target segmentation points k' is a subset of the set of segmentation points k. For example, if the set of segmentation points k is {0, 1, 2, 3, …}, the set of target segmentation points k' may be {0, 1}, etc.;
and taking the target segmentation point k' as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and taking the comprehensive translation quality score corresponding to the target segmentation point k' as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
In the embodiment of the invention, the semantic units can be used for expressing a unit expressing one meaning, and the first semantic unit and the second semantic unit can express two semantic units obtained by segmenting the first i clauses by using the segmentation point k. In practical application, the first i sentences are segmented by using the segmentation points k, and a first semantic unit which is included in the first i sentences and is positioned in front of the segmentation points k and a second semantic unit which is included in the first i sentences and is positioned behind the segmentation points k are obtained. It can be understood that, in the embodiment of the present invention, the number of clauses included in the first semantic unit and the second semantic unit is not limited, for example, the first semantic unit and the second semantic unit may respectively include one or more clauses.
F(k) may be used to represent the optimal subset comprehensive translation quality score of the first k clauses. In practical applications, an initial value may be preset for F(k); for example, F[0] = 0 for k = 0, and F[k] = -INF (negative infinity) for k greater than 0. Thus, the value of F(0) is obtained by presetting; when k is greater than 0, the initial value of F(k) is obtained by presetting and the final value of F(k) is obtained by iteration, for example by the following formula (1).
Assuming that the optimal subset comprehensive translation quality score corresponding to the first semantic unit is F(k) and the translation quality score of the second semantic unit is NMT_score(k, i), the process of combining F(k) and the translation quality score of the second semantic unit may include: summing, multiplying, or taking a weighted average of F(k) and NMT_score(k, i), etc. It can be understood that the embodiment of the present invention does not limit the specific process of combining F(k) and the translation quality score of the second semantic unit.
In practical applications, for the first i clauses, the segmentation point k may be located at any position satisfying 0 ≤ k < i; for example, for the subset {C_1 C_2 C_3}, the segmentation point k may be 0, 1, or 2. Accordingly, the target segmentation point corresponding to the optimal comprehensive translation quality score can be obtained from the at least one segmentation point k corresponding to the first i clauses according to the comprehensive translation quality score F(i, k) corresponding to the first i clauses and the segmentation point k.
In the embodiment of the present invention, whether a comprehensive translation quality score is optimal may be measured by its magnitude. If F(i, k) = F[k] + NMT_score(k, i), the optimal comprehensive translation quality score corresponding to the first i clauses and the corresponding target segmentation point may be expressed as:
F[i] = max_{0 ≤ k < i} (F[k] + NMT_score(k, i))    (1)
index[i] = argmax_{0 ≤ k < i} (F[k] + NMT_score(k, i))    (2)
index[i] may be used to represent the value of k that maximizes F[k] + NMT_score(k, i). In practical applications, the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses and the corresponding backtracking segmentation point can be solved recursively in order of i from small to large.
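As a non-limiting illustration, the recursion of formulas (1) and (2) may be sketched as follows, using summation to combine F[k] and NMT_score(k, i). The function names are hypothetical, and toy_score is an invented scorer that merely penalizes long sentences so that the example is self-contained; a real system would query a translation quality model instead. When several segmentation points k tie for the maximum, this sketch keeps the smallest one.

```python
import math


def optimal_subset_scores(num_clauses, nmt_score):
    """Compute F[i] and index[i] for i = 0..num_clauses.

    F[i] is the best comprehensive score of the first i clauses, and index[i]
    is the segmentation point k (0 <= k < i) that attains it.
    """
    F = [0.0] + [-math.inf] * num_clauses  # F[0] = 0, F[i > 0] = -INF
    index = [0] * (num_clauses + 1)        # index[0] is unused
    for i in range(1, num_clauses + 1):
        for k in range(i):
            candidate = F[k] + nmt_score(k, i)
            if candidate > F[i]:
                F[i], index[i] = candidate, k
    return F, index


def toy_score(k, i):
    # Hypothetical score of the sentence made of clauses k+1..i.
    return -float((i - k) ** 2)


F, index = optimal_subset_scores(3, toy_score)
print(F)
print(index)
```

With this toy scorer every clause becomes its own sentence, so F is [0.0, -1.0, -2.0, -3.0] and the recorded segmentation points for i = 1, 2, 3 are 0, 1 and 2.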
Optionally, the method of the embodiment of the present invention may further include: recording the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset of the clause sequence set; or recording the mapping relation between the information of each subset of the clause sequence set and the backtracking segmentation point of the corresponding optimal subset sentence-breaking result, so as to obtain corresponding recorded content. The information of a subset of the clause sequence set may include: the number information of the last clause corresponding to the subset, and/or the number information corresponding to the subset, etc. For example, for the first i clauses, the corresponding number information may be i, which also corresponds to the last clause, that is, the i-th clause. It can be understood that the embodiment of the present invention does not limit the specific information of the subsets.
In an optional embodiment of the present invention, obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set may specifically include:
backtracking the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set;
and segmenting the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, backtracking the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set may specifically include:
acquiring first backtracking segmentation points P1 corresponding to the first i clauses;
and acquiring a second backtracking segmentation point P2 corresponding to a clause which is included in the text to be processed and is positioned before the first backtracking segmentation point P1.
In practical application, the backtracking segmentation points may be backtracked in descending order of i. Taking the process of obtaining the backtracking segmentation points corresponding to the first M clauses as an example, the first backtracking segmentation point P1 corresponding to the first M clauses may be determined first, for example by querying it from the recorded content; the first backtracking segmentation point P1 yields the optimal subset sentence-breaking result corresponding to the first M clauses. Then, the second backtracking segmentation point P2 corresponding to the first P1 clauses may be obtained, for example by querying it from the recorded content; the second backtracking segmentation point P2 yields the optimal subset sentence-breaking result corresponding to the first P1 clauses. If the most recently obtained backtracking segmentation point (P1 or P2) is equal to 0, the backtracking can be ended; otherwise, the backtracking can be continued.
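As a non-limiting illustration, the backtracking may be sketched as follows, assuming the recorded content is a list index in which index[i] holds the backtracking segmentation point of the first i clauses. The function names backtrack and cut are hypothetical, and the recorded values are those of the worked example for [A, B, C] in this description.

```python
def backtrack(index, num_clauses):
    """Follow the recorded backtracking segmentation points from M down to 0."""
    points = []
    i = num_clauses
    while i > 0:
        k = index[i]       # P1 for the first M clauses, then P2 for the first P1, ...
        points.append(k)
        i = k              # continue with the first k clauses; stop when k == 0
    points.reverse()
    return points


def cut(clauses, points):
    """Cut the clause list into sentences that start after each recorded point."""
    bounds = points + [len(clauses)]
    return [clauses[a:b] for a, b in zip(bounds, bounds[1:])]


# Recorded content of the worked example: index[1] = 0, index[2] = 0, index[3] = 1.
index = [0, 0, 0, 1]  # index[0] is unused
points = backtrack(index, 3)
sentences = cut(["A", "B", "C"], points)
print(points, sentences)
```

The points come out as [0, 1], so the text is cut into the sentences (A) and (B, C).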
In order to enable those skilled in the art to better understand the segmentation process of the embodiment of the present invention, the process is described here by way of an example that involves processing the text to be processed [A, B, C]; the corresponding process may specifically include the following steps:
step S1, obtaining the clause sequence set {[A], [A, B], [A, B, C]} corresponding to the text to be processed [A, B, C];
assuming that S(i, j) represents the sequence of clauses from the i-th preset punctuation mark to the j-th preset punctuation mark (where position 0 denotes the start of the text and position 3 its end), then S(0,1) = "A", S(1,2) = "B", S(2,3) = "C", S(0,2) = "A, B", S(1,3) = "B, C", and S(0,3) = "A, B, C".
Further assume that the translation quality scores of the sentences corresponding to S(i, j) are:
NMT_score(0,1)=-10
NMT_score(1,2)=-15
NMT_score(2,3)=-20
NMT_score(0,2)=-2
NMT_score(1,3)=-5
NMT_score(0,3)=-30
step S2, using F(i) to represent the optimal subset comprehensive translation quality score corresponding to the first i clauses, where the initial value of F[0] is 0 and the initial value of F[i] for i greater than 0 is -INF (negative infinity);
step S3, when i = 0, the optimal subset comprehensive translation quality score corresponding to the first 0 consecutive clauses is F(0) = 0;
step S4, when i =1, if the corresponding division point k =0, then
F[1]=max(F[0]+NMT_score(0,1))=-10
index[1]=0;
Step S5, when i =2, if the corresponding division point k =0,1, then
F[2]=max(F[0]+NMT_score(0,2),F[1]+NMT_score(1,2))= F[0]+NMT_score(0,2)=-2
index[2]=0;
Step S6, when i =3, if the corresponding division point k =0,1,2, then
F[3]=max(F[0]+NMT_score(0,3),F[1]+NMT_score(1,3),F[2]+NMT_score(2,3))= F[1]+NMT_score(1,3)=-15
index[3]=1;
Step S7, backtracking the backtracking segmentation points corresponding to F(3);
wherein the backtracking segmentation point P1 = 1 corresponding to F(3) may be obtained first, and then the backtracking segmentation point P2 = 0 corresponding to F(1) may be obtained; that is, the text to be processed [A, B, C] may be broken into 2 sentences, and the corresponding backtracking segmentation points are P = 0 and P = 1, meaning that the 2 sentences obtained by segmentation begin after the 0th clause and after the 1st clause, respectively, so that the corresponding optimal sentence-breaking result, "A" and "B, C", can be obtained.
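As a non-limiting check of steps S1 to S7, the following sketch runs the recursion and the backtracking on the scores assumed above and prints the resulting values. The variable names are chosen only for this illustration.

```python
import math

# Translation quality scores assumed in the example, keyed by (k, i).
NMT_SCORE = {
    (0, 1): -10, (1, 2): -15, (2, 3): -20,
    (0, 2): -2, (1, 3): -5, (0, 3): -30,
}
clauses = ["A", "B", "C"]
M = len(clauses)

# Steps S2 to S6: F[0] = 0, F[i > 0] = -INF, then formulas (1) and (2).
F = [0] + [-math.inf] * M
index = [0] * (M + 1)
for i in range(1, M + 1):
    for k in range(i):
        score = F[k] + NMT_SCORE[(k, i)]
        if score > F[i]:
            F[i], index[i] = score, k

# Step S7: backtrack from the first M clauses until point 0 is reached.
sentences, i = [], M
while i > 0:
    sentences.insert(0, clauses[index[i]:i])
    i = index[i]

print(F)          # [0, -10, -2, -15]
print(index)      # [0, 0, 0, 1]
print(sentences)  # [['A'], ['B', 'C']]
```

The printed values agree with F[1] = -10, F[2] = -2, F[3] = -15, index[3] = 1 and the optimal sentence-breaking result "A" and "B, C" derived in the steps above.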
It can be understood that the above text to be processed [A, B, C] is only an optional example; a person skilled in the art can process any text to be processed according to actual application requirements to obtain the corresponding optimal sentence-breaking result.
In summary, the processing method of the embodiment of the present invention obtains the optimal sentence-breaking result corresponding to a text to be processed according to segmentation points obtained based on the preset punctuation marks contained in the text to be processed. The comprehensive translation quality of the optimal sentence-breaking result is optimal, where the optimal sentence-breaking result may include at least one sentence and the comprehensive translation quality may be the comprehensive translation quality corresponding to all sentences contained in a sentence-breaking result. Therefore, the optimal sentence-breaking result of the embodiment of the present invention can achieve global optimization of the comprehensive translation quality and can thereby improve the translation quality of the sentence-breaking result corresponding to the text to be processed.
It should be noted that, for simplicity of description, the method embodiments are described as a series of combined actions, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the embodiments of the present invention. Further, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily required by the embodiments of the present invention.
Device embodiment
Referring to fig. 4, a block diagram of a processing apparatus according to an embodiment of the present invention is shown, which may specifically include:
a to-be-processed text acquiring module 401, configured to acquire a to-be-processed text;
an optimal sentence-breaking result obtaining module 402, configured to obtain an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks included in the text to be processed; the comprehensive translation quality of the optimal sentence-breaking result is optimal, and the optimal sentence-breaking result may include at least one sentence, wherein the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in the optimal sentence-breaking result; and
an optimal sentence-breaking result output module 403, configured to output the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, the optimal sentence-breaking result obtaining module 402 may include:
a dynamic programming acquisition sub-module, configured to obtain, by using a dynamic programming algorithm, the optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks included in the text to be processed.
Optionally, the dynamic programming acquisition sub-module may include:
a clause sequence set determining unit, configured to determine a clause sequence set corresponding to the text to be processed according to preset punctuation marks included in the text to be processed;
a recursion unit, configured to determine, in a recursive manner and in order from the smallest to the largest subset of the clause sequence set, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset; and
an optimal sentence-breaking result obtaining unit, configured to obtain the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set.
Optionally, a subset of the clause sequence set may include the first i clauses of the text to be processed, where the optimal subset comprehensive translation quality score corresponding to the first i clauses is denoted F(i), and i is greater than or equal to 0 and less than or equal to the number M of clauses in the text to be processed; the recursion unit may then include:
a subset sentence-breaking unit, configured to break the first i clauses at a segmentation point k, so as to obtain the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit, both corresponding to the first i clauses and the segmentation point k; wherein the first semantic unit may include the clauses among the first i clauses that are located before the segmentation point k, the second semantic unit may include the clauses among the first i clauses that are located after the segmentation point k, and k is greater than or equal to 0 and less than i;
a quality comprehensive subunit, configured to combine F(k) and the translation quality score of the second semantic unit to obtain the comprehensive translation quality score corresponding to the first i clauses and the segmentation point k;
a target segmentation point obtaining subunit, configured to obtain, according to the comprehensive translation quality scores corresponding to the first i clauses and the segmentation point k, the target segmentation point corresponding to the optimal comprehensive translation quality score from among the at least one segmentation point k corresponding to the first i clauses; and
a backtracking segmentation point obtaining subunit, configured to use the target segmentation point as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and to use the comprehensive translation quality score corresponding to the target segmentation point as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
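The recursion performed by these subunits can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the scores are combined by addition, a higher score is better, and F(0) = 0; this passage says only that F(k) and the score of the second semantic unit are "combined". The names `fill_dp_table`, `toy_score` and `TOY_SCORES` are hypothetical, and the score values are made up.

```python
from typing import Callable, List, Sequence, Tuple


def fill_dp_table(
    clauses: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> Tuple[List[float], List[int]]:
    """Compute F(i) and the backtracking segmentation point for i = 1..M."""
    m = len(clauses)
    F = [0.0] * (m + 1)   # F[0] = 0 is an assumed base case
    back = [0] * (m + 1)  # back[i] = target segmentation point for F(i)
    for i in range(1, m + 1):          # subsets from small to large
        best_total, best_k = float("-inf"), 0
        for k in range(i):             # 0 <= k < i
            # First semantic unit: first k clauses, already scored as F[k].
            # Second semantic unit: clauses k+1..i, scored as one sentence.
            total = F[k] + score(clauses[k:i])  # assumed additive combination
            if total > best_total:
                best_total, best_k = total, k
        F[i], back[i] = best_total, best_k
    return F, back


# Made-up translation quality scores for every candidate sentence of [A, B, C].
TOY_SCORES = {
    ("A",): 9, ("B",): 4, ("C",): 5,
    ("A", "B"): 6, ("B", "C"): 12, ("A", "B", "C"): 10,
}


def toy_score(segment: Sequence[str]) -> float:
    return TOY_SCORES[tuple(segment)]


F, back = fill_dp_table(["A", "B", "C"], toy_score)
print(F, back)
```

With these made-up scores the table yields back[3] = 1 and back[1] = 0, the same backtracking segmentation points as in the [A, B, C] example of the method embodiment. A real system would replace `toy_score` with an actual translation quality estimate.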
Optionally, the optimal sentence-break result obtaining unit may include:
a backtracking subunit, configured to backtrack through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set, so as to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set; and
a backtracking sentence-breaking unit, configured to break the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, the backtracking subunit may include:
a first backtracking unit, configured to obtain a first backtracking segmentation point P1 corresponding to the first i clauses; and
a second backtracking unit, configured to obtain a second backtracking segmentation point P2 corresponding to the clauses of the text to be processed that are located before the first backtracking segmentation point P1.
Optionally, the optimal sentence-breaking result obtaining module 402 may include:
an exhaustion submodule, configured to perform sentence breaking on the text to be processed according to segmentation points obtained based on preset punctuation marks included in the text to be processed, so as to obtain a plurality of sentence-breaking results corresponding to the text to be processed;
a comprehensive quality determining submodule, configured to determine the comprehensive translation quality corresponding to each sentence-breaking result; and
a result selection submodule, configured to select, from the plurality of sentence-breaking results corresponding to the text to be processed, the sentence-breaking result with the optimal comprehensive translation quality as the optimal sentence-breaking result corresponding to the text to be processed.
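The exhaustive alternative can be sketched as below. It assumes that the comprehensive translation quality of a sentence-breaking result is the sum of per-sentence scores and that higher is better, which this passage does not specify; `all_sentence_breaks`, `best_by_exhaustion` and the `score` callable are hypothetical names.

```python
from itertools import combinations
from typing import Callable, Iterator, List, Sequence


def all_sentence_breaks(clauses: Sequence[str]) -> Iterator[List[List[str]]]:
    """Yield every way to break M clauses into consecutive sentences."""
    m = len(clauses)
    for r in range(m):                              # number of interior cuts
        for cuts in combinations(range(1, m), r):   # positions between clauses
            bounds = [0, *cuts, m]
            yield [list(clauses[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_by_exhaustion(
    clauses: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> List[List[str]]:
    """Select the sentence-breaking result with the best summed score."""
    if not clauses:
        return []
    return max(
        all_sentence_breaks(clauses),
        key=lambda result: sum(score(sentence) for sentence in result),
    )
```

For M clauses this enumerates 2^(M-1) sentence-breaking results, so it is practical only for short texts; the dynamic programming variant avoids this growth.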
Optionally, the preset punctuation marks may include: commas and/or semicolons.
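As a small illustration, clauses could be obtained from commas and semicolons roughly as follows. The constant `PRESET_PUNCTUATION` and the function `split_into_clauses` are hypothetical; including the full-width forms "，" and "；" is an assumption, and the sketch discards the punctuation marks themselves, which a real implementation may need to keep.

```python
import re
from typing import List

# Half-width and (assumed) full-width commas and semicolons.
PRESET_PUNCTUATION = ",;，；"


def split_into_clauses(text: str) -> List[str]:
    """Split text at preset punctuation marks and drop empty pieces."""
    pattern = "[" + re.escape(PRESET_PUNCTUATION) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


print(split_into_clauses("A, B; C"))  # ['A', 'B', 'C']
```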
Since the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 5 is a block diagram illustrating an apparatus for processing as a terminal according to an example embodiment. For example, the terminal 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to fig. 5, terminal 900 can include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
Processing component 902 generally controls overall operation of terminal 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
Memory 904 is configured to store various types of data to support operation at terminal 900. Examples of such data include instructions for any application or method operating on terminal 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 906 provides power to the various components of the terminal 900. The power component 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal 900.
The multimedia component 908 includes a screen providing an output interface between the terminal 900 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the terminal 900 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when terminal 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing various aspects of state assessment for the terminal 900. For example, the sensor component 914 can detect an open/closed state of terminal 900, the relative positioning of components, such as a display and keypad of terminal 900, a change in position of terminal 900 or a component of terminal 900, the presence or absence of user contact with terminal 900, an orientation or acceleration/deceleration of terminal 900, and a change in temperature of terminal 900. The sensor component 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
Communication component 916 is configured to facilitate communications between terminal 900 and other devices in a wired or wireless manner. Terminal 900 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as memory 904 comprising instructions, executable by processor 920 of terminal 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
FIG. 6 is a block diagram illustrating an apparatus for processing as a server in accordance with an example embodiment. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a sequence of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided that includes instructions, such as a memory 1932 that includes instructions executable by a processor 1922 of a server 1900 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a processing method, the method comprising: acquiring a text to be processed; acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed, wherein the comprehensive translation quality of the optimal sentence-breaking result is optimal, the optimal sentence-breaking result comprises at least one sentence, and the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in the optimal sentence-breaking result; and outputting the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, the obtaining an optimal sentence-breaking result corresponding to the text to be processed according to a segmentation point obtained based on a preset punctuation mark included in the text to be processed includes: and acquiring an optimal sentence-breaking result corresponding to the text to be processed by utilizing a dynamic programming algorithm according to segmentation points obtained based on preset punctuation marks contained in the text to be processed.
Optionally, the obtaining, by using a dynamic programming algorithm, an optimal sentence break result corresponding to the text to be processed according to a segmentation point obtained based on a preset punctuation mark included in the text to be processed includes:
determining a clause sequence set corresponding to the text to be processed according to preset punctuation marks contained in the text to be processed;
determining, in a recursive manner and in order from the smallest to the largest subset of the clause sequence set, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset, wherein the comprehensive translation quality corresponding to the optimal subset sentence-breaking result is optimal; and
obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set.
Optionally, a subset of the clause sequence set comprises the first i clauses of the text to be processed, the optimal subset comprehensive translation quality score corresponding to the first i clauses is denoted F(i), and i is greater than or equal to 0 and less than or equal to the number M of clauses in the text to be processed; determining, in a recursive manner and in order from the smallest to the largest subset of the clause sequence set, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset then comprises:
breaking the first i clauses at a segmentation point k to obtain the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit, both corresponding to the first i clauses and the segmentation point k; wherein the first semantic unit comprises the clauses among the first i clauses that are located before the segmentation point k, the second semantic unit comprises the clauses among the first i clauses that are located after the segmentation point k, and k is greater than or equal to 0 and less than i;
combining F(k) and the translation quality score of the second semantic unit to obtain the comprehensive translation quality score corresponding to the first i clauses and the segmentation point k;
obtaining, according to the comprehensive translation quality scores corresponding to the first i clauses and the segmentation point k, the target segmentation point corresponding to the optimal comprehensive translation quality score from among the at least one segmentation point k corresponding to the first i clauses; and
taking the target segmentation point as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and taking the comprehensive translation quality score corresponding to the target segmentation point as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
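The recursion in the steps above can be written compactly as follows. The symbols Q(k, i), for the translation quality score of the second semantic unit formed by clauses k+1 through i, the operator ⊕, for the unspecified way of combining the two scores, and P(i), for the backtracking segmentation point of the first i clauses, are notation introduced here for illustration; "opt" stands for whichever of maximum or minimum counts as optimal for the chosen score, and the base value F(0) is not given in this passage.

```latex
F(i) = \operatorname*{opt}_{0 \le k < i} \bigl[\, F(k) \oplus Q(k, i) \,\bigr],
\qquad
P(i) = \operatorname*{arg\,opt}_{0 \le k < i} \bigl[\, F(k) \oplus Q(k, i) \,\bigr],
\qquad 1 \le i \le M
```

The optimal sentence-breaking result for the whole text then corresponds to F(M), and its sentences are recovered by following P(M), P(P(M)), and so on until 0 is reached.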
Optionally, obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set includes: backtracking through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set; and breaking the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
Optionally, backtracking through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set includes: obtaining a first backtracking segmentation point P1 corresponding to the first i clauses; and obtaining a second backtracking segmentation point P2 corresponding to the clauses of the text to be processed that are located before the first backtracking segmentation point P1.
Optionally, the obtaining an optimal sentence break result corresponding to the text to be processed according to a segmentation point obtained based on a preset punctuation mark included in the text to be processed includes: carrying out sentence-breaking processing on the text to be processed according to segmentation points obtained based on preset punctuations included in the text to be processed so as to obtain a plurality of sentence-breaking results corresponding to the text to be processed; determining the comprehensive translation quality corresponding to the sentence-breaking result; and selecting a sentence break result with optimal comprehensive translation quality from the multiple sentence break results corresponding to the text to be processed as the optimal sentence break result corresponding to the text to be processed.
Optionally, the preset punctuation mark comprises: commas and/or semicolons.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes can be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.
The processing method, the processing device, and the apparatus for processing provided by the present invention are described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above examples is only intended to help in understanding the method and the core idea of the present invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (16)

1. A method of processing, comprising:
acquiring a text to be processed;
acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed; the comprehensive translation quality of the optimal sentence-breaking result is optimal, and the optimal sentence-breaking result comprises: at least one sentence, wherein the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in the sentence breaking result;
outputting the optimal sentence-breaking result corresponding to the text to be processed;
wherein obtaining the optimal sentence-breaking result corresponding to the text to be processed includes: determining a clause sequence set corresponding to the text to be processed according to preset punctuation marks contained in the text to be processed; determining, in a recursive manner and in order from the smallest to the largest subset of the clause sequence set, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset, wherein the comprehensive translation quality corresponding to the optimal subset sentence-breaking result is optimal; and obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set;
a subset of the clause sequence set comprises the first i clauses of the text to be processed, the optimal subset comprehensive translation quality score corresponding to the first i clauses is denoted F(i), and i is greater than or equal to 0 and less than or equal to the number M of clauses in the text to be processed; determining, in a recursive manner, the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset then comprises: breaking the first i clauses at a segmentation point k to obtain the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit, both corresponding to the first i clauses and the segmentation point k, wherein the first semantic unit comprises the clauses among the first i clauses that are located before the segmentation point k, the second semantic unit comprises the clauses among the first i clauses that are located after the segmentation point k, and k is greater than or equal to 0 and less than i; combining F(k) and the translation quality score of the second semantic unit to obtain the comprehensive translation quality score corresponding to the first i clauses and the segmentation point k; obtaining, according to the comprehensive translation quality scores corresponding to the first i clauses and the segmentation point k, the target segmentation point corresponding to the optimal comprehensive translation quality score from among the at least one segmentation point k corresponding to the first i clauses; and taking the target segmentation point as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and taking the comprehensive translation quality score corresponding to the target segmentation point as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
2. The method according to claim 1, wherein obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set comprises:
backtracking through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set; and
breaking the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
3. The method of claim 2, wherein backtracking through the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set comprises:
obtaining a first backtracking segmentation point P1 corresponding to the first i clauses; and
obtaining a second backtracking segmentation point P2 corresponding to the clauses of the text to be processed that are located before the first backtracking segmentation point P1.
4. The method according to claim 1, wherein the obtaining of the optimal sentence-break result corresponding to the text to be processed according to the segmentation points obtained based on the preset punctuation marks included in the text to be processed further comprises:
carrying out sentence-breaking processing on the text to be processed according to segmentation points obtained based on preset punctuations included in the text to be processed so as to obtain a plurality of sentence-breaking results corresponding to the text to be processed;
determining the comprehensive translation quality corresponding to the sentence-breaking result;
and selecting a sentence break result with optimal comprehensive translation quality from the multiple sentence break results corresponding to the text to be processed as the optimal sentence break result corresponding to the text to be processed.
5. The method of claim 1 or 4, wherein the preset punctuation marks comprise: commas and/or semicolons.
6. A processing apparatus, comprising:
the text to be processed acquisition module is used for acquiring a text to be processed;
the optimal sentence-breaking result acquisition module is used for acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuations included in the text to be processed; the comprehensive translation quality of the optimal sentence-breaking result is optimal, and the optimal sentence-breaking result comprises: at least one sentence, wherein the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in the sentence breaking result; and
the optimal sentence-breaking result output module is used for outputting the optimal sentence-breaking result corresponding to the text to be processed;
the optimal sentence-breaking result obtaining module comprises:
the sentence sequence set determining unit is used for determining a sentence sequence set corresponding to the text to be processed according to preset punctuations contained in the text to be processed;
a recursion unit, configured to determine, in a recursion manner, a backtracking segmentation point of each subset corresponding to an optimal subset sentence-breaking result according to a sequence from small to large of the subsets of the clause sequence set; and
an optimal sentence break result obtaining unit, configured to obtain an optimal sentence break result corresponding to the text to be processed according to a backtracking segmentation point of an optimal subset sentence break result corresponding to each subset of the sentence segmentation sequence set;
wherein a subset of the clause sequence set comprises the first i clauses of the text to be processed, the optimal subset comprehensive translation quality score corresponding to the first i clauses is denoted F(i), and 0 ≤ i ≤ M, where M is the number of clauses in the text to be processed; and the recursion unit comprises:
a subset sentence-breaking unit, configured to break the first i clauses at a segmentation point k to obtain the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit, both corresponding to the first i clauses and the segmentation point k; wherein the first semantic unit comprises the clauses among the first i clauses that precede the segmentation point k, the second semantic unit comprises the clauses among the first i clauses that follow the segmentation point k, and 0 ≤ k < i;
a quality synthesis subunit, configured to combine F(k) with the translation quality score of the second semantic unit to obtain the comprehensive translation quality score corresponding to the first i clauses and the segmentation point k;
a target segmentation point acquisition subunit, configured to acquire, according to the comprehensive translation quality scores corresponding to the first i clauses and each segmentation point k, the target segmentation point that corresponds to the optimal comprehensive translation quality score from among the at least one segmentation point k corresponding to the first i clauses; and
a backtracking segmentation point acquisition subunit, configured to take the target segmentation point as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and to take the comprehensive translation quality score corresponding to the target segmentation point as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
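The recursion over F(i) recited above can be sketched as a small dynamic program. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not say how F(k) and the score of the second semantic unit are combined, so addition is assumed here; "optimal" is assumed to mean "largest"; F(0) = 0 is an assumed base case; and `score_unit` and `prefer_pairs` are hypothetical stand-ins for a real translation quality scorer.

```python
def best_split_points(clauses, score_unit):
    """Compute F[i] and the backtracking segmentation point back[i] for each prefix.

    F[i] is the best combined score for the first i clauses; back[i] is the
    segmentation point k (0 <= k < i) that achieves it.
    """
    m = len(clauses)
    F = [0.0] * (m + 1)   # assumed base case: F[0] = 0 for the empty prefix
    back = [0] * (m + 1)
    for i in range(1, m + 1):          # subsets from smallest to largest
        best_k, best_score = None, None
        for k in range(i):             # every segmentation point 0 <= k < i
            # First semantic unit: clauses[:k], already summarised by F[k].
            # Second semantic unit: clauses[k:i], scored as one sentence.
            candidate = F[k] + score_unit(clauses[k:i])  # assumed: additive
            if best_score is None or candidate > best_score:
                best_k, best_score = k, candidate
        F[i], back[i] = best_score, best_k
    return F, back


def prefer_pairs(unit):
    """Hypothetical toy scorer: best when a sentence holds exactly two clauses."""
    return -abs(len(unit) - 2)
```

With M clauses the double loop evaluates on the order of M² candidate units, compared with the 2^(M-1) sentence-breaking results that full enumeration would have to score.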
7. The apparatus of claim 6, wherein the optimal sentence-breaking result obtaining unit comprises:
a backtracking subunit, configured to backtrack the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set, so as to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set; and
a backtracking sentence-breaking unit, configured to break the text to be processed into sentences according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
8. The apparatus of claim 7, wherein the backtracking subunit comprises:
a first backtracking unit, configured to acquire a first backtracking segmentation point P1 corresponding to the first i clauses; and
a second backtracking unit, configured to acquire a second backtracking segmentation point P2 corresponding to the clauses of the text to be processed that precede the first backtracking segmentation point P1.
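The backtracking through P1, P2 and so on can be pictured as following a table of stored segmentation points from the full text back to its start. The sketch below assumes such a table, here called `back`, where `back[i]` holds the backtracking segmentation point recorded for the first i clauses and is always smaller than i. The function name and the choice to join clauses with a space are illustrative assumptions.

```python
def backtrack_sentences(clauses, back):
    """Rebuild sentences by following back[] from the full text down to 0.

    Requires 0 <= back[i] < i for every i >= 1, so that the loop terminates.
    """
    spans = []
    i = len(clauses)
    while i > 0:
        k = back[i]          # first pass gives P1 = back[M], next gives P2 = back[P1]
        spans.append((k, i))  # clauses k .. i-1 form one sentence
        i = k
    spans.reverse()           # spans were collected last sentence first
    return [" ".join(clauses[a:b]) for a, b in spans]
```

For example, with four clauses and `back = [0, 0, 0, 0, 2]`, the first lookup gives P1 = 2 and the second gives P2 = 0, so the text is broken into two sentences of two clauses each.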
9. The apparatus of claim 6, wherein the optimal sentence-breaking result acquisition module further comprises:
an exhaustion submodule, configured to perform sentence breaking on the text to be processed according to the segmentation points obtained based on the preset punctuation marks contained in the text to be processed, so as to obtain a plurality of sentence-breaking results corresponding to the text to be processed;
a comprehensive quality determining submodule, configured to determine the comprehensive translation quality corresponding to each sentence-breaking result; and
a result selection submodule, configured to select, from the plurality of sentence-breaking results corresponding to the text to be processed, the sentence-breaking result with the optimal comprehensive translation quality as the optimal sentence-breaking result corresponding to the text to be processed.
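The exhaustive alternative can be sketched by enumerating every subset of the interior segmentation points and keeping the best-scoring result. As before, this is only a sketch: summing per-sentence scores to obtain the comprehensive translation quality is an assumption, and `score_unit` and `prefer_pairs` are hypothetical stand-ins for a real translation quality scorer.

```python
from itertools import combinations


def all_sentence_breaks(clauses):
    """Yield every grouping of consecutive clauses into sentences."""
    m = len(clauses)
    interior = range(1, m)             # candidate segmentation points
    for r in range(m):                 # choose 0 .. m-1 of them
        for cuts in combinations(interior, r):
            bounds = (0, *cuts, m)
            yield [clauses[a:b] for a, b in zip(bounds, bounds[1:])]


def best_by_exhaustion(clauses, score_unit):
    """Return the sentence-breaking result with the highest summed score."""
    if not clauses:
        return []
    return max(
        all_sentence_breaks(clauses),
        key=lambda result: sum(score_unit(unit) for unit in result),
    )


def prefer_pairs(unit):
    """Hypothetical toy scorer: best when a sentence holds exactly two clauses."""
    return -abs(len(unit) - 2)
```

A text with M clauses has 2^(M-1) sentence-breaking results, so this approach is practical only for short texts.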
10. The apparatus of claim 6 or 9, wherein the preset punctuation marks comprise: commas and/or semicolons.
11. An apparatus for processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring a text to be processed;
acquiring an optimal sentence-breaking result corresponding to the text to be processed according to segmentation points obtained based on preset punctuation marks contained in the text to be processed; wherein the comprehensive translation quality of the optimal sentence-breaking result is optimal, the optimal sentence-breaking result comprises at least one sentence, and the comprehensive translation quality is the comprehensive translation quality corresponding to all sentences contained in a sentence-breaking result;
outputting the optimal sentence-breaking result corresponding to the text to be processed;
wherein the acquiring of the optimal sentence-breaking result corresponding to the text to be processed comprises: determining a clause sequence set corresponding to the text to be processed according to the preset punctuation marks contained in the text to be processed; determining recursively, in order from the smallest subset of the clause sequence set to the largest, a backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset, wherein the comprehensive translation quality corresponding to the optimal subset sentence-breaking result is optimal; and obtaining the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set;
wherein a subset of the clause sequence set comprises the first i clauses of the text to be processed, the optimal subset comprehensive translation quality score corresponding to the first i clauses is denoted F(i), and 0 ≤ i ≤ M, where M is the number of clauses in the text to be processed; and the determining recursively, in order from the smallest subset of the clause sequence set to the largest, of the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to each subset comprises:
breaking the first i clauses at a segmentation point k to obtain the optimal subset comprehensive translation quality score F(k) of a first semantic unit and the translation quality score of a second semantic unit, both corresponding to the first i clauses and the segmentation point k; wherein the first semantic unit comprises the clauses among the first i clauses that precede the segmentation point k, the second semantic unit comprises the clauses among the first i clauses that follow the segmentation point k, and 0 ≤ k < i;
combining F(k) with the translation quality score of the second semantic unit to obtain the comprehensive translation quality score corresponding to the first i clauses and the segmentation point k;
acquiring, according to the comprehensive translation quality scores corresponding to the first i clauses and each segmentation point k, the target segmentation point that corresponds to the optimal comprehensive translation quality score from among the at least one segmentation point k corresponding to the first i clauses; and
taking the target segmentation point as the backtracking segmentation point of the optimal subset sentence-breaking result corresponding to the first i clauses, and taking the comprehensive translation quality score corresponding to the target segmentation point as the optimal subset comprehensive translation quality score F(i) corresponding to the first i clauses.
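Written compactly, the recursion recited above might take the following form. This is a sketch under stated assumptions: the claims do not fix how F(k) is combined with the score of the second semantic unit, so a sum is assumed; "optimal" is taken to mean "maximal"; F(0) = 0 is an assumed base case; and the symbols Q(k, i), for the translation quality score of the unit made of clauses k+1 through i, and B(i), for the backtracking segmentation point of the first i clauses, are introduced here for illustration only.

```latex
F(0) = 0, \qquad
F(i) = \max_{0 \le k < i} \bigl[\, F(k) + Q(k, i) \,\bigr], \qquad
B(i) = \arg\max_{0 \le k < i} \bigl[\, F(k) + Q(k, i) \,\bigr], \qquad 1 \le i \le M
```

Under these assumptions, F(M) is the comprehensive translation quality score of the optimal sentence-breaking result, and following B(M), B(B(M)), and so on down to 0 recovers its segmentation points.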
12. The apparatus according to claim 11, wherein the obtaining of the optimal sentence-breaking result corresponding to the text to be processed according to the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set comprises:
backtracking the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set to obtain the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set; and
breaking the text to be processed into sentences according to the backtracking segmentation points of the optimal subset sentence-breaking result corresponding to the largest subset of the clause sequence set, so as to obtain the optimal sentence-breaking result corresponding to the text to be processed.
13. The apparatus of claim 11, wherein the backtracking of the backtracking segmentation points of the optimal subset sentence-breaking results corresponding to the subsets of the clause sequence set comprises:
acquiring a first backtracking segmentation point P1 corresponding to the first i clauses; and
acquiring a second backtracking segmentation point P2 corresponding to the clauses of the text to be processed that precede the first backtracking segmentation point P1.
14. The apparatus according to claim 11, wherein the acquiring of the optimal sentence-breaking result corresponding to the text to be processed according to the segmentation points obtained based on the preset punctuation marks contained in the text to be processed further comprises:
performing sentence breaking on the text to be processed according to the segmentation points obtained based on the preset punctuation marks contained in the text to be processed, so as to obtain a plurality of sentence-breaking results corresponding to the text to be processed;
determining the comprehensive translation quality corresponding to each sentence-breaking result; and
selecting, from the plurality of sentence-breaking results corresponding to the text to be processed, the sentence-breaking result with the optimal comprehensive translation quality as the optimal sentence-breaking result corresponding to the text to be processed.
15. The apparatus of claim 11 or 14, wherein the preset punctuation marks comprise: commas and/or semicolons.
16. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-5.
CN201710157267.5A 2017-03-16 2017-03-16 Processing method and device for processing Active CN108628819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710157267.5A CN108628819B (en) 2017-03-16 2017-03-16 Processing method and device for processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710157267.5A CN108628819B (en) 2017-03-16 2017-03-16 Processing method and device for processing

Publications (2)

Publication Number Publication Date
CN108628819A CN108628819A (en) 2018-10-09
CN108628819B true CN108628819B (en) 2022-09-20

Family

ID=63687489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710157267.5A Active CN108628819B (en) 2017-03-16 2017-03-16 Processing method and device for processing

Country Status (1)

Country Link
CN (1) CN108628819B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing
CN109920406B (en) * 2019-03-28 2021-12-03 国家计算机网络与信息安全管理中心 Dynamic voice recognition method and system based on variable initial position
CN110321532A (en) * 2019-06-06 2019-10-11 数译(成都)信息技术有限公司 Language pre-processes punctuate method, computer equipment and computer readable storage medium
CN111046649A (en) * 2019-11-22 2020-04-21 北京捷通华声科技股份有限公司 Text segmentation method and device
CN114420102B (en) * 2022-01-04 2022-10-14 广州小鹏汽车科技有限公司 Method and device for speech sentence-breaking, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915264A (en) * 2015-05-29 2015-09-16 北京搜狗科技发展有限公司 Input error-correction method and device
CN105912522A (en) * 2016-03-31 2016-08-31 长安大学 Automatic extraction method and extractor of English corpora based on constituent analyses
CN106484681A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 A kind of method generating candidate's translation, device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458681A (en) * 2007-12-10 2009-06-17 株式会社东芝 Voice translation method and voice translation apparatus

Also Published As

Publication number Publication date
CN108628819A (en) 2018-10-09

Similar Documents

Publication Publication Date Title
CN107632980B (en) Voice translation method and device for voice translation
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN108628819B (en) Processing method and device for processing
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN106971723B (en) Voice processing method and device for voice processing
CN107291704B (en) Processing method and device for processing
WO2021128880A1 (en) Speech recognition method, device, and device for speech recognition
CN110210310B (en) Video processing method and device for video processing
CN108628813B (en) Processing method and device for processing
CN107274903B (en) Text processing method and device for text processing
WO2018076450A1 (en) Input method and apparatus, and apparatus for input
CN111368541B (en) Named entity identification method and device
CN108399914B (en) Voice recognition method and device
CN111128183B (en) Speech recognition method, apparatus and medium
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN108304412B (en) Cross-language search method and device for cross-language search
RU2733816C1 (en) Method of processing voice information, apparatus and storage medium
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN108073572B (en) Information processing method and device, simultaneous interpretation system
CN111369978B (en) Data processing method and device for data processing
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN111640452B (en) Data processing method and device for data processing
CN107422872B (en) Input method, input device and input device
CN113343675B (en) Subtitle generation method and device and subtitle generation device
CN111192586A (en) Voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant