WO2019023893A1 - System and method for segmenting a sentence - Google Patents


Publication number
WO2019023893A1
Authority
WO
WIPO (PCT)
Prior art keywords
phrase
sentence
segmentation
evaluation score
score
Application number
PCT/CN2017/095305
Other languages
French (fr)
Inventor
Jie Bai
Xiulin Li
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to EP17920149.6A (EP3642733A4)
Priority to CN201780093452.0A (CN110945514B)
Priority to PCT/CN2017/095305 (WO2019023893A1)
Priority to TW107125631A (TWI676167B)
Publication of WO2019023893A1
Priority to US16/749,956 (US11132506B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • the present disclosure relates to Text-to-Speech (TS) techniques, and more particularly, to segmenting a text sentence.
  • Text-to-Speech techniques can transcribe text information into audio signals.
  • text information such as traffic condition, addresses, or the like may be presented to a user by voice.
  • books and news may be read to the user as well by means of TS techniques.
  • each of the phrases that are included in a sentence contains one or more words.
  • a word can be a word in a Latin-script language such as English, French, or Spanish, or a character in Asian languages such as Chinese, Korean, or Japanese. These words or characters may be segmented into phrases in a plurality of possible combinations.
  • a sentence “The man over there is watching TV” may be segmented according to two segmentation paths as below:
  • a first segmentation path “The man/over/there is/watching TV” .
  • a second segmentation path “The man/over there/is/watching TV” .
  • the transcription may make no sense to the user.
  • conventional segmentation systems and methods cannot determine which segmentation path is better, as each phrase in both segmentation paths seems linguistically reasonable.
  • embodiments of the disclosure provide improved systems and methods for segmenting a sentence.
  • An aspect of the disclosure is directed to a method for segmenting a sentence.
  • the method may include identifying, by a processor, a first phrase in the sentence associated with a first segmentation path; determining, by the processor, a first group of derivative phrases semantically associated with the first phrase; determining, by the processor, a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and segmenting the sentence based on the first evaluation score.
  • the system may include a communication interface configured for receiving the sentence; a memory configured for storing the sentence and a language model; and a processor configured for identifying a first phrase in the sentence associated with a first segmentation path; determining a first group of derivative phrases semantically associated with the first phrase; determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and segmenting the sentence based on the first evaluation score.
  • Yet another aspect of the disclosure is directed to a non-transitory computer-readable medium that stores a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence.
  • the method may include identifying a first phrase in the sentence associated with a first segmentation path; determining a first group of derivative phrases semantically associated with the first phrase; determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and segmenting the sentence based on the first evaluation score.
  • FIG. 1 is a block diagram of an exemplary system for segmenting a sentence, according to some embodiments of the disclosure.
  • FIG. 2 illustrates two exemplary segmentation paths of a Chinese sentence, according to some embodiments of the disclosure.
  • FIG. 3 is a flowchart of an exemplary method for segmenting a sentence, according to some embodiments of the disclosure.
  • FIG. 1 is a block diagram of an exemplary system 100 for segmenting a sentence, according to some embodiments of the disclosure.
  • System 100 may be a general server or a proprietary device for processing text information in a sentence.
  • system 100 may include a communication interface 102, a processor 104, and a memory 116.
  • Processor 104 may further include multiple functional modules, such as a tokenizer 106, a phrase identifier 108, a derivative phrase generator 110, a score determination unit 112, and a segmentation unit 114.
  • These modules can be functional hardware units (e.g., portions of an integrated circuit) of processor 104 designed for use with other components or a part of a program.
  • the program may be stored on a computer-readable medium, and when executed by processor 104, it may perform one or more functions.
  • Although FIG. 1 shows units 106-114 all within one processor 104, it is contemplated that these units may be distributed among multiple processors located near to or remote from each other.
  • System 100 may be implemented in the cloud, or a separate computer/server.
  • Communication interface 102 may be configured to receive one or more sentences 120.
  • Memory 116 may be configured to store the one or more sentences.
  • Memory 116 may be implemented as any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM) , an electrically erasable programmable read-only memory (EEPROM) , an erasable programmable read-only memory (EPROM) , a programmable read-only memory (PROM) , a read-only memory (ROM) , a magnetic memory, a flash memory, or a magnetic or optical disk.
  • processor 104 may identify a first phrase in the sentence received by communication interface 102.
  • the first phrase is associated with a first segmentation path.
  • tokenizer 106 may segment a sentence in one or more segmentation paths. As described with Example I, “The man over there is watching TV” may be segmented as a first segmentation path of “The man/over/there is/watching TV” or a second segmentation path of “The man/over there/is/watching TV.” A first phrase “there is” is associated with the first segmentation path and a second phrase “over there” is associated with the second segmentation path. Phrases “there is” and “over there” may be identified by comparing the first and second segmentation paths and locating differences between the segmentation paths.
  • phrase identifier 108 may identify that the segments “over/there is” and “over there/is” are not identical by comparing the first and second segmentation paths. Then phrase identifier 108 may further identify “there is” as a first phrase associated with the first segmentation path and “over there” as a second phrase associated with the second segmentation path. It is contemplated that the first and second segments are different segments on a same part of the sentence. Therefore, the first and second phrases have at least one overlapping word (e.g., “there”).
  • Derivative phrase generator 110 may then determine a first group of derivative phrases semantically associated with the first phrase.
  • the derivative phrases may be determined based on semantic vectors between the first phrase and candidate phrases.
  • the candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared with the first phrase based on the semantic vectors.
  • the semantic vector is an algebraic model for representing text documents (e.g., a phrase) as vectors.
  • the difference between two semantic vectors associated with phrases may be represented by a cosine distance between the two semantic vectors.
  • the first group of derivative phrases may include synonyms of the first phrase.
  • communication interface 102 may be further configured to retrieve from a phrase database 122 synonyms corresponding to the phrases.
  • the first phrase “there is” may have several synonyms, such as “there are” , “here is” , “exist” , “have” , “has” , or the like.
  • processor 104 may further determine a second group of derivative phrases semantically associated with the second phrase.
  • the second group of derivative phrases associated with the second phrase “over there” may have synonyms, such as “there” , “over here” , or the like.
  • score determination unit 112 may replace the first phrase with the respective derivative phrases in the first group, and determine a first evaluation score based on the modified sentences. For example, the sentence “The man/over/there is/watching TV” may be modified by replacing the first phrase “there is” with the synonyms of “there are” , “here is” , “exist” , “have” , “has” , or the like.
  • a plurality of modified sentences may be generated, including “The man/over/there are/watching TV” , “The man/over/here is/watching TV” , “The man/over/exist/watching TV” , “The man/over/have/watching TV” , “The man/over/has/watching TV” , or the like.
  • a language model score may be determined for each of the modified sentences using a language model, and the language model scores may be averaged to derive the first evaluation score.
  • the language model can evaluate a segmentation path according to natural language rules.
  • the modified sentences “The man/over/there are/watching TV” , “The man/over/here is/watching TV” , “The man/over/exist/watching TV” , “The man/over/have/watching TV” , “The man/over/has/watching TV” may be respectively evaluated as 47 points, 68 points, 35 points, 33 points, and 42 points. Accordingly, the first evaluation score may be an average of the above scores, i.e., 45 points.
  • the language model may determine that a singular subject paired with a plural verb does not comply with the natural language rules, and therefore evaluate the modified sentence with a low score. It is contemplated that whether a low score marks a segmentation path as improper or proper is not restrictive.
  • the original sentence including the first phrase may also be evaluated by the language model to generate a first base score.
  • the original sentence “The man/over/there is/watching TV” may be evaluated as 68 points, which is considered as the first base score.
  • score determination unit 112 may replace the second phrase (e.g., “over there” ) with the respective derivative phrases in the second group, and determine a second evaluation score based on the modified sentences. For example, a second evaluation score may be determined as 67 points and a second base score may be determined as 69 points.
  • the language model may be trained for a designated language, such as English, Chinese, Japanese, or the like. For sentences in Example I described above, an English language model may be used.
  • the language model may be a general model pre-stored in memory 116, or trained for a specific area (e.g., novel, legal, navigation, or the like) by, for example, machine learning.
  • the modified sentence should still be able to deliver similar meanings and follow natural language rules. That is, when the language model evaluates the segmentation paths based on the original sentence and modified sentence, the language model scores based on the original sentence and the modified sentence should be close.
  • the modified sentences may have drastically different meanings or may not even follow natural language rules. That is, a difference between the language model scores based on the original sentence and the modified sentence may be magnified, so that processor 104 may determine the segmentation path is not proper.
  • a threshold may be pre-determined, such as 5 points. When the difference is greater than or equal to the threshold, the corresponding segmentation path may be determined as an improper segmentation path. It is contemplated that the pre-determined threshold may be different for different language models.
  • Segmentation unit 114 may further segment the sentence based on the evaluation score (e.g., the first and/or second evaluation score) .
  • a difference between the first base score (e.g., 68 points) and the first evaluation score (e.g., 45 points) may be determined.
  • a difference between the second base score (e.g., 69 points) and the second evaluation score (e.g., 67 points) may likewise be determined.
  • the first base score (e.g., 68) corresponding to the first segmentation path and the second base score (e.g., 69) corresponding to the second segmentation path may be close, and selecting one segmentation path over another based on these base scores may sometimes cause errors. That is, a base score of 69 does not necessarily indicate that the corresponding segmentation path is better than one that has a base score of 68. However, by replacing the identified phrases in the first and second segmentation paths with synonyms, the difference between the first and second segmentation paths may be magnified so that segmentation unit 114 may more accurately select one of the segmentation paths based on the evaluation scores associated with respective segmentation paths.
  • segmentation unit 114 may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, as described above, the first and second evaluation scores are 45 points and 67 points, respectively. Therefore, segmentation unit 114 may select one of the first and second segmentation paths that has the higher evaluation score (e.g., the second segmentation path) , and segment the sentence according to the selected segmentation path.
  • system 100 may also process a sentence of text in another language, such as Chinese.
  • FIG. 2 further illustrates two exemplary segmentation paths of a Chinese sentence, according to some embodiments of the disclosure.
  • the Chinese sentence means that Associazione Calcio Fiorentina has the ability to compete withmeaning Football Club in Italian Serie A League. Pinyin spelling is marked for each Chinese character under the original Chinese sentence.
  • the Chinese sentence may be segmented according to first and second segmentation paths.
  • the first and second segmentation paths may be compared by tokenizer 106 to identify that segment 202 and segment 204 are different.
  • phrase identifier 108 may identify the phrase corresponding to “Yi Jia” in first segmentation path as a first phrase associated with the first segmentation path, and the phrase “Yi Jia” means “Italian Serie A League” in Chinese.
  • Phrase identifier 108 may also identify the phrase corresponding to “Zai Yi” in the second segmentation path as a second phrase; the phrase “Zai Yi” means “care for”, “mind”, or the like. Note that the phrases “Yi Jia” and “Zai Yi” have the Chinese character “Yi” in common.
  • Derivative phrase generator 110 may determine a first group and a second group of derivative phrases semantically associated with the first phrase and the second phrase, respectively.
  • the derivative phrases semantically associated with the first phrase “Yi Jia” may include “Primera División de”, “Premier League”, “German Congress”, or the like.
  • the derivative phrases semantically associated with the second phrase “Zai Yi” may include “care for, ” “mind, ” “care about, ” or the like.
  • modified sentences may be generated.
  • score determination unit 112 may then evaluate the first and second segmentation paths based on the original sentence and the modified sentences using a language model for Chinese. For example, the first and second segmentation paths based on the original sentence may be scored as 58.342 and 59.081 respectively, and the first and second segmentation paths based on the modified sentences may be scored as 58.561 and 34.952.
  • segmentation unit 114 may select the first segmentation path having a higher evaluation score (e.g., 58.561 as opposed to 34.952) , and segment the sentence accordingly. Note in this example, selecting the segmentation path based on the base scores would have resulted in an error, as the second segmentation path actually has a slightly higher base score.
  • the above-described system 100 may magnify (and sometimes, reverse) the difference between two or more segmentation paths by replacing an identified phrase with synonyms and select a proper path for segmenting the sentence, when the language model cannot distinguish them.
  • FIG. 3 is a flowchart of an exemplary method 300 for segmenting a sentence, according to some embodiments of the disclosure.
  • method 300 may be implemented by a segmentation device, and may include steps S302-S308.
  • the segmentation device may identify a first phrase in the sentence associated with a first segmentation path.
  • the segmentation device may generate at least two segmentation paths of the sentence.
  • Each of the segmentation paths may include a plurality of segments.
  • the segmentation device may compare the plurality of segments and identify a segment in a first segmentation path that is different from a corresponding segment in other segmentation paths.
  • the segmentation device may identify a segment in a second segmentation path.
  • the segmentation device may identify the first phrase in the sentence associated with the first segmentation path, and the second phrase in the sentence associated with the second segmentation path.
  • Because the first and second segmentation paths are different segmentation paths on a same sentence or a same part of the sentence, the first and second phrases may have at least one overlapping word.
  • the segmentation device may determine a first group of derivative phrases semantically associated with the first phrase.
  • the derivative phrases may be determined based on semantic vectors between the first phrase and candidate phrases.
  • the candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared with the first phrase based on the semantic vectors.
  • the first group of derivative phrases may include synonyms of the first phrase.
  • the segmentation device may determine a second group of derivative phrases semantically associated with the second phrase.
  • the segmentation device may determine a first evaluation score based on the modified sentences generated by replacing the first phrase with the respective derivative phrases in the first group.
  • a language model score may be determined for each of the modified sentences using a language model, and the language model scores may be averaged to derive the first evaluation score.
  • the language model may evaluate a segmentation path according to natural language rules.
  • the segmentation device may further determine a second evaluation score based on the modified sentences generated by replacing the second phrase with the respective derivative phrases in the second group.
  • a first base score and a second base score may also be determined based on the sentence. For example, the first and second segments of the sentence may be scored by the language model to generate the first and second base scores.
  • an averaged score is merely an example for evaluating the segmentation paths.
  • the individual scores may be manipulated or combined in any suitable ways to derive the evaluation score.
  • the evaluation score may be a weighted average of the individual scores, and the weights may correspond to how close the respective synonyms are to the phrase.
  • a variance of the individual language model scores for the modified sentences may be used to determine whether the language model scores vary significantly. If the language model scores vary significantly, it may indicate the corresponding segmentation path is not proper.
  • the segmentation device may segment the sentence based on the first evaluation score.
  • the segmentation device may compare the first evaluation score with the first base score, and segment the sentence according to the first segmentation path when the difference between the first base score and the first evaluation score is less than a threshold.
  • the evaluation score based on modified sentences should be close to the base score based on the original sentence.
  • the segmentation device may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, the segmentation device may select one of the first and second segmentation paths that has the higher evaluation score, and segment the sentence according to the selected segmentation path.
  • the computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices.
  • the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed.
  • the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
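The scoring and selection steps described in the list above can be sketched in a few lines of Python. This is only an illustrative sketch, not the disclosed implementation: the function names are hypothetical, the language model is left out (its scores are passed in as plain numbers taken from the worked examples), and treating the "difference" as an absolute difference is an assumption.

```python
from statistics import mean, pvariance


def modified_paths(path, phrase, derivatives):
    """Replace one phrase of a '/'-separated segmentation path by each derivative."""
    segments = path.split("/")
    return ["/".join(d if s == phrase else s for s in segments)
            for d in derivatives]


def evaluation_score(scores, weights=None):
    """Plain average, or a weighted average when weights are supplied."""
    if weights is None:
        return mean(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def is_improper(base_score, eval_score, threshold=5):
    """Assumption: absolute difference; improper when it reaches the threshold."""
    return abs(base_score - eval_score) >= threshold


def select_path(eval_scores):
    """Pick the path label with the higher evaluation score."""
    return max(eval_scores, key=eval_scores.get)


# Scores from Example I; a real system would obtain them from a language model.
lm_scores = [47, 68, 35, 33, 42]
first_eval = evaluation_score(lm_scores)   # average of the five scores
spread = pvariance(lm_scores)              # a large spread may signal an improper path
chosen = select_path({"first": first_eval, "second": 67})
```

With the Example I numbers, the first path's evaluation score falls well below its base score of 68, while the second path's scores (69 and 67) stay within the 5-point threshold, so the second path is selected.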

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the disclosure provide systems and methods for segmenting a sentence. The method may include identifying a first phrase in the sentence associated with a first segmentation path, determining a first group of derivative phrases semantically associated with the first phrase, determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group, and segmenting the sentence based on the first evaluation score.

Description

SYSTEM AND METHOD FOR SEGMENTING A SENTENCE
INTERNATIONAL PATENT APPLICATION
FOR
SYSTEM AND METHOD FOR SEGMENTING A SENTENCE
BY
JIE BAI AND XIULIN LI
TECHNICAL FIELD
The present disclosure relates to Text-to-Speech (TS) techniques, and more particularly, to segmenting a text sentence.
BACKGROUND
Text-to-Speech techniques can transcribe text information into audio signals. For example, in a navigation application (e.g., a DiDi app) , text information, such as traffic condition, addresses, or the like may be presented to a user by voice. And books and news may be read to the user as well by means of TS techniques.
To be read in a natural way, a piece of text (e.g., a sentence) must be segmented properly before being transcribed into audio signals. Generally, each of the phrases that are included in a sentence contains one or more words. Consistent with this disclosure, a word can be a word in a Latin-script language such as English, French, or Spanish, or a character in Asian languages such as Chinese, Korean, or Japanese. These words or characters may be segmented into phrases in a plurality of possible combinations. In Example I, a sentence “The man over there is watching TV” may be segmented according to two segmentation paths as below:
A first segmentation path: “The man/over/there is/watching TV” .
A second segmentation path: “The man/over there/is/watching TV” .
When the audio signals are generated based on the first segmentation path, the transcription may make no sense to the user. However, conventional segmentation systems and methods cannot determine which segmentation path is better, as each phrase in both segmentation paths seems linguistically reasonable.
To address the above issue, embodiments of the disclosure provide improved systems and methods for segmenting a sentence.
SUMMARY
An aspect of the disclosure is directed to a method for segmenting a sentence. The method may include identifying, by a processor, a first phrase in the sentence associated with a first segmentation path; determining, by the processor, a first group of derivative phrases semantically associated with the first phrase; determining, by the processor, a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and segmenting the sentence based on the first evaluation score.
Another aspect of the disclosure is directed to a system for segmenting a sentence. The system may include a communication interface configured for receiving the sentence; a memory configured for storing the sentence and a language model; and a processor configured for identifying a first phrase in the sentence associated with a first segmentation path; determining a first group of derivative phrases semantically associated with the first phrase; determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and segmenting the sentence based on the first evaluation score.
Yet another aspect of the disclosure is directed to a non-transitory computer-readable medium that stores a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence. The method may include identifying a first phrase in the sentence associated with a first segmentation path; determining a first group of derivative phrases semantically associated with the first phrase; determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and segmenting the sentence based on the first evaluation score.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of an exemplary system for segmenting a sentence, according to some embodiments of the disclosure.
FIG. 2 illustrates two exemplary segmentation paths of a Chinese sentence, according to some embodiments of the disclosure.
FIG. 3 is a flowchart of an exemplary method for segmenting a sentence, according to some embodiments of the disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
FIG. 1 is a block diagram of an exemplary system 100 for segmenting a sentence, according to some embodiments of the disclosure.
System 100 may be a general server or a proprietary device for processing text information in a sentence. As shown in FIG. 1, system 100 may include a communication interface 102, a processor 104, and a memory 116. Processor 104 may further include multiple functional modules, such as a tokenizer 106, a phrase identifier 108, a derivative phrase generator 110, a score determination unit 112, and a segmentation unit 114. These modules (and any corresponding sub-modules or sub-units) can be functional hardware units (e.g., portions of an integrated circuit) of processor 104 designed for use with other components or a part of a program. The program may be stored on a computer-readable medium, and when executed by processor 104, it may perform one or more functions. Although FIG. 1 shows units 106-114 all within one processor 104, it is contemplated that these units may be distributed among multiple processors located near to or remote from each other. System 100 may be implemented in the cloud, or a separate computer/server.
Communication interface 102 may be configured to receive one or more sentences 120. Memory 116 may be configured to store the one or more sentences. Memory 116 may be implemented as any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM) , an electrically erasable programmable read-only memory (EEPROM) , an erasable programmable read-only memory (EPROM) , a programmable read-only memory (PROM) , a read-only memory (ROM) , a magnetic memory, a flash memory, or a magnetic or optical disk.
Consistent with embodiments of the disclosure, processor 104 may identify a first phrase in the sentence received by communication interface 102. The first phrase is associated with a first segmentation path.
For example, tokenizer 106 may segment a sentence in one or more segmentation paths. As described with Example I, “The man over there is watching TV” may be segmented as a first segmentation path of “The man/over/there is/watching TV” or a second segmentation path of “The man/over there/is/watching TV.” A first phrase “there is” is associated with the first segmentation path and a second phrase “over there” is associated with the second segmentation path. Phrases “there is” and “over there” may be identified by comparing the first and second segmentation paths and locating differences between the segmentation paths. With reference to Example I, phrase identifier 108 may identify that the segments “over/there is” and “over there/is” are not identical by comparing the first and second segmentation paths. Then phrase identifier 108 may further identify “there is” as a first phrase associated with the first segmentation path and “over there” as a second phrase associated with the second segmentation path. It is contemplated that the first and second segments are different segments on a same part of the sentence. Therefore, the first and second phrases have at least one overlapping word (e.g., “there”).
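The comparison of the two segmentation paths in Example I can be illustrated with a minimal sketch. It assumes that each path is a list of segment strings and that the paths differ in a single contiguous region. The function names, and the rule of picking the segment with the most words inside the differing region, are illustrative assumptions rather than details given in the disclosure.

```python
def differing_region(path_a, path_b):
    """Drop the segments shared at the start and end of two paths."""
    start = 0
    while (start < min(len(path_a), len(path_b))
           and path_a[start] == path_b[start]):
        start += 1
    end_a, end_b = len(path_a), len(path_b)
    while (end_a > start and end_b > start
           and path_a[end_a - 1] == path_b[end_b - 1]):
        end_a -= 1
        end_b -= 1
    return path_a[start:end_a], path_b[start:end_b]


def identify_phrases(path_a, path_b):
    """Pick one phrase per path from the region where the paths disagree."""
    region_a, region_b = differing_region(path_a, path_b)
    if not region_a or not region_b:
        return None, None  # the two paths are identical

    # Assumption: the phrase of interest is the segment with the most words.
    def longest(region):
        return max(region, key=lambda segment: len(segment.split()))

    return longest(region_a), longest(region_b)


first_path = ["The man", "over", "there is", "watching TV"]
second_path = ["The man", "over there", "is", "watching TV"]
first_phrase, second_phrase = identify_phrases(first_path, second_path)

# Words shared by the two phrases, e.g. "there" in Example I.
overlap = set(first_phrase.split()) & set(second_phrase.split())
```

With the paths of Example I, the differing region is “over/there is” versus “over there/is”, and the two selected phrases share the word “there”.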
Derivative phrase generator 110 may then determine a first group of derivative phrases semantically associated with the first phrase. The derivative phrases may be determined based on semantic vectors between the first phrase and candidate phrases. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared with the first phrase based on the semantic vectors. The semantic vector is an algebraic model for representing text documents (e.g., a phrase) as vectors. Generally, the difference between two semantic vectors associated with phrases may be represented by a cosine distance between the two semantic vectors. In some embodiments, the first group of derivative phrases may include synonyms of the first phrase. As shown in FIG. 1, communication interface 102 may be further configured to retrieve from a phrase database 122 synonyms corresponding to the phrases. For example, the first phrase “there is” may have several synonyms, such as “there are”, “here is”, “exist”, “have”, “has”, or the like. In some embodiments, processor 104 may further determine a second group of derivative phrases semantically associated with the second phrase. The second group of derivative phrases associated with the second phrase “over there” may have synonyms, such as “there”, “over here”, or the like.
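As a rough illustration of comparing a phrase with candidate phrases through semantic vectors, the sketch below ranks candidates by cosine similarity. The three-dimensional vectors, the `min_similarity` cutoff of 0.8, and the function names are hypothetical; the disclosure does not specify how the vectors are produced or what cutoff is applied.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def derivative_phrases(phrase_vector, candidates, min_similarity=0.8):
    """Return candidate phrases close to the phrase, most similar first."""
    scored = [(cosine_similarity(phrase_vector, vector), phrase)
              for phrase, vector in candidates.items()]
    return [phrase for similarity, phrase in sorted(scored, reverse=True)
            if similarity >= min_similarity]


# Toy vectors invented for this sketch; real semantic vectors would come
# from a trained model and have many more dimensions.
phrase_database = {
    "there are": [0.9, 0.1, 0.0],
    "here is": [0.8, 0.3, 0.1],
    "watching": [0.0, 0.1, 0.9],
}
there_is_vector = [1.0, 0.2, 0.0]

first_group = derivative_phrases(there_is_vector, phrase_database)
```

Here the unrelated candidate “watching” falls below the cutoff, so only the two near-synonyms are kept in the first group.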
Given the first group of derivative phrases, score determination unit 112 may replace the first phrase with the respective derivative phrases in the first group, and determine a first evaluation score based on the modified sentences. For example, the sentence “The man/over/there is/watching TV” may be modified by replacing the first phrase “there is” with the synonyms of “there are” , “here is” , “exist” , “have” , “has” , or the like. Therefore, a plurality of modified sentences may be generated, including “The man/over/there are/watching TV” , “The man/over/here is/watching TV” , “The man/over/exist/watching TV” , “The man/over/have/watching TV” , “The man/over/has/watching TV” , or the like. In some embodiments, a language model score may be determined for each of the modified sentences using a language model, and the language model scores may be averaged to derive the first evaluation score. The language model can evaluate a segmentation path according to natural language rules. In some embodiments, the modified sentences “The man/over/there are/watching TV” , “The man/over/here is/watching TV” , “The man/over/exist/watching TV” , “The man/over/have/watching TV” , “The man/over/has/watching TV” may be respectively evaluated as 47 points, 68 points, 35 points, 33 points, and 42 points. Accordingly, the first evaluation score may be an average of the above scores, i.e., 45 points. As an example, for the modified sentence “The man/over/there are/watching TV” , the language model may determine that a singular subject going with a plural verb does not comply with the natural language rules, and therefore evaluate the modified sentence with a low score. It is contemplated that, whether a low score indicates a segmentation path as being improper or proper is not restrictive.
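The replacement and averaging steps described above may be sketched as follows. The lookup table stands in for a trained language model and simply returns the illustrative scores quoted in the text; `modified_sentences`, `evaluation_score`, and `toy_lm_score` are names assumed for this sketch.

```python
def modified_sentences(segments, phrase, derivatives):
    """Replace `phrase` in a segmented sentence with each derivative phrase."""
    index = segments.index(phrase)
    return [segments[:index] + [derivative] + segments[index + 1:]
            for derivative in derivatives]


def evaluation_score(sentences, lm_score):
    """Average the language model scores of the modified sentences."""
    scores = [lm_score(sentence) for sentence in sentences]
    return sum(scores) / len(scores)


# Stand-in for a language model: the illustrative scores from the text.
ILLUSTRATIVE_SCORES = {
    "The man/over/there are/watching TV": 47,
    "The man/over/here is/watching TV": 68,
    "The man/over/exist/watching TV": 35,
    "The man/over/have/watching TV": 33,
    "The man/over/has/watching TV": 42,
}


def toy_lm_score(segments):
    return ILLUSTRATIVE_SCORES["/".join(segments)]


first_path = ["The man", "over", "there is", "watching TV"]
synonyms = ["there are", "here is", "exist", "have", "has"]

sentences = modified_sentences(first_path, "there is", synonyms)
first_evaluation = evaluation_score(sentences, toy_lm_score)  # (47+68+35+33+42)/5
```

The average of the five illustrative scores reproduces the first evaluation score of 45 points given in the text.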
The original sentence including the first phrase may also be evaluated by the language model to generate a first base score. For example, the original sentence “The man/over/there is/watching TV” may be evaluated as 68 points, which is considered as the first base score.
Similarly, score determination unit 112 may replace the second phrase (e.g., “over there” ) with the respective derivative phrases in the second group, and determine a second evaluation score based on the modified sentences. For example, a second evaluation score may be determined as 67 points and a second base score may be determined as 69 points.
The language model may be trained for a designated language, such as English, Chinese, Japanese, or the like. For sentences in Example I described above, an English language model may be used. The language model may be a general model pre-stored in memory 116, or trained for a specific area (e.g., novel, legal, navigation, or the like) by, for example, machine learning.
It is contemplated that, when a phrase of a sentence within a proper segmentation path is replaced with a synonym of the phrase, the modified sentence should still deliver a similar meaning and follow natural language rules. That is, when the language model evaluates the segmentation paths based on the original sentence and the modified sentence, the two language model scores should be close. On the other hand, when a phrase within an improper segmentation path is replaced with a synonym of the phrase, the modified sentences may have drastically different meanings or may even fail to meet natural language rules. That is, the difference between the language model scores based on the original sentence and the modified sentence may be magnified, so that processor 104 may determine that the segmentation path is not proper. In some embodiments, a threshold may be pre-determined, such as 5 points. When the difference is greater than or equal to the threshold, the corresponding segmentation path may be determined as an improper segmentation path. It is contemplated that the pre-determined threshold may be different for different language models.
Segmentation unit 114 may further segment the sentence based on the evaluation score (e.g., the first and/or second evaluation score) . For example, according to the first segmentation path, a difference between the first base score (e.g., 68 points) and the first evaluation score (e.g., 45 points) may be determined. The difference is 68-45=23 points, which is greater than the threshold (e.g., 5 points) . Therefore, processor 104 may determine that the first segmentation path is an improper segmentation path. On the other hand, according to the second segmentation path, a difference between the second base score (e.g., 69 points) and the second evaluation score (e.g., 67 points) may be determined. The difference is 69-67=2 points, which is less than the threshold (e.g., 5 points) . Therefore, segmentation unit 114 may determine that the second segmentation path is a proper segmentation path, and segment the sentence according to the second segmentation path.
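The threshold test applied to the two paths can be written as a short check. This is a sketch under the assumption that the difference is taken as an absolute value; the text itself only shows the base score minus the evaluation score, and the function name is hypothetical.

```python
def is_proper_path(base_score, evaluation_score, threshold=5.0):
    """A path is treated as proper when the two scores stay close.

    Assumption: closeness is measured by the absolute difference; a
    difference greater than or equal to the threshold marks the path
    as improper, as described in the text.
    """
    return abs(base_score - evaluation_score) < threshold


first_path_is_proper = is_proper_path(68, 45)   # difference of 23 points
second_path_is_proper = is_proper_path(69, 67)  # difference of 2 points
```

With the illustrative scores, only the second segmentation path passes the check, so the sentence would be segmented as “The man/over there/is/watching TV”.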
As discussed above, the first base score (e.g., 68) corresponding to the first segmentation path and the second base score (e.g., 69) corresponding to the second segmentation path may be close, and selecting one segmentation path over another based on these base scores may sometimes cause errors. That is, a base score of 69 does not necessarily indicate that the corresponding segmentation path is better than one that has a base score of 68. However, by replacing the identified phrases in the first and second segmentation paths with synonyms, the difference between the first and second segmentation paths may be magnified so that segmentation unit 114 may more accurately select one of the segmentation paths based on the evaluation scores associated with the respective segmentation paths.
Consistent with embodiments of the disclosure, segmentation unit 114 may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, as described above, the first and second evaluation scores are 45 points and 67 points, respectively. Therefore, segmentation unit 114 may select one of the first and second segmentation paths that has the higher evaluation score (e.g., the second segmentation path) , and segment the sentence according to the selected segmentation path.
It is contemplated that, all the scores and threshold described above are merely illustrative, and may be modified if necessary.
Besides Example I in English, system 100 may also process a sentence of text in another language, such as Chinese. FIG. 2 further illustrates two exemplary segmentation paths of a Chinese sentence, according to some embodiments of the disclosure. The Chinese sentence means that Associazione Calcio Fiorentina has the ability to compete with Juventus Football Club in the Italian Serie A League. The Pinyin spelling is marked for each Chinese character under the original Chinese sentence.
As shown in FIG. 2, in Example II, the Chinese sentence may be segmented according to first and second segmentation paths. The first and second segmentation paths may be compared by tokenizer 106 to identify that segment 202 and segment 204 are different. Then phrase identifier 108 may identify the phrase corresponding to “Yi Jia” in the first segmentation path as a first phrase associated with the first segmentation path; the phrase “Yi Jia” means “Italian Serie A League” in Chinese. Phrase identifier 108 may also identify the phrase corresponding to “Zai Yi” in the second segmentation path as a second phrase associated with the second segmentation path; the phrase “Zai Yi” means “care for”, “mind”, or the like. Note that the phrase “Yi Jia” and the phrase “Zai Yi” have the Chinese character “Yi” in common.
Derivative phrase generator 110 may determine a first group and a second group of derivative phrases semantically associated with the first phrase and the second phrase, respectively. For example, the derivative phrases semantically associated with the first phrase “Yi Jia” may include “Primera División de [Figure PCTCN2017095305-appb-000001],” “Premier League,” “German Bundesliga,” or the like. The derivative phrases semantically associated with the second phrase “Zai Yi” may include “care for,” “mind,” “care about,” or the like.
By replacing the first and second phrases with the corresponding derivative phrases respectively, modified sentences may be generated. And score determination unit 112 may then evaluate the first and second segmentation paths based on the original sentence and the modified sentences using a language model for Chinese. For example, the first and second segmentation paths based on the original sentence may be scored as 58.342 and 59.081 respectively, and the first and second segmentation paths based on the modified sentences may be scored as 58.561 and 34.952.
Based on the evaluation scores, segmentation unit 114 may select the first segmentation path having a higher evaluation score (e.g., 58.561 as opposed to 34.952) , and segment the sentence accordingly. Note in this example, selecting the segmentation path based on the base scores would have resulted in an error, as the second segmentation path actually has a slightly higher base score.
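The selection in Example II can be reproduced with the scores quoted above. The dictionary layout and the `select_path` helper are assumptions made for this sketch; only the four numbers come from the text.

```python
# Scores quoted for Example II: base scores from the original sentence,
# evaluation scores from the sentences modified with derivative phrases.
example_ii_scores = {
    "first": {"base": 58.342, "evaluation": 58.561},
    "second": {"base": 59.081, "evaluation": 34.952},
}


def select_path(scores, key):
    """Return the name of the path with the highest score of the given kind."""
    return max(scores, key=lambda name: scores[name][key])


chosen_by_evaluation = select_path(example_ii_scores, "evaluation")
chosen_by_base = select_path(example_ii_scores, "base")
```

Selecting on the evaluation scores yields the first segmentation path, whereas selecting on the base scores alone would have yielded the second, which is the error the text points out.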
The above-described system 100 may magnify (and sometimes, reverse) the difference between two or more segmentation paths by replacing an identified phrase with synonyms and select a proper path for segmenting the sentence, when the language model cannot distinguish them.
Another aspect of the disclosure is directed to a method for segmenting a sentence. FIG. 3 is a flowchart of an exemplary method 300 for segmenting a sentence, according to some embodiments of the disclosure. For example, method 300 may be implemented by a segmentation device, and may include steps S302-S308.
In step S302, the segmentation device may identify a first phrase in the sentence associated with a first segmentation path. In some embodiments, the segmentation device may generate at least two segmentation paths of the sentence. Each of the segmentation paths may include a plurality of segments. The segmentation device may compare the plurality of segments and identify a segment in a first segmentation path that is different from a corresponding segment in other segmentation paths. Similarly, the segmentation device may identify a segment in a second segmentation path. Further, the segmentation device may identify the first phrase in the sentence associated with the first segmentation path, and the second phrase in the sentence associated with the second segmentation path. As the first and second segmentation paths are different segmentation paths on a same sentence or a same part of the sentence, the first and second phrases may have at least one overlapping word.
In step S304, the segmentation device may determine a first group of derivative phrases semantically associated with the first phrase. The derivative phrases may be determined based on semantic vectors between the first phrase and candidate phrases. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared with the first phrase based on the semantic vectors. In some embodiments, the first group of derivative phrases may include synonyms of the first phrase. Similarly, the segmentation device may determine a second group of derivative phrases semantically associated with the second phrase.
In step S306, the segmentation device may determine a first evaluation score based on the modified sentences generated by replacing the first phrase with the respective derivative phrases in the first group. A language model score may be determined for each of the modified sentences using a language model, and the language model scores may be averaged to derive the first evaluation score. The language model may evaluate a segmentation path according to natural language rules. The segmentation device may further determine a second evaluation score based on the modified sentences generated by replacing the second phrase with the respective derivative phrases in the second group. A first base score and a second base score may also be determined based on the sentence. For example, the first and second segments of the sentence may be scored by the language model to generate the first and second base scores. It is contemplated that an averaged score is merely an example for evaluating the segmentation paths. The individual scores may be manipulated or combined in any suitable way to derive the evaluation score. For example, instead of a straight average of the individual scores, the evaluation score may be a weighted average of the individual scores, and the weights may correspond to how close the respective synonyms are to the phrase. As another example, a variance of the individual language model scores for the modified sentences may be used to determine whether the language model scores vary significantly. If the language model scores vary significantly, it may indicate that the corresponding segmentation path is not proper.
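The two alternatives to a straight average mentioned in step S306 may be sketched as below. The scores are the illustrative ones from Example I; the similarity weights are invented for this sketch, and the use of the population variance is an assumption, since the text does not say which variance is meant.

```python
from statistics import pvariance


def weighted_evaluation_score(scores, weights):
    """Weighted average of language model scores.

    Each weight is meant to reflect how close the corresponding synonym
    is to the original phrase (larger weight for a closer synonym).
    """
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Illustrative language model scores for the five modified sentences.
lm_scores = [47, 68, 35, 33, 42]
# Hypothetical similarity weights for "there are", "here is", "exist",
# "have", and "has"; these values do not come from the disclosure.
similarity_weights = [0.9, 0.8, 0.5, 0.4, 0.4]

weighted_score = weighted_evaluation_score(lm_scores, similarity_weights)

# Spread of the individual scores; a large value suggests that the
# synonyms do not fit the sentence consistently.
score_variance = pvariance(lm_scores)
```

How large a variance counts as "significant" is left open by the text and would need to be chosen for the language model in use.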
In step S308, the segmentation device may segment the sentence based on the first evaluation score. In one embodiment, the segmentation device may compare the first evaluation score with the first base score, and segment the sentence according to the first segmentation path when the difference between the first base score and the first evaluation score is less than a threshold. As discussed above, when a sentence is segmented properly, the evaluation score based on modified sentences should be close to the base score based on the original sentence. In another embodiment, the segmentation device may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, the segmentation device may select one of the first and second segmentation paths that has the higher evaluation score, and segment the sentence according to the selected segmentation path.
Yet another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed segmentation system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods. Although the embodiments are described for two segmentation paths as an example, the described segmentation system and method can be applied to more than two segmentation paths.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims (20)

  1. A computer-implemented method for segmenting a sentence, comprising:
    identifying, by a processor, a first phrase in the sentence associated with a first segmentation path;
    determining, by the processor, a first group of derivative phrases semantically associated with the first phrase;
    determining, by the processor, a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and
    segmenting the sentence based on the first evaluation score.
  2. The method of claim 1, wherein determining the first evaluation score includes:
    determining a language model score for each modified sentence; and
    averaging the language model scores to derive the first evaluation score.
  3. The method of claim 2, wherein the language model score is generated using a language model according to natural language rules.
  4. The method of claim 1, further comprising:
    determining a first base score based on the sentence including the first phrase; and
    segmenting the sentence based on a difference between the first base score and the first evaluation score.
  5. The method of claim 4, wherein the sentence is segmented according to the first segmentation path when the difference between the first base score and the first evaluation score is less than a threshold.
  6. The method of claim 1, further comprising:
    identifying a second phrase in the sentence associated with a second segmentation path;
    determining a second group of derivative phrases semantically associated with the second phrase;
    determining a second evaluation score based on modified sentences generated by replacing the second phrase with the respective derivative phrase in the second group; and
    segmenting the sentence based on a difference between the first evaluation score and the second evaluation score.
  7. The method of claim 6, wherein the second phrase has at least one overlapping word with the first phrase.
  8. The method of claim 6, further comprising:
    identifying the first phrase and the second phrase based on a difference between the first segmentation path and the second segmentation path.
  9. The method of claim 6, further comprising:
    selecting one of the first and second segmentation paths that has the higher evaluation score; and
    segmenting the sentence according to the selected segmentation path.
  10. The method of claim 1, wherein the derivative phrases are determined based on semantic vectors between the first phrase and candidate phrases.
  11. A system for segmenting a sentence, comprising:
    a communication interface configured for receiving the sentence;
    a memory configured for storing the sentence and a language model; and
    a processor configured for
    identifying a first phrase in the sentence associated with a first segmentation path;
    determining a first group of derivative phrases semantically associated with the first phrase;
    determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and
    segmenting the sentence based on the first evaluation score.
  12. The system of claim 11, wherein the processor is further configured for:
    determining a language model score for each modified sentence; and
    averaging the language model scores to derive the first evaluation score.
  13. The system of claim 12, wherein the language model score is generated using a language model according to natural language rules.
  14. The system of claim 11, wherein the processor is further configured for:
    determining a first base score based on the sentence including the first phrase; and
    segmenting the sentence based on a difference between the first base score and the first evaluation score.
  15. The system of claim 14, wherein the sentence is segmented according to the first segmentation path when the difference between the first base score and the first evaluation score is less than a threshold.
  16. The system of claim 11, wherein the processor is further configured for:
    identifying a second phrase in the sentence associated with a second segmentation path;
    determining a second group of derivative phrases semantically associated with the second phrase;
    determining a second evaluation score based on modified sentences generated by replacing the second phrase with the respective derivative phrase in the second group; and
    segmenting the sentence based on a difference between the first evaluation score and the second evaluation score.
  17. The system of claim 16, wherein the second phrase has at least one overlapping word with the first phrase.
  18. The system of claim 16, wherein the processor is further configured for:
    identifying the first phrase and the second phrase based on a difference between the first segmentation path and the second segmentation path.
  19. The system of claim 16, wherein the processor is further configured for:
    selecting one of the first and second segmentation paths that has the higher evaluation score; and
    segmenting the sentence according to the selected segmentation path.
  20. A non-transitory computer-readable medium that stores a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence, the method comprising:
    identifying a first phrase in the sentence associated with a first segmentation path;
    determining a first group of derivative phrases semantically associated with the first phrase;
    determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derivative phrase in the first group; and
    segmenting the sentence based on the first evaluation score.
PCT/CN2017/095305 2017-07-31 2017-07-31 System and method for segmenting a sentence WO2019023893A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP17920149.6A EP3642733A4 (en) 2017-07-31 2017-07-31 System and method for segmenting a sentence
CN201780093452.0A CN110945514B (en) 2017-07-31 2017-07-31 System and method for segmenting sentences
PCT/CN2017/095305 WO2019023893A1 (en) 2017-07-31 2017-07-31 System and method for segmenting a sentence
TW107125631A TWI676167B (en) 2017-07-31 2018-07-25 System and method for segmenting a sentence and relevant non-transitory computer-readable medium
US16/749,956 US11132506B2 (en) 2017-07-31 2020-01-22 System and method for segmenting a sentence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/095305 WO2019023893A1 (en) 2017-07-31 2017-07-31 System and method for segmenting a sentence

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/749,956 Continuation US11132506B2 (en) 2017-07-31 2020-01-22 System and method for segmenting a sentence

Publications (1)

Publication Number Publication Date
WO2019023893A1 true WO2019023893A1 (en) 2019-02-07

Family

ID=65232340

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/095305 WO2019023893A1 (en) 2017-07-31 2017-07-31 System and method for segmenting a sentence

Country Status (5)

Country Link
US (1) US11132506B2 (en)
EP (1) EP3642733A4 (en)
CN (1) CN110945514B (en)
TW (1) TWI676167B (en)
WO (1) WO2019023893A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3751445A4 (en) * 2019-04-26 2021-03-10 Wangsu Science & Technology Co., Ltd. Text labeling method and device based on teacher forcing
US11132506B2 (en) * 2017-07-31 2021-09-28 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for segmenting a sentence
US11544466B2 (en) 2020-03-02 2023-01-03 International Business Machines Corporation Optimized document score system using sentence structure analysis function

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080091407A1 (en) * 2006-09-28 2008-04-17 Kentaro Furihata Apparatus performing translation process from inputted speech
US20120078612A1 (en) * 2010-09-29 2012-03-29 Rhonda Enterprises, Llc Systems and methods for navigating electronic texts
US20140244240A1 (en) * 2013-02-27 2014-08-28 Hewlett-Packard Development Company, L.P. Determining Explanatoriness of a Segment
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
CN106407235A (en) * 2015-08-03 2017-02-15 北京众荟信息技术有限公司 A semantic dictionary establishing method based on comment data

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006018354A (en) * 2004-06-30 2006-01-19 Advanced Telecommunication Research Institute International Text division device and natural language processor
US7680648B2 (en) * 2004-09-30 2010-03-16 Google Inc. Methods and systems for improving text segmentation
JP4073459B2 (en) * 2006-04-17 2008-04-09 光芳 塚原 Sentence analyzer
JP2008021093A (en) * 2006-07-12 2008-01-31 National Institute Of Information & Communication Technology Sentence conversion processing system, translation processing system having sentence conversion function, voice recognition processing system having sentence conversion function, and speech synthesis processing system having sentence conversion function
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search
JP5630138B2 (en) * 2010-08-12 2014-11-26 富士ゼロックス株式会社 Sentence creation program and sentence creation apparatus
CN102479191B (en) * 2010-11-22 2014-03-26 阿里巴巴集团控股有限公司 Method and device for providing multi-granularity word segmentation result
US9020806B2 (en) * 2012-11-30 2015-04-28 Microsoft Technology Licensing, Llc Generating sentence completion questions
CN103035243B (en) * 2012-12-18 2014-12-24 中国科学院自动化研究所 Real-time feedback method and system of long voice continuous recognition and recognition result
TWI501097B (en) 2012-12-22 2015-09-21 Ind Tech Res Inst System and method of analyzing text stream message
CN103544309B (en) * 2013-11-04 2017-03-15 北京中搜网络技术股份有限公司 A kind of retrieval string method for splitting of Chinese vertical search
CN103793491B (en) * 2014-01-20 2017-01-25 天津大学 Chinese news story segmentation method based on flexible semantic similarity measurement
CN103927358B (en) * 2014-04-15 2017-02-15 清华大学 text search method and system
JP6482073B2 (en) * 2015-06-08 2019-03-13 日本電信電話株式会社 Information processing method, apparatus, and program
TW201715420A (en) * 2015-10-30 2017-05-01 元智大學 Method and system for analysing the weight of text
WO2018039644A1 (en) * 2016-08-25 2018-03-01 Purdue Research Foundation System and method for controlling a self-guided vehicle
US11087068B2 (en) * 2016-10-31 2021-08-10 Fujifilm Business Innovation Corp. Systems and methods for bringing document interactions into the online conversation stream
EP3642733A4 (en) * 2017-07-31 2020-07-22 Beijing Didi Infinity Technology and Development Co., Ltd. System and method for segmenting a sentence
CN110998589B (en) * 2017-07-31 2023-06-27 北京嘀嘀无限科技发展有限公司 System and method for segmenting text

Also Published As

Publication number Publication date
US11132506B2 (en) 2021-09-28
TWI676167B (en) 2019-11-01
EP3642733A1 (en) 2020-04-29
EP3642733A4 (en) 2020-07-22
US20200160000A1 (en) 2020-05-21
CN110945514B (en) 2023-08-25
CN110945514A (en) 2020-03-31
TW201911289A (en) 2019-03-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17920149

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017920149

Country of ref document: EP

Effective date: 20200123

NENP Non-entry into the national phase

Ref country code: DE