CN107315732B - Chinese English discovering method and system - Google Patents

Info

Publication number
CN107315732B
Authority
CN
China
Prior art keywords
word
english
collocation
collocations
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610281264.8A
Other languages
Chinese (zh)
Other versions
CN107315732A (en)
Inventor
盛志超
张凯波
陈志刚
魏思
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201610281264.8A
Publication of CN107315732A
Application granted
Publication of CN107315732B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a method and a system for discovering Chinese English. The method comprises the following steps: acquiring an English sentence to be detected; obtaining the topic information of each word in the sentence; generating the collocations in the sentence based on the topic information of each word; and determining whether any of the collocations is a Chinese English collocation. If so, the sentence is determined to contain Chinese English; if not, it is determined not to contain Chinese English. Because the judgment is based on the collocations in the sentence, the accuracy of determining whether the sentence contains Chinese English can be improved.

Description

Chinese English discovering method and system
Technical Field
The invention relates to the technical field of natural language understanding and text processing, in particular to a method and a system for discovering Chinese English.
Background
In China, English is a core school subject that accompanies learners throughout their education. With the continuous development of education informatization technology, traditional teaching has changed greatly, and intelligent teaching and learning systems, such as automatic grading and learning-status diagnosis, are gradually being applied to daily teaching. Automatic grading of English compositions has become an important part of intelligent teaching. Chinese English is a common error in English compositions, and the accuracy with which it is detected directly affects the grading result for the whole composition, so the discovery of Chinese English has long been a research focus in the field.
Most existing approaches to discovering Chinese English are resource-based. As shown in FIG. 1, they mainly comprise: collecting network resources in advance and constructing a Chinese English set; and then judging whether the text to be checked contains Chinese English by looking it up in the pre-constructed set. The reliability and accuracy of such methods depend entirely on whether the pre-constructed set covers all possible Chinese English. In practice, because resources are scarce and Chinese English errors vary from person to person, it is impossible to construct a set that covers every possible case, so the existing methods perform poorly.
Disclosure of Invention
The embodiment of the invention provides a method and a system for discovering Chinese English, which aim to solve the problem of low accuracy of a method for discovering Chinese English based on resources in the prior art.
Therefore, the embodiment of the invention provides the following technical scheme:
A method for discovering Chinese English comprises the following steps:
acquiring an English sentence to be detected;
obtaining the topic information of each word in the English sentence to be detected;
generating collocations in the English sentence to be detected based on the topic information of each word;
determining whether a Chinese English collocation exists among the collocations;
if so, determining that the English sentence to be detected contains Chinese English;
and if not, determining that the English sentence to be detected does not contain Chinese English.
Preferably, the method further comprises: constructing a topic extraction model in advance;
the obtaining of the topic information of each word in the English sentence to be detected comprises:
obtaining the topic information of each word in the English sentence to be detected based on the topic extraction model.
Preferably, constructing the topic extraction model comprises:
collecting a natural English corpus, and labeling the topic of each word in the natural English corpus;
and training on the natural English corpus and its topic labeling information to obtain the topic extraction model.
Preferably, the method further comprises: constructing a collocation quality judgment model in advance;
the determining whether a Chinese English collocation exists among the collocations comprises:
determining the quality of each collocation based on the collocation quality judgment model;
and if a collocation is a high-quality collocation and no matching collocation is found in the pre-constructed collocation library, determining that the collocation is a Chinese English collocation.
Preferably, constructing the collocation quality judgment model comprises:
collecting a natural English corpus, and labeling the topic of each word in the natural English corpus;
generating collocations in the natural English corpus based on the topic labeling information of each word;
extracting collocation features and labeling collocation quality, wherein the collocation features comprise any one or more of the following: the co-occurrence frequency of the collocation, the pointwise mutual information between different words in the collocation, the inverse document frequency of each word in the collocation, the number of stop words in the collocation, and the ratio of the co-occurrence frequency of the current collocation to the frequency of its lowest-frequency sub-collocation;
and training on the collocation features and the quality labeling information to obtain the collocation quality judgment model.
Preferably, the method further comprises:
displaying English sentences containing Chinese English and/or Chinese English collocations in a visual and/or auditory mode; and/or
And if the English sentence to be detected contains Chinese English, prompting in a visual and/or auditory mode.
A chinese english discovery system comprising:
the sentence acquisition module is used for acquiring English sentences to be detected;
the theme acquisition module is used for acquiring theme information of each word in the English sentence to be detected;
the collocation generating module is used for generating collocations in the English sentence to be detected based on the topic information of each word;
the determining module is used for determining whether the Chinese English collocations exist in the collocations; if so, determining that the English sentence to be detected contains Chinese English; and if not, determining that the English sentence to be detected does not contain Chinese English.
Preferably, the system further comprises:
the first model building module is used for building a theme extraction model in advance;
the theme obtaining module is specifically configured to obtain theme information of each word in the to-be-detected english sentence based on the theme extraction model.
Preferably, the first model building module comprises:
the first corpus collection unit is used for collecting natural English corpuses;
the first theme labeling unit is used for labeling the theme of each word in the natural English corpus;
and the first model training unit is used for training according to the natural English corpus and the theme marking information thereof to obtain a theme extraction model.
Preferably, the system further comprises:
the second model building module is used for building a collocated word quality judgment model in advance;
the determining module is specifically used for determining the quality of each collocated word based on the collocated word quality judging model; and if the collocation word is a high-quality collocation word and no collocation word matched with the collocation word is found in the pre-constructed collocation word library, determining that the collocation word is a Chinese English collocation word.
Preferably, the second model building module comprises:
the second corpus collecting unit is used for collecting natural English corpuses;
the second theme labeling unit is used for labeling the theme of each word in the natural English corpus;
the generating unit is used for generating collocation words in the natural English corpus based on the subject marking information of each word;
the feature extraction unit is used for extracting collocation features, and the collocation features comprise any one or more of the following: the co-occurrence frequency of the collocation, the pointwise mutual information between different words in the collocation, the inverse document frequency of each word in the collocation, the number of stop words in the collocation, and the ratio of the co-occurrence frequency of the current collocation to the frequency of its lowest-frequency sub-collocation;
the quality labeling unit is used for labeling the quality of the collocations;
and the second model training unit is used for training on the collocation features and the quality labeling information to obtain the collocation quality judgment model.
Preferably, the system further comprises:
the display module is used for displaying English sentences containing Chinese English and/or Chinese English collocations in a visual and/or auditory mode; and/or
And the prompting module is used for prompting in a visual and/or auditory mode if the English sentence to be detected contains Chinese English.
According to the Chinese English discovering method and system provided by the embodiments of the invention, the topic information of each word in the English sentence to be detected is obtained, the collocations of the sentence are generated based on that topic information, each collocation is judged as to whether it is a Chinese English collocation, and the sentence is judged to contain Chinese English according to whether it contains a Chinese English collocation. A traditional Chinese English set is difficult to collect because Chinese English errors cannot be enumerated exhaustively. The present method instead judges on the basis of collocations: because the number of collocations in natural English is limited, a library of natural English collocations can be constructed in advance, and the collocations in the sentence to be detected can be matched against it to judge whether each one is a Chinese English collocation, which improves the accuracy of judging whether the sentence contains Chinese English.
Furthermore, the invention divides collocations into two classes, high quality and low quality, and judges only the high-quality collocations as to whether they are Chinese English collocations. High-quality collocations are common word combinations or common collocation patterns; low-quality collocations are all others. This further avoids misjudging low-quality collocations as Chinese English merely because English sentences admit many different word combinations, and thus improves the accuracy of Chinese English discovery.
Furthermore, the method can obtain the topic information of each word in the English sentence to be detected based on the pre-constructed topic extraction model, and is simple, efficient and high in accuracy.
Furthermore, the method can judge the quality of each collocation word based on the pre-constructed collocation word quality judgment model, and is simple, efficient and high in accuracy.
Further, when the quality of each collocation is determined with the collocation quality judgment model, the collocation features may include any one or more of the following: the co-occurrence frequency of the collocation, the pointwise mutual information between different words in the collocation, the inverse document frequency of each word in the collocation, the number of stop words in the collocation, and the ratio of the co-occurrence frequency of the current collocation to the frequency of its lowest-frequency sub-collocation. Judging quality from several angles through several features can effectively improve the accuracy of the quality judgment.
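The five features listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, since the patent does not give formulas: counts are taken at sentence level, pointwise mutual information is averaged over word pairs, a sub-collocation is taken to be the collocation with one word removed, and the stop-word list and all function names are hypothetical.

```python
import math
from itertools import combinations

# Illustrative stop-word list (an assumption, not taken from the patent).
STOP_WORDS = {"a", "an", "the", "this", "that", "of", "to", "in", "very"}


def cooccurrence_count(words, sentences):
    """Number of sentences that contain every word in `words`."""
    target = set(words)
    return sum(1 for sentence in sentences if target <= set(sentence))


def collocation_features(collocation, sentences):
    """Compute the five features for one collocation (a tuple of >= 2 words)."""
    n = len(sentences)
    co = cooccurrence_count(collocation, sentences)

    # Pointwise mutual information, averaged over all word pairs.
    pmis = []
    for x, y in combinations(collocation, 2):
        p_xy = cooccurrence_count((x, y), sentences) / n
        p_x = cooccurrence_count((x,), sentences) / n
        p_y = cooccurrence_count((y,), sentences) / n
        pmis.append(math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf"))

    # Inverse document frequency of each word, treating a sentence as a document.
    idf = [math.log(n / max(1, cooccurrence_count((w,), sentences)))
           for w in collocation]

    # Sub-collocations: the collocation with exactly one word removed.
    subs = combinations(collocation, len(collocation) - 1)
    min_sub = min(cooccurrence_count(s, sentences) for s in subs)

    return {
        "cooccurrence": co,
        "pmi": sum(pmis) / len(pmis),
        "idf": idf,
        "stop_words": sum(1 for w in collocation if w in STOP_WORDS),
        "frequency_ratio": co / min_sub if min_sub else 0.0,
    }


corpus = [
    ["make", "a", "decision"],
    ["make", "a", "cake"],
    ["take", "a", "decision"],
    ["make", "progress"],
]
features = collocation_features(("make", "decision"), corpus)
```

The resulting dictionary could serve as the one-dimensional or multi-dimensional feature input that a quality classifier would consume.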
Furthermore, to limit manual labor and improve efficiency, the training collocations are selected to cover the various collocation types, such as verb + noun and adjective + noun, and within each type the collocations with higher frequency are preferentially selected for quality labeling. This effectively reduces the amount of manual work and improves efficiency.
Furthermore, the pre-constructed collocation library provided by the invention is a library of correct collocations. Because the number of collocations in natural English is limited, the library is relatively easy to construct; for example, collocations appearing in an English-Chinese dictionary, or common verb + noun and adjective + noun collocations, can be imported into it.
Furthermore, the invention can display English sentences containing Chinese English and/or the Chinese English collocations in a visual and/or auditory form, and/or give a visual and/or auditory prompt if the English sentence to be detected contains Chinese English. Several display modes are given so that reviewers and/or authors can find and correct the errors.
Drawings
In order to more clearly illustrate the embodiments of the present application or technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 is a flowchart of a prior-art method for discovering Chinese English;
FIG. 2 is a flowchart of a method for discovering Chinese English according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for constructing a topic extraction model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining whether a Chinese English collocation exists among the collocations according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for constructing a collocation quality judgment model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a Chinese English discovery system according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a first model building module according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a second model building module according to an embodiment of the present invention;
FIG. 9 is another schematic structural diagram of a Chinese English discovery system according to an embodiment of the present invention.
Detailed Description
So that those skilled in the art may better understand the solutions of the embodiments of the invention, the invention is described in further detail below with reference to the drawings and embodiments. The following embodiments are illustrative only and are not to be construed as limiting the invention.
According to the Chinese English discovering method and system, the topic of each word in the English sentence to be detected is extracted, collocations are generated according to the topics in the sentence, and whether the sentence contains Chinese English is then judged based on the collocations, where a collocation is a combination of words that share the same topic within the same sentence. A traditional Chinese English set is difficult to collect because Chinese English errors cannot be enumerated exhaustively. The invention instead judges on the basis of collocations: because the number of collocations in natural English is limited, a set covering all the collocations of natural English can be constructed, which effectively improves the accuracy of Chinese English discovery. In addition, the collocations can be classified, for example into high quality and low quality, and only the high-quality collocations judged as to whether they are Chinese English collocations. High-quality collocations may be common word combinations or common collocation patterns, and low-quality collocations may be all others (such as uncommon collocations). This further avoids misjudging low-quality collocations as Chinese English merely because English sentences admit many different word combinations.
In order to better understand the technical solutions and effects of the present invention, the following detailed descriptions will be made with reference to the flowcharts and specific embodiments.
Fig. 2 is a flowchart of a method for discovering chinese english according to an embodiment of the present invention, which includes the following steps:
Step S01, acquiring an English sentence to be detected.
In this embodiment, the English sentence to be detected may be entered by the user as text; it may be English speech input by the user and converted to text by methods such as speech recognition; or it may be extracted from an image obtained by the user through technologies such as Optical Character Recognition (OCR). No limitation is imposed here.
In one embodiment, an examinee's answers in an English test are scanned to obtain images of the answers, and the English sentences in the answers are then obtained through OCR.
Step S02, obtaining the topic information of each word in the English sentence to be detected.
In this embodiment, the topic information is statistical information about natural language. When a natural corpus is represented as a set of words, its dimensionality is very high; representing it through topics reduces the dimensionality, and topic information also helps to mine the implicit relations between words. In practice, K topics may be assumed, each associated with a set of words that can express it, and documents are regarded as mixtures of these K topics. The number of topics can be determined through extensive experiments, the optimum being reached when the collocations obtained with that number of topics cover all collocations in the natural corpus; it can also be set empirically, and is generally set to 50.
In practice, the topic information of each word in the English sentence to be detected can be obtained through a topic extraction model; specifically, the topic extraction model may be a Latent Dirichlet Allocation (LDA) document topic generation model.
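The shape of the output of step S02, one topic ID per word, can be illustrated with a minimal Python sketch. It does not perform LDA inference; it assumes a hypothetical table `word_topic_weights` (word to per-topic weight, such as a trained topic model might expose) and simply picks the highest-weight topic for each word. The function name, the table contents, and the `unknown_topic` value are all illustrative assumptions.

```python
def assign_topics(words, word_topic_weights, unknown_topic=-1):
    """Tag each word with its highest-weight topic ID.

    `word_topic_weights` is a hypothetical lookup table, not a real model API.
    Words missing from the table receive `unknown_topic`.
    """
    tagged = []
    for word in words:
        weights = word_topic_weights.get(word.lower())
        if weights:
            topic = max(weights, key=weights.get)
        else:
            topic = unknown_topic
        tagged.append((word, topic))
    return tagged


# Illustrative weights only; real values would come from a trained model.
weights = {
    "like": {1: 0.1, 2: 0.9},
    "very": {2: 0.8, 3: 0.2},
}
tagged = assign_topics(["Like", "very", "xyz"], weights)
```

In a real system the lookup would be replaced by the inference routine of whichever topic model is used.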
Step S03, generating collocations in the English sentence to be detected based on the topic information of each word.
In this embodiment, the words that share the same topic information in the English sentence to be detected form a collocation. In practice, a collocation is generated from the words belonging to the same topic, in the order in which they appear in the sentence, and each collocation contains at least two words.
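Step S03 can be sketched in Python as a grouping of words by topic ID that preserves sentence order and discards groups with fewer than two words. The function name and the (word, topic) input format are assumptions made for illustration, and the topic IDs in the example are illustrative rather than the output of a real model.

```python
def generate_collocations(tagged_words):
    """Group words by topic ID, keeping sentence order within each group.

    `tagged_words` is a list of (word, topic_id) pairs in sentence order.
    Only groups with at least two words count as collocations.
    """
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for word, topic in tagged_words:
        groups.setdefault(topic, []).append(word)
    return [tuple(words) for words in groups.values() if len(words) >= 2]


# Illustrative sentence with made-up topic IDs.
tagged = [("I", 1), ("like", 2), ("this", 3),
          ("skirt", 3), ("very", 2), ("much", 2)]
collocations = generate_collocations(tagged)
```

Here "I" is the only word of topic 1, so it forms no collocation; the two remaining groups are returned in the order in which their topics first appear in the sentence.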
Step S04, determining whether a Chinese English collocation exists among the collocations; if so, determining that the English sentence to be detected contains Chinese English; and if not, determining that the English sentence to be detected does not contain Chinese English.
In this embodiment, whether a collocation is a Chinese English collocation may be determined based on a pre-constructed natural English collocation library. Specifically, if the current collocation exists in the natural English collocation library, it is determined not to be a Chinese English collocation; if it does not exist in the library, it is determined to be a Chinese English collocation. Alternatively, a Chinese English collocation library may be constructed in advance: if the current collocation exists in that library, it is determined to be Chinese English, and otherwise it is determined not to be. Two or more collocation libraries may also be used together, for example a natural English collocation library and a Chinese English collocation library: if the current collocation does not exist in the natural English library, matching continues in the Chinese English library; if a match is found there, the current collocation is determined to be a Chinese English collocation, and if not, it is determined not to be one. This can further improve the accuracy of Chinese English discovery.
The above is only an example of determining whether the Chinese-english collocation word exists in the collocation words through the pre-constructed collocation word library, and other forms of setting, use order or collocation combination may also exist, and the above example cannot be understood as a limitation of the present invention.
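The library lookups described for step S04 can be sketched as follows. This is a minimal Python illustration that assumes each library is an in-memory set of lower-cased word tuples; the function names and the example entries are hypothetical. With only the natural English library, absence from it marks a collocation as Chinese English; when a Chinese English library is also supplied, a collocation missing from the natural library is flagged only if it is found in the Chinese English library.

```python
def is_chinese_english(collocation, natural_lexicon, chinese_english_lexicon=None):
    """Decide whether one collocation is a Chinese English collocation."""
    key = tuple(word.lower() for word in collocation)
    if key in natural_lexicon:
        return False  # known natural English collocation
    if chinese_english_lexicon is None:
        return True   # natural-library-only strategy
    return key in chinese_english_lexicon  # combined two-library strategy


def sentence_contains_chinese_english(collocations, natural_lexicon,
                                      chinese_english_lexicon=None):
    """A sentence contains Chinese English if any of its collocations does."""
    return any(
        is_chinese_english(c, natural_lexicon, chinese_english_lexicon)
        for c in collocations
    )


# Illustrative library contents only.
natural = {("make", "progress")}
chinese_english = {("learn", "knowledge")}
```

For example, `is_chinese_english(("eat", "medicine"), natural)` is flagged under the single-library strategy but not under the combined strategy, because the pair is absent from both illustrative libraries.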
Further, the method may further include: displaying English sentences containing Chinese English and/or the Chinese English collocations in a visual and/or auditory form; and/or giving a visual and/or auditory prompt if the English sentence to be detected contains Chinese English, for example a voice announcement that Chinese English is present. In practice, the English sentences containing Chinese English and/or the Chinese English collocations may be presented in any one or more of the following ways:
displaying the English sentences containing Chinese English and/or the Chinese English collocations highlighted or in a color different from the other displayed content;
displaying candidate corrected English sentences and/or English collocations corresponding to the English sentences containing Chinese English and/or the Chinese English collocations, and marking the English sentences containing Chinese English and/or the Chinese English collocations.
Other display modes may of course be used; no limitation is imposed here.
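As one possible rendering of the highlighting mode, the following Python sketch wraps every word that belongs to a flagged collocation in plain-text markers. The marker strings and the function name are assumptions chosen for illustration; an actual system might use color or styling instead, and the flagged collocation in the example is arbitrary.

```python
def highlight(words, flagged_collocations, open_mark="[[", close_mark="]]"):
    """Return the sentence with words from flagged collocations marked."""
    flagged_words = {w for collocation in flagged_collocations for w in collocation}
    return " ".join(
        f"{open_mark}{w}{close_mark}" if w in flagged_words else w
        for w in words
    )


sentence = "I like this skirt very much".split()
marked = highlight(sentence, [("like", "very", "much")])
```

Matching here is by word form only, so a repeated word would be marked at every occurrence; tracking word positions would be needed to avoid that.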
In the embodiment of the invention, the subject information of each word in the English sentence to be detected is obtained, the collocated word in the English sentence to be detected is generated based on the subject information of each word, then whether each collocated word is a Chinese English collocated word is judged, and finally whether the English sentence to be detected contains Chinese English is determined by judging whether the English sentence to be detected contains the Chinese English collocated word. The collocation words generated based on the theme information can represent the current theme, and the number of the natural English collocation words is limited, so that a collocation word bank covering all the natural English collocation words can be constructed, whether the current collocation words are Chinese English collocation words or not can be effectively judged by utilizing the collocation word bank, whether the English sentence to be detected contains Chinese English or not is finally determined by judging whether the English sentence to be detected contains the Chinese English collocation words or not, and the accuracy rate of finding Chinese English is improved.
In another embodiment, obtaining the topic information of each word in the English sentence to be detected comprises: obtaining the topic information of each word in the English sentence to be detected based on a topic extraction model. To this end, the invention also provides a method for constructing the topic extraction model; FIG. 3 is a flowchart of that method, which comprises the following steps:
Step S31, collecting a natural English corpus, and labeling the topic of each word in the natural English corpus.
In this embodiment, the natural English corpus may be collected from the network or taken from an existing corpus, such as English novels, English papers, English scripts, and standard answers to English test questions. The corpus can also be screened according to different requirements, for example for American English or British English. Each word in the natural English corpus is then labeled with its topic. It should be noted that the labeling information may be obtained by manually labeling the collected corpus, or by directly collecting a natural English corpus that already carries topic labeling information; no limitation is imposed here.
Step S32, training on the natural English corpus and its topic labeling information to obtain the topic extraction model.
In this embodiment, the topic extraction model may be an LDA model. Its input is an English sentence and its output is the topic information of each word in that sentence. The natural English corpus collected in step S31 is input into the topic extraction model, and the model parameters are adjusted so that the output approaches the topic information labeled in advance for each word, thereby training the topic extraction model.
In a specific embodiment, taking the sentence "I like this skirt very much" as an example, the English sentence labeled with topic information in advance may be: I:1 like:2 this:3 skirt:3 very:2 much:2, where the number after each word denotes the topic to which the word belongs, i.e. the topic information of the word. The number of topics of the topic extraction model is determined in advance. The sentence "I like this skirt very much" is input into the topic extraction model, and the model parameters are adjusted so that the output approaches I:1 like:2 this:3 skirt:3 very:2 much:2; training on a large amount of natural English corpus in this way yields the trained topic extraction model. In this example there are two collocations: "like very much" and "this skirt".
The embodiment of the invention can obtain the topic information of each word in the English sentence to be detected based on the pre-constructed topic extraction model, and has the advantages of simplicity, high efficiency and high accuracy.
In other embodiments, after the collocations in the English sentence to be detected are generated, the quality of each collocation is further determined, and high-quality and low-quality collocations are treated differently, to further improve the accuracy of detecting Chinese English collocations. FIG. 4 is a flowchart of the method for determining whether a Chinese English collocation exists among the collocations, which comprises:
Step S41, determining the quality of each collocation based on the collocation quality judgment model.
In this embodiment, the collocation quality judgment model may be a Support Vector Machine (SVM) or a classifier trained with an algorithm such as random forest; no limitation is imposed here. The input of the model is a one-dimensional or multi-dimensional collocation feature vector, and the output is the quality of the collocation: high quality or low quality. The quality of a collocation reflects whether it is a common collocation or follows a common collocation pattern, such as verb + noun or adjective + noun.
It should be noted that the quality of each collocated word may also be determined based on a rule or the like, for example, a corresponding threshold is set for each collocated word feature, and quality determination is performed according to the threshold, and the threshold may be determined by experience and a lot of experiments, which is not limited herein.
Step S42, if a collocation word is a high-quality collocation word and no matching collocation word is found in the pre-constructed collocation word library, determining that the collocation word is a Chinese English collocation word.
In this embodiment, treating high-quality and low-quality collocation words differently in this step further reduces the cases in which low-quality collocation words are misjudged as Chinese English because English sentences admit many forms of word combination, and thus improves the accuracy of Chinese English discovery.
It should be noted that the collocation word library in this embodiment is a library of natural English collocation words, i.e., a correct corpus. Because the number of correct collocation words is limited, such a library is relatively easy to construct; for example, collocation words appearing in an English-Chinese dictionary, or common verb + noun and adjective + noun collocations, can be imported into the correct corpus. Of course, a library of Chinese English collocation words may also be used, in which case collocation words that match an entry in that library are taken as Chinese English collocation words. Two or more collocation word libraries can be combined to improve the result, which is not limited herein.
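The decision logic of steps S41 and S42 can be sketched in Python as follows. This is only an illustration: the function `find_chinese_english`, the order-insensitive `frozenset` lookup, the toy library contents, and the stand-in quality check are all assumptions, and a real system would plug in the trained quality judgment model (or threshold rules) in place of `is_high_quality`.

```python
def find_chinese_english(collocations, is_high_quality, natural_library):
    """Return the collocation words judged to be Chinese English.

    A collocation word is flagged when it is high quality (step S41) and
    has no match in the library of natural English collocations (step S42).
    """
    flagged = []
    for colloc in collocations:
        # Assumption: matching ignores word order and letter case.
        key = frozenset(word.lower() for word in colloc)
        if is_high_quality(colloc) and key not in natural_library:
            flagged.append(colloc)
    return flagged


# Hypothetical library of natural English collocation words.
natural_library = {
    frozenset({"turn", "on", "light"}),
    frozenset({"this", "room"}),
}

# Stand-in quality check: treat a fixed set of collocation words as high quality.
high_quality = {("open", "light"), ("this", "room")}

flagged = find_chinese_english(
    [("open", "light"), ("this", "room"), ("the", "of")],
    lambda colloc: colloc in high_quality,
    natural_library,
)
contains_chinese_english = bool(flagged)
print(flagged, contains_chinese_english)
```

Low-quality collocation words such as `("the", "of")` are never flagged here, even though they are absent from the library, which is the misjudgment the distinguishing step is meant to avoid.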
In another embodiment, an embodiment of the present invention further provides a method for constructing a collocated word quality determination model, as shown in fig. 5, which is a flowchart of the method for constructing a collocated word quality determination model, and the method includes:
step S51, collecting natural English corpus, and labeling each word in the natural English corpus.
In this embodiment, the step may be the same as step S31, or the topic extraction model trained in step S32 may be used to label the topics of the words in the natural english corpus, which is not described in detail herein. Further, this step may be performed at the same time/different time as step S31 or directly call the result of step S31, which is not limited herein.
And step S52, generating collocation words in the natural English corpus based on the subject labeling information of each word.
In this embodiment, this step may be the same as step S03, and will not be described in detail here.
And step S53, extracting collocation word features and labeling the quality of the collocation words.
In this embodiment, the collocation word features include, but are not limited to, any one or more of the following: the co-occurrence frequency of the collocation word, the pointwise mutual information between different words in the collocation word, the inverse document frequency of each word in the collocation word, the number of stop words in the collocation word, and the ratio of the co-occurrence frequency of the current collocation word to the frequency of its lowest-frequency sub-collocation. These features are described in detail below.
1) Co-occurrence frequency of collocated words
The co-occurrence frequency of a collocation word is the sum of the co-occurrence frequencies of all orderings of its words. For example, if a collocation word is ABC, its orderings are ABC, ACB, BAC, BCA, CAB and CBA, and the co-occurrence frequency of the collocation word ABC is the sum of the frequencies with which these 6 orderings appear in the natural corpus.
2) Pointwise mutual information (PMI) between different words in the collocation word
The pointwise mutual information between two words is calculated as shown in formula (1):
$$\mathrm{PMI}(u_i, u_j) = \log \frac{p(v)}{p(u_i)\,p(u_j)} \qquad (1)$$
where p(v) is the co-occurrence frequency of the collocation word v, and p(u_i) and p(u_j) are the frequencies with which the words u_i and u_j, respectively, occur in the natural corpus.
3) IDF (inverse document frequency) of each word in the collocation word
IDF = log(D / Dt), where D is the number of English sentences in the natural corpus and Dt is the number of sentences in which the current word appears.
4) Number of stop words in collocation words
Stop words can be identified using existing techniques. In the simplest approach, a stop word list is constructed in advance, and a word is judged to be a stop word if it matches an entry in the list.
5) The frequency ratio of the co-occurrence frequency of the current collocations to the frequency of the sub-collocations with the minimum frequency
The frequency of the lowest-frequency sub-collocation refers to the lowest frequency in the natural corpus among the sub-collocations of the current collocation word. For example, if the current collocation word is ABC, its sub-collocations are AB, AC and BC, and the lowest of the frequencies of these 3 sub-collocations is used.
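The five features above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: corpus statistics are held in a `Counter` keyed by ordered word tuples, PMI uses the standard form log(p(v) / (p(u_i) p(u_j))) since formula (1) is not legible in the extracted text, sub-collocations are taken to be the combinations with one word removed (as in the ABC example), and all function names and toy counts are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations, permutations


def cooccurrence_frequency(colloc, colloc_counts):
    """Feature 1: sum the corpus counts of every ordering of the words."""
    return sum(colloc_counts[p] for p in set(permutations(colloc)))


def pmi(p_v, p_ui, p_uj):
    """Feature 2: pointwise mutual information (standard form, assumed)."""
    return math.log(p_v / (p_ui * p_uj))


def idf(word, sentences):
    """Feature 3: log(D / Dt); the word is assumed to occur at least once."""
    d_t = sum(1 for sentence in sentences if word in sentence)
    return math.log(len(sentences) / d_t)


def stop_word_count(colloc, stop_words):
    """Feature 4: number of words found in a pre-built stop word list."""
    return sum(1 for word in colloc if word in stop_words)


def min_subcollocation_ratio(colloc, colloc_counts):
    """Feature 5: co-occurrence frequency over the lowest sub-collocation frequency."""
    subs = combinations(colloc, len(colloc) - 1)
    lowest = min(cooccurrence_frequency(s, colloc_counts) for s in subs)
    if lowest == 0:
        return float("inf")  # assumption: an unseen sub-collocation gives an unbounded ratio
    return cooccurrence_frequency(colloc, colloc_counts) / lowest


# Toy statistics standing in for a natural English corpus.
colloc_counts = Counter({
    ("a", "b", "c"): 3, ("c", "a", "b"): 1,
    ("a", "b"): 5, ("b", "a"): 1, ("a", "c"): 4, ("b", "c"): 8,
})
sentences = [["a", "b", "c"], ["a", "b"], ["c"], ["d"]]

freq_abc = cooccurrence_frequency(("a", "b", "c"), colloc_counts)
ratio_abc = min_subcollocation_ratio(("a", "b", "c"), colloc_counts)
idf_a = idf("a", sentences)
stops_abc = stop_word_count(("a", "b", "c"), {"a"})
pmi_example = pmi(0.25, 0.5, 0.25)
```

With these toy counts the collocation word ABC has a co-occurrence frequency of 4, and its lowest-frequency sub-collocation is AC with a frequency of 4, so the ratio is 1.0.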
After the collocation word features are extracted, the quality of each collocation word is labeled, for example manually. Generally, collocation words that appear in a dictionary (such as an English-Chinese dictionary), as well as common verb + noun and adjective + noun collocations, are regarded as high-quality collocation words, whereas collocation words that have never been seen, occur too rarely, or carry little information are labeled as low-quality collocation words.
It should be noted that, for collocation words formed from the training corpus, considering labor cost and efficiency, collocation words with higher frequency are generally selected for labeling and feature extraction. In addition, in this embodiment, collocation words with topic labeling information can also be collected directly.
And step S54, training according to the collocation word features and the quality labeling information to obtain the collocation word quality judgment model.
In this embodiment, the collocation word features are input into the collocation word quality judgment model, the model parameters are adjusted so that the model output approaches the pre-labeled quality information, and the collocation word quality judgment model is obtained after training on a large amount of data.
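The idea of adjusting model parameters until the output approaches the labeled quality can be illustrated with a very small Python sketch. Note that this uses a plain perceptron as a dependency-free stand-in; it is not the SVM or random forest named in the embodiment, and the feature vectors (a scaled co-occurrence frequency and a PMI value), the labels, and the function names are invented for illustration.

```python
def score(weights, bias, x):
    """Linear score of one feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_perceptron(samples, labels, epochs=100, lr=0.1):
    """Fit a linear classifier; labels are 1 (high quality) or 0 (low quality)."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            pred = 1 if score(weights, bias, x) > 0 else 0
            if pred != y:
                # Move the parameters toward the labeled output.
                errors += 1
                delta = lr * (y - pred)
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta
        if errors == 0:  # every training sample already matches its label
            break
    return weights, bias


def predict(model, x):
    """Return 1 for high quality, 0 for low quality."""
    weights, bias = model
    return 1 if score(weights, bias, x) > 0 else 0


# Hypothetical [scaled co-occurrence frequency, PMI] features with manual labels.
samples = [[1.0, 2.0], [0.8, 1.5], [0.1, -1.0], [0.2, -0.5]]
labels = [1, 1, 0, 0]
model = train_perceptron(samples, labels)
predictions = [predict(model, x) for x in samples]
```

A production system would more likely rely on an established SVM or random forest implementation and a far larger labeled set; the sketch only shows the train-then-predict shape of step S54.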
In the embodiment of the invention, the quality of each collocation word can be judged using the collocation word quality judgment model. Because the multi-dimensional collocation word features assess quality from multiple angles, the accuracy of the quality judgment can be effectively improved.
Correspondingly, the present invention further provides a chinese english discovery system corresponding to the chinese english discovery method, as shown in fig. 6, which is a schematic structural diagram of the chinese english discovery system, and the system includes:
a sentence acquisition module 601, configured to acquire an english sentence to be detected;
a topic obtaining module 602, configured to obtain topic information of each word in the to-be-detected english sentence;
a collocated word generating module 603, configured to generate collocated words in the to-be-detected english sentence based on the subject information of each word;
a determining module 604, configured to determine whether a Chinese-english collocated word exists in the collocated word; if so, determining that the English sentence to be detected contains Chinese English; and if not, determining that the English sentence to be detected does not contain Chinese English.
Further, the system may further include: a first model construction module 706, configured to construct a topic extraction model in advance; the topic obtaining module 602 is specifically configured to obtain topic information of each word in the to-be-detected english sentence based on the topic extraction model.
As shown in fig. 7, it is a schematic structural diagram of the first model building module 706, which includes:
the first corpus collecting unit 7061 is configured to collect natural english corpuses;
a first topic labeling unit 7062, configured to label topics of the words in the natural english corpus;
and the first model training unit 7063 is configured to train according to the natural english corpus and the topic labeling information thereof to obtain a topic extraction model.
The method can acquire the topic information of each word in the English sentence to be detected based on the pre-constructed topic extraction model, and is simple, efficient and high in accuracy.
Further, the system may further include: a second model construction module 707 configured to construct a collocated word quality judgment model in advance;
the determining module 604 is specifically configured to determine the quality of each collocated word based on the collocated word quality determination model; and if the collocation word is a high-quality collocation word and no collocation word matched with the collocation word is found in the pre-constructed collocation word library, determining that the collocation word is a Chinese English collocation word.
As shown in fig. 8, it is a schematic structural diagram of the second model building module 707, including:
a second corpus collecting unit 7071, configured to collect natural english corpuses;
a second topic labeling unit 7072, configured to label topics of the words in the natural english corpus;
a generating unit 7073, configured to generate collocated words in the natural english corpus based on the topic tagging information of each word;
a feature extraction unit 7074, configured to extract the feature of the collocated word;
a quality labeling unit 7075, configured to label the quality of the collocated words;
and a second model training unit 7076, configured to train according to the collocated word features and the quality labeling information to obtain a collocated word quality judgment model.
It should be noted that the second corpus collection unit and the second topic marking unit may be the same as the first corpus collection unit and the first topic marking unit, respectively, and are not limited herein.
The method can judge the quality of each collocation word based on the pre-established collocation word quality judgment model, and is simple, efficient and high in accuracy.
In this embodiment, the collocations characteristics include any one or more of the following: the co-occurrence frequency of the collocations, the point mutual information between different words in the collocations, the reverse document frequency of each word in the collocations, the number of stop words in the collocations, and the frequency ratio of the co-occurrence frequency of the current collocations to the frequency of the sub-collocations with the minimum frequency.
Further, the system can also display, prompt and/or announce by voice the English sentence that contains Chinese English collocation words and/or the Chinese English collocation words themselves. Fig. 9 is another schematic structural diagram of the Chinese English discovery system, in which the system may further include:
a display module 808, configured to display, in a visual and/or auditory manner, an english sentence containing chinese english and/or a chinese english collocate; and/or
And the prompt module 809 is configured to prompt in a visual and/or audible form if the english sentence to be detected includes chinese english.
The user is prompted that the current English sentence contains Chinese English through the display module 808 and/or the prompt module 809. Specifically, the display module 808 includes any one or more of the following units:
the highlight display unit is used for displaying English sentences containing Chinese English and/or Chinese English collocations by adopting a highlight color or a color different from other display contents;
and the auxiliary display unit is used for displaying candidate corrected English sentences and/or English collocations corresponding to English sentences and/or Chinese English collocations containing Chinese English, and marking English sentences and/or Chinese English collocations containing Chinese English.
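As a simple illustration of the highlight display unit, the Python sketch below marks the words of a flagged collocation word inside the sentence. Plain-text markers stand in for the highlight color, and the function name `highlight`, the `**` marker, and the example sentence are assumptions rather than details from the embodiment.

```python
def highlight(sentence, flagged_words, marker="**"):
    """Wrap each word belonging to a flagged collocation word in a marker."""
    out = []
    for token in sentence.split():
        core = token.strip(".,!?;:")  # ignore surrounding punctuation when matching
        if core and core.lower() in flagged_words:
            out.append(token.replace(core, f"{marker}{core}{marker}", 1))
        else:
            out.append(token)
    return " ".join(out)


shown = highlight("Please open the light.", {"open", "light"})
print(shown)
```

In a graphical interface the marker would be replaced by a color or style change, and the auxiliary display unit could place a candidate correction next to the marked words.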
In addition, the system may further include a storage module (not shown) for storing relevant information such as model data, collocations quality, and the like. Therefore, the English sentence to be detected is conveniently subjected to computer automatic processing, and the final detection result is stored.
In the Chinese English discovery system provided by the embodiment of the present invention, the sentence acquisition module 601 acquires an English sentence to be detected, the topic obtaining module 602 obtains the topic information of each word in the sentence, the collocated word generating module 603 generates the collocation words in the sentence, and the determining module 604 determines whether the collocation words include a Chinese English collocation word, thereby determining whether the sentence contains Chinese English. A traditional Chinese English set is difficult to collect because Chinese English errors cannot be exhaustively enumerated. In contrast, here the collocation words in the English sentence to be detected are generated by the collocated word generating module 603 based on the topic information of each word obtained by the topic obtaining module 602, and because the number of natural English collocation words is limited, a collocation word library covering all natural English collocation words can be constructed. Using this library, whether the current collocation word is a Chinese English collocation word can be judged effectively, which improves the accuracy of Chinese English discovery.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above embodiments of the present invention have been described in detail, and the present invention is described herein using specific embodiments, but the above embodiments are only used to help understanding the method and system of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (11)

1. A method for discovering Chinese English, comprising:
acquiring an English sentence to be detected;
obtaining the topic information of each word in the English sentence to be detected based on a pre-constructed topic extraction model, wherein the topic extraction model comprises a document topic generation model;
generating collocation words in the English sentence to be detected based on the topic information of each word, including forming the words having the same topic information in the English sentence to be detected into a collocation word, wherein a collocation word refers to a combination of words having the same topic in the same sentence;
determining whether Chinese English collocations exist in the collocations;
if so, determining that the English sentence to be detected contains Chinese English;
and if not, determining that the English sentence to be detected does not contain Chinese English.
2. The method of claim 1, wherein constructing a topic extraction model comprises:
collecting natural English corpus, and labeling the topic of each word in the natural English corpus;
and training according to the natural English corpus and its topic labeling information to obtain the topic extraction model.
3. The method of claim 1, further comprising: constructing a collocated word quality judgment model in advance;
the determining whether the Chinese English collocations exist in the collocations comprises the following steps:
determining the quality of each collocated word based on the collocated word quality judgment model;
and if the collocation word is a high-quality collocation word and no collocation word matched with the collocation word is found in the pre-constructed collocation word library, determining that the collocation word is a Chinese English collocation word.
4. The method of claim 3, wherein constructing the collocated word quality determination model comprises:
collecting natural English corpus, and labeling the topic of each word in the natural English corpus;
generating collocation words in the natural English corpus based on the topic labeling information of each word;
extracting collocation word features and labeling collocation word quality, wherein the collocation word features comprise any one or more of the following: the co-occurrence frequency of the collocation word, the pointwise mutual information between different words in the collocation word, the inverse document frequency of each word in the collocation word, the number of stop words in the collocation word, and the ratio of the co-occurrence frequency of the current collocation word to the frequency of its lowest-frequency sub-collocation; the co-occurrence frequency of the collocation word refers to the sum of the co-occurrence frequencies of all orderings of the collocation word; the frequency of the lowest-frequency sub-collocation refers to the lowest frequency in the natural corpus among the sub-collocations of the current collocation word;
and training according to the collocation word features and the quality labeling information to obtain the collocation word quality judgment model.
5. The method according to any one of claims 1 to 4, further comprising:
displaying English sentences containing Chinese English and/or Chinese English collocations in a visual and/or auditory mode; and/or
And if the English sentence to be detected contains Chinese English, prompting in a visual and/or auditory mode.
6. A system for discovering chinese english, comprising:
the sentence acquisition module is used for acquiring English sentences to be detected;
the topic acquisition module is used for acquiring the topic information of each word in the English sentence to be detected based on a preset topic extraction model, and the topic extraction model comprises a document topic generation model;
a collocation word generation module, configured to generate a collocation word in the to-be-detected english sentence based on the topic information of each word, including forming each word with the same topic information in the to-be-detected english sentence into a collocation word, where the collocation word refers to a word combination with the same topic in the same sentence;
the determining module is used for determining whether the Chinese English collocations exist in the collocations; if so, determining that the English sentence to be detected contains Chinese English; and if not, determining that the English sentence to be detected does not contain Chinese English.
7. The system of claim 6, further comprising:
the first model building module is used for building a topic extraction model in advance;
the topic acquisition module is specifically configured to acquire the topic information of each word in the English sentence to be detected based on the topic extraction model.
8. The system of claim 7, wherein the first model building module comprises:
the first corpus collection unit is used for collecting natural English corpuses;
the first topic labeling unit is used for labeling the topic of each word in the natural English corpus;
and the first model training unit is used for training according to the natural English corpus and its topic labeling information to obtain a topic extraction model.
9. The system of claim 6, further comprising:
the second model building module is used for building a collocated word quality judgment model in advance;
the determining module is specifically used for determining the quality of each collocated word based on the collocated word quality judging model; and if the collocation word is a high-quality collocation word and no collocation word matched with the collocation word is found in the pre-constructed collocation word library, determining that the collocation word is a Chinese English collocation word.
10. The system of claim 9, wherein the second model building module comprises:
the second corpus collecting unit is used for collecting natural English corpuses;
the second topic labeling unit is used for labeling the topic of each word in the natural English corpus;
the generating unit is used for generating collocation words in the natural English corpus based on the topic labeling information of each word;
the feature extraction unit is used for extracting collocation word features, and the collocation word features comprise any one or more of the following: the co-occurrence frequency of the collocation word, the pointwise mutual information between different words in the collocation word, the inverse document frequency of each word in the collocation word, the number of stop words in the collocation word, and the ratio of the co-occurrence frequency of the current collocation word to the frequency of its lowest-frequency sub-collocation; the co-occurrence frequency of the collocation word refers to the sum of the co-occurrence frequencies of all orderings of the collocation word; the frequency of the lowest-frequency sub-collocation refers to the lowest frequency in the natural corpus among the sub-collocations of the current collocation word;
the quality labeling unit is used for labeling the quality of the collocation words;
and the second model training unit is used for training according to the collocation word features and the quality labeling information to obtain a collocation word quality judgment model.
11. The system according to any one of claims 6 to 10, further comprising:
the display module is used for displaying English sentences containing Chinese English and/or Chinese English collocations in a visual and/or auditory mode; and/or
And the prompting module is used for prompting in a visual and/or auditory mode if the English sentence to be detected contains Chinese English.
CN201610281264.8A 2016-04-27 2016-04-27 Chinese English discovering method and system Active CN107315732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610281264.8A CN107315732B (en) 2016-04-27 2016-04-27 Chinese English discovering method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610281264.8A CN107315732B (en) 2016-04-27 2016-04-27 Chinese English discovering method and system

Publications (2)

Publication Number Publication Date
CN107315732A CN107315732A (en) 2017-11-03
CN107315732B true CN107315732B (en) 2021-03-23

Family

ID=60185632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610281264.8A Active CN107315732B (en) 2016-04-27 2016-04-27 Chinese English discovering method and system

Country Status (1)

Country Link
CN (1) CN107315732B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866468A (en) * 2015-04-08 2015-08-26 清华大学深圳研究生院 Method for identifying false Chinese customer reviews
CN105005561A (en) * 2015-07-07 2015-10-28 刘改琳 Bilingual retrieval statistical translation system based on corpus
CN105260899A (en) * 2015-10-27 2016-01-20 清华大学深圳研究生院 Electronic business subject credibility evaluation method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7031911B2 (en) * 2002-06-28 2006-04-18 Microsoft Corporation System and method for automatic detection of collocation mistakes in documents

Also Published As

Publication number Publication date
CN107315732A (en) 2017-11-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant