CN117034917B - English text word segmentation method, device and computer readable medium - Google Patents

English text word segmentation method, device and computer readable medium

Info

Publication number
CN117034917B
Authority
CN
China
Prior art keywords
word
phrase
words
corpus
phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311292401.4A
Other languages
Chinese (zh)
Other versions
CN117034917A (en
Inventor
陈娟
欧阳昭连
潘黎姿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medical Information CAMS
Original Assignee
Institute of Medical Information CAMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medical Information CAMS filed Critical Institute of Medical Information CAMS
Priority to CN202311292401.4A priority Critical patent/CN117034917B/en
Publication of CN117034917A publication Critical patent/CN117034917A/en
Application granted granted Critical
Publication of CN117034917B publication Critical patent/CN117034917B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The method, device and computer readable medium address the word segmentation problem for English text. A phrase set is constructed by mining phrases according to the contextual relations of words in an English text corpus, and a replacement word set is constructed by identifying the similarity between different words in that corpus, so that both sets can be mined and built automatically, which in turn enables efficient and accurate English text word segmentation. In addition, stop words are mined automatically from the text, which allows domain-specific customization of stop words, avoids the segmentation errors caused by conventional general-purpose stop words, and reduces the cost of manually maintaining the stop word set. The content removed by stop word filtering is recorded so that audit feedback on that content can drive subsequent iterative tuning. The whole segmentation process and the filtered content are therefore transparent, which improves the interactivity between user and machine as well as the accuracy of the segmentation result.

Description

English text word segmentation method, device and computer readable medium
Technical Field
The application belongs to the technical field of natural language processing, and particularly relates to an English text word segmentation method, device and computer readable medium.
Background
Unstructured English text contains a large amount of information, and a high-quality word segmentation result is the basis for subsequent applications such as text feature extraction, text classification, text similarity calculation, text clustering and text topic modeling. Word segmentation of unstructured English text is therefore necessary, and how to segment English text efficiently and accurately has become a technical problem to be solved in the field.
Disclosure of Invention
Therefore, the application discloses an English text word segmentation method, device and computer readable medium, so as to at least solve the word segmentation problem of English text and realize efficient and accurate English text word segmentation.
The specific technical scheme is as follows:
an English text word segmentation method comprises the following steps:
acquiring a target text object to be processed, wherein the target text object is an English text object;
according to a pre-constructed phrase set, determining the words used for forming phrases in the target text object, and carrying out merging processing on words contained in the same phrase; the phrase set is a set of phrases obtained by carrying out phrase mining at least depending on the context relation of words in English text corpus;
According to a pre-constructed replacement word set, word replacement processing is carried out on words in the target text object; the replacement word set is a set of replacement words determined at least by identifying the similarity between different words in the English text corpus;
according to the configured filtering conditions corresponding to the stop word set, performing stop word filtering processing on the target text object;
performing word segmentation on the target text object subjected to the merging processing, the word replacement processing and the stop word filtering processing to obtain a word segmentation result;
obtaining audit feedback information of the filtered content of the filtering processing of the stop words, and adjusting and processing corresponding information sets based on the audit feedback information;
the phrase set construction process comprises the following steps:
acquiring English text corpus objects of each subdivision domain in a preset total domain;
performing rough word segmentation based on a preset rough word segmentation strategy on the total corpus formed by each English text corpus object, and performing Ngram combination on a rough word segmentation result according to a context order to obtain each Ngram phrase;
according to the occurrence frequency of each Ngram phrase in the total corpus, determining a total field high-frequency phrase which meets a first high-frequency condition relative to the total corpus in each Ngram phrase;
Determining subdivision domain-specific phrases in each Ngram phrase which do not meet the first high-frequency condition relative to the total corpus and meet the second high-frequency condition relative to the belonging English text corpus object according to the occurrence frequency of each Ngram phrase in the total corpus and the occurrence frequency of each Ngram phrase in the belonging English text corpus object respectively, so as to form the phrase set based on the total domain high-frequency phrases and the subdivision domain-specific phrases; wherein, the phrases in the phrase set are original parts of speech;
the construction process of the replacement word set comprises the following steps:
determining a first sub-similarity between different words or phrases in the total corpus according to word vectors corresponding to the different words or phrases in the total corpus formed by each English text corpus object;
determining second sub-similarity among different words or phrases in the total corpus according to the combination characteristics of letters Ngram corresponding to the different words or phrases in the total corpus;
according to the corresponding first sub-similarity and second sub-similarity, determining the similarity between different words or phrases in the total corpus;
extracting words or phrases meeting similarity conditions from the total corpus according to the similarity between different words or phrases in the total corpus, so as to form the replacement word set based on the extracted words or phrases; wherein, the replacement words in the replacement word set are original word parts;
The construction process of the stop word set comprises the following steps:
determining, for each word in the total English text corpus, the proportion of the number of English text corpus objects containing the word in the total number of English text corpus objects contained in the total English text corpus;
determining the word frequency of each word in the total English text corpus within the English text corpus object to which it belongs, and determining a stop word score corresponding to each word in the total English text corpus according to the corresponding proportion and word frequency;
identifying the words with the corresponding stop word scores meeting the score conditions as stop words, and obtaining the stop word set;
the obtaining audit feedback information of the filtered content of the filtering processing of the stop words, and adjusting and processing corresponding information sets based on the audit feedback information comprises the following steps:
obtaining audit feedback information provided by manually auditing the filtered content of the stop word filtering process;
adjusting at least one of the phrase set, the replacement word set, the stop word set, and the reserved word set based on the audit feedback information; the reserved word set comprises words and/or phrases to be retained in the word segmentation result during word segmentation, is formulated by a user, and can be updated as required according to the word segmentation result.
Optionally, after the word segmentation result is obtained, the method further includes:
converting each word and phrase in the word segmentation result into an original part of speech to obtain a corresponding original part of speech word;
and carrying out combination processing on the obtained original part-of-speech word according to the phrase set, carrying out word replacement processing on the obtained original part-of-speech word according to the replacement word set, and updating the word segmentation result by utilizing the combination processing and word replacement processing results of the original part-of-speech word.
An English text word segmentation apparatus comprising:
the acquisition module is used for acquiring a target text object to be processed, wherein the target text object is an English text object;
the merging processing module is used for determining the words used for forming phrases in the target text object according to the phrase set constructed in advance and merging the words contained in the same phrase; the phrase set is a set of phrases obtained by carrying out phrase mining at least depending on the context relation of words in English text corpus;
the word replacement processing module is used for carrying out word replacement processing on the words in the target text object according to a pre-constructed replacement word set; the replacement word set is a set of replacement words determined at least by identifying the similarity between different words in the English text corpus;
The stop word filtering processing module is used for carrying out stop word filtering processing on the target text object according to the configured filtering conditions corresponding to the stop word set;
the word segmentation processing module is used for carrying out word segmentation on the target text object subjected to the merging processing, the word replacement processing and the stop word filtering processing to obtain a word segmentation result;
the adjustment processing module is used for acquiring audit feedback information of the filtered content of the filtering processing of the stop words and adjusting the corresponding information set based on the audit feedback information;
the device also comprises an information set construction module for constructing the phrase set, the replacement word set and the stop word set;
the information set construction module is specifically configured to, when constructing the phrase set:
acquiring English text corpus objects of each subdivision domain in a preset total domain;
performing rough word segmentation based on a preset rough word segmentation strategy on the total corpus formed by each English text corpus object, and performing Ngram combination on a rough word segmentation result according to a context order to obtain each Ngram phrase;
according to the occurrence frequency of each Ngram phrase in the total corpus, determining a total field high-frequency phrase which meets a first high-frequency condition relative to the total corpus in each Ngram phrase;
Determining subdivision domain-specific phrases in each Ngram phrase which do not meet the first high-frequency condition relative to the total corpus and meet the second high-frequency condition relative to the belonging English text corpus object according to the occurrence frequency of each Ngram phrase in the total corpus and the occurrence frequency of each Ngram phrase in the belonging English text corpus object respectively, so as to form the phrase set based on the total domain high-frequency phrases and the subdivision domain-specific phrases; wherein, the phrases in the phrase set are original parts of speech;
the information set construction module is specifically configured to, when constructing the replacement word set:
determining a first sub-similarity between different words or phrases in the total corpus according to word vectors corresponding to the different words or phrases in the total corpus formed by each English text corpus object;
determining second sub-similarity among different words or phrases in the total corpus according to the combination characteristics of letters Ngram corresponding to the different words or phrases in the total corpus;
according to the corresponding first sub-similarity and second sub-similarity, determining the similarity between different words or phrases in the total corpus;
extracting words or phrases meeting similarity conditions from the total corpus according to the similarity between different words or phrases in the total corpus, so as to form the replacement word set based on the extracted words or phrases; wherein, the replacement words in the replacement word set are original word parts;
The information set construction module is specifically configured to, when constructing the stop word set:
determining, for each word in the total English text corpus, the proportion of the number of English text corpus objects containing the word in the total number of English text corpus objects contained in the total English text corpus;
determining the word frequency of each word in the total English text corpus within the English text corpus object to which it belongs, and determining a stop word score corresponding to each word in the total English text corpus according to the corresponding proportion and word frequency;
identifying the words with the corresponding stop word scores meeting the score conditions as stop words, and obtaining the stop word set;
the adjustment processing module is specifically configured to, when obtaining audit feedback information of the content filtered by the stop word filtering processing and adjusting the corresponding information set based on the audit feedback information:
obtaining audit feedback information provided by manually auditing the filtered content of the stop word filtering process;
adjusting at least one of the phrase set, the replacement word set, the stop word set, and the reserved word set based on the audit feedback information; the reserved word set comprises words and/or phrases to be retained in the word segmentation result during word segmentation, is formulated by a user, and can be updated as required according to the word segmentation result.
Optionally, the apparatus further includes:
the original part-of-speech processing module is used for: after the word segmentation result is obtained, each word and phrase in the word segmentation result are converted into an original part of speech, and a corresponding original part of speech word is obtained; and carrying out combination processing on the obtained original part-of-speech word according to the phrase set, carrying out word replacement processing on the obtained original part-of-speech word according to the replacement word set, and updating the word segmentation result by utilizing the combination processing and word replacement processing results of the original part-of-speech word.
A computer readable medium having stored thereon a computer program comprising program code for performing the English text word segmentation method as set forth in any one of the preceding claims.
According to the above technical scheme, the English text word segmentation method, device and computer readable medium disclosed in the application acquire a target text object to be processed (an English text object), determine the words in the target text object that form phrases according to the pre-constructed phrase set and merge the words belonging to the same phrase, perform word replacement on the words in the target text object according to the pre-constructed replacement word set, and then segment the target text object after the merging and word replacement, which effectively solves the word segmentation problem for English text. The phrase set is constructed at least by mining phrases according to the context relation of words in the English text corpus, and the replacement word set is constructed at least by identifying the similarity between different words in the English text corpus. Both information sets can therefore be mined automatically and built quickly and accurately, and efficient and accurate English text word segmentation can be achieved on the basis of them.
In addition, the method and the device automatically mine the stop words in the text, which allows domain-specific customization of stop words, avoids word segmentation errors caused by conventional general-purpose stop words, and reduces the cost of manually maintaining the stop word set. The content removed by the stop word filtering processing is recorded, so that audit feedback information on the filtered content can be used for subsequent iterative tuning. The whole word segmentation process and the filtered content are therefore transparent, which improves the interactivity between user and machine as well as the accuracy of the word segmentation result.
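The overall flow summarized above (phrase merging, word replacement, stop word filtering with a record of what was filtered) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `segment`, the whitespace-based rough split, the lowercasing and the greedy longest-match merging are choices made for the example and are not specified by the application.

```python
def segment(text, phrases, replacements, stop_words):
    """Return (kept_tokens, filtered_tokens) for one English text.

    phrases: set of space-joined phrases; replacements: dict word -> replacement;
    stop_words: set of words to filter. All names here are hypothetical.
    """
    tokens = text.lower().split()  # assumption: rough split on whitespace
    max_len = max((len(p.split()) for p in phrases), default=1)

    # Merge words that belong to the same phrase (greedy longest match, an assumption).
    merged, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in phrases:
                merged.append(candidate)
                i += n
                break

    # Word replacement according to the replacement word set.
    replaced = [replacements.get(t, t) for t in merged]

    # Stop word filtering; the filtered content is recorded for later audit.
    kept = [t for t in replaced if t not in stop_words]
    filtered = [t for t in replaced if t in stop_words]
    return kept, filtered


kept, filtered = segment(
    "The flu vaccine and the vaccination schedule",
    phrases={"flu vaccine"},
    replacements={"vaccination": "vaccine"},
    stop_words={"the", "and"},
)
print(kept)      # ['flu vaccine', 'vaccine', 'schedule']
print(filtered)  # ['the', 'and', 'the']
```

Returning the filtered tokens alongside the result is what makes the later manual audit step possible: a reviewer can inspect `filtered` and adjust any of the sets before the next run.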
Drawings
To illustrate the embodiments of the present application or the technical solutions in the related art more clearly, the drawings needed for describing the embodiments or the related art are briefly introduced below. The drawings described below are merely embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a construction process of phrase sets provided herein;
FIG. 2 is a flow chart of a construction process of a replacement word set provided in the present application;
FIG. 3 is a flow chart of a process for constructing a stop word set provided by the present application;
FIG. 4 is a flowchart of the method for word segmentation of English text provided in the present application;
FIG. 5 is a flowchart of an exemplary application of the method for word segmentation of English text provided in the present application;
FIG. 6 is a structural diagram of the English text word segmentation apparatus provided in the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The applicant found that known techniques have at least the following drawbacks when performing English text word segmentation:
1) Phrase collocations need to be preserved during English word segmentation. One known approach compiles a domain phrase dictionary from manual statistics and observation, which requires a large amount of manual work. A second approach extracts phrases from articles using preset phrase templates, but such templates are only generic, cannot keep up with new vocabulary and new patterns appearing in the text, and are not flexible enough.
2) The industry often uses general-domain stop words as the filtering condition and concentrates technical effort on the segmentation itself and on subsequent algorithms. General-purpose stop words, however, can reduce the accuracy of the segmentation result, and known techniques overlook the important influence of stop words on natural language processing results.
3) Known techniques typically identify by hand which words need to be merged and then manually create a hyponym/replacement word list, which is time-consuming and labor-intensive.
4) Known English text word segmentation methods usually take a piece of text as input and directly output the corresponding segmentation result, so it is difficult for the user to intervene in the segmentation process; interactivity is low and the accuracy of the segmentation result is hard to guarantee.
In summary, when segmenting English text, known techniques suffer from heavy manual workload, limited ability to intervene or interact during segmentation, low segmentation efficiency and low accuracy.
Based on this, the present application provides a method, an apparatus and a computer readable medium for word segmentation of English text, so as to solve at least some of the above-mentioned problems in the prior art.
To this end, the present application constructs in advance, by automatic mining, various information sets such as a phrase set, a replacement word set, a stop word set and a reserved word set. The constructed information sets may take forms including, but not limited to, a dictionary, a list or a phrase library; for example, the phrase set may be implemented as a phrase list or a phrase library (such as a term phrase library), and the replacement word set as a replacement word list or a replacement word library.
The phrase set is a set of phrases obtained by carrying out phrase mining at least depending on the context relation of words in the English text corpus, and the replacement word set is a set of replacement words determined at least by identifying the similarity between different words in the English text corpus.
Optionally, referring to FIG. 1, the phrase set construction process includes:
Step 101, acquiring English text corpus objects of each subdivision domain in a preset total domain.
The total field may be set as required, for example, the total field may be set as a medical field or may be set to include a medical field and other required fields, and the subdivided field may be each of the subdivided fields included in the total field.
The english text corpus object may be any unstructured english file/document of a desired type in a desired field, such as various english articles/documents in the desired field.
Step 102, performing rough word segmentation based on a preset rough word segmentation strategy on the total corpus formed by the English text corpus objects, and performing Ngram combination on the rough word segmentation results according to the context order to obtain each Ngram phrase.
Optionally, the rough word segmentation based on the preset rough word segmentation strategy may be, but is not limited to, splitting the text content of the English text corpus object on separators such as spaces.
In this embodiment, rough word segmentation is performed on all English text corpus objects in the total corpus, such as all articles, according to the preset rough word segmentation strategy to obtain the corresponding rough word segmentation results, and Ngram combination is then performed on the rough word segmentation results according to the context order to obtain each Ngram phrase.
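A minimal sketch of step 102, assuming whitespace as the only separator and word-level N values of 2 and 3; the function names and the choice of N are illustrative assumptions, not values given in the application.

```python
def rough_tokens(text):
    """Rough word segmentation: split on whitespace (one possible strategy)."""
    return text.split()


def word_ngrams(tokens, n_values=(2, 3)):
    """Combine adjacent tokens, in context order, into word-level Ngram phrases."""
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = rough_tokens("acute kidney injury risk")
print(word_ngrams(tokens, n_values=(2,)))
# ['acute kidney', 'kidney injury', 'injury risk']
```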
Step 103, determining total field high-frequency phrases which meet a first high-frequency condition relative to the total corpus in each Ngram phrase according to the occurrence frequency of each Ngram phrase in the total corpus.
Optionally, a frequency threshold may be specifically set, the occurrence frequency of each Ngram phrase in the total corpus is counted, and the Ngram phrase whose occurrence frequency in the total corpus reaches the frequency threshold is determined and used as the total domain high-frequency phrase which satisfies the first high-frequency condition relative to the total corpus.
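The frequency-threshold selection in step 103 can be sketched as below. The function name and the input layout (one list of Ngram phrases per corpus object) are assumptions made for illustration; "reaches the threshold" is read here as greater than or equal to.

```python
from collections import Counter


def total_domain_high_frequency(docs_ngrams, threshold):
    """Return Ngram phrases whose count over the total corpus reaches the threshold.

    docs_ngrams: list of lists, one list of Ngram phrases per corpus object.
    """
    counts = Counter(g for grams in docs_ngrams for g in grams)
    return {g for g, c in counts.items() if c >= threshold}


docs = [
    ["clinical trial", "trial design"],
    ["clinical trial", "gene therapy"],
    ["clinical trial"],
]
print(total_domain_high_frequency(docs, threshold=3))  # {'clinical trial'}
```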
Step 104, determining sub-division domain specific phrases in each Ngram phrase which do not meet the first high-frequency condition relative to the total corpus and meet the second high-frequency condition relative to the English text corpus object according to the occurrence frequency of each Ngram phrase in the total corpus and the occurrence frequency of each Ngram phrase in the English text corpus object, so as to form the phrase set based on the high-frequency phrases and the sub-division domain specific phrases.
Optionally, a phrase word frequency proportion matrix is established with the text corpus object index as rows and the phrase index as columns. For example, with the article index as rows and the phrase index as columns, an m×n phrase word frequency proportion matrix is built:
$P = (p_{ij})_{m \times n}, \quad p_{ij} = \frac{f_{ij}}{F_i}$
where m is the number of rows and represents the number of English text corpus objects, n is the number of columns and represents the number of Ngram phrases, and $p_{ij}$ is the phrase word frequency proportion at row i, column j of the matrix. In the formula for $p_{ij}$, the numerator $f_{ij}$ is the word frequency of the corresponding phrase in the text corpus object to which it belongs, and the denominator $F_i$ is the total word frequency of the vocabulary in that text corpus object.
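A small sketch of the m×n phrase word frequency proportion matrix, in which each entry is the frequency of a phrase in one corpus object divided by the total word frequency of that object. The function name and the input layout are hypothetical; a plain list of lists stands in for the matrix.

```python
def phrase_proportion_matrix(doc_phrase_counts, doc_word_totals, phrases):
    """Build the matrix row by row (rows: corpus objects, columns: phrases).

    doc_phrase_counts: one dict per corpus object mapping phrase -> frequency.
    doc_word_totals: total word frequency of each corpus object.
    """
    return [
        [counts.get(p, 0) / total if total else 0.0 for p in phrases]
        for counts, total in zip(doc_phrase_counts, doc_word_totals)
    ]


matrix = phrase_proportion_matrix(
    [{"gene therapy": 2}, {"gene therapy": 1, "stem cell": 3}],
    [10, 20],
    ["gene therapy", "stem cell"],
)
# Row 0 is about [0.2, 0.0]; row 1 is about [0.05, 0.15].
```

High-frequency phrases within a single corpus object can then be screened by applying a threshold to each row.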
And screening out high-frequency phrases in each text corpus object through the constructed phrase word frequency duty ratio matrix.
Then, for the screened phrases, the IDF of each phrase over the articles is calculated to form an IDF matrix, and a threshold Ia is set to obtain the phrases satisfying IDF <= Ia. The threshold Ia is set in order to avoid noise from phrases that occur at high frequency in only a very small number of articles. Specifically, the IDF matrix is constructed to provide a filtering condition for the high-frequency phrases: because the high-frequency phrase matrix is built per article and per Ngram phrase, the IDF filter prevents phrases that are frequent in only a very small amount of text from affecting the result.
IDF denotes the inverse document frequency. If the total number of texts is D and the number of texts containing phrase a is S, then, in the standard form: $\mathrm{IDF}(a) = \log \frac{D}{S}$.
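The IDF filter can be sketched as follows, assuming the common form IDF(a) = ln(D / S); the logarithm base, the absence of smoothing and the function names are assumptions of this sketch rather than details confirmed by the text.

```python
import math


def idf(total_docs, docs_with_phrase):
    """Inverse document frequency, assuming the common form ln(D / S)."""
    return math.log(total_docs / docs_with_phrase)


def filter_by_idf(doc_phrase_sets, candidates, max_idf):
    """Keep candidate phrases whose IDF does not exceed the threshold.

    doc_phrase_sets: one set of phrases per text; max_idf plays the role of Ia.
    """
    d = len(doc_phrase_sets)
    kept = set()
    for phrase in candidates:
        s = sum(1 for phrases in doc_phrase_sets if phrase in phrases)
        if s and idf(d, s) <= max_idf:
            kept.add(phrase)
    return kept


texts = [{"a b"}, {"a b", "c d"}, {"a b"}, {"a b"}]
print(filter_by_idf(texts, {"a b", "c d"}, max_idf=1.0))  # {'a b'}
```

In this toy input "c d" appears in only one of four texts, so its IDF (ln 4, about 1.39) exceeds the threshold and it is dropped as noise.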
And calculating a TFIDF matrix of the phrase aiming at the obtained phrase to obtain a TFIDF sum of each phrase, setting a score threshold value, and reserving the phrase with the TFIDF sum higher than the threshold value, so as to obtain a subdivision domain specific phrase which does not meet the first high-frequency condition relative to the total corpus and meets the second high-frequency condition relative to the English text corpus object in each Ngram phrase, such as a subdivision domain specific phrase which does not occur frequently in the total text but occurs frequently in subdivision domain articles.
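A hedged sketch of the TFIDF-sum screening: here TF is taken to be the phrase word frequency proportion of each corpus object, IDF is assumed to be ln(D / S), and the per-phrase score is the sum of TF × IDF over all corpus objects. These choices and the function names are assumptions for illustration only.

```python
import math


def tfidf_sums(proportions, phrases):
    """Sum TF * IDF over all corpus objects for each phrase.

    proportions: matrix (list of rows) of phrase word frequency proportions.
    """
    d = len(proportions)
    sums = {}
    for j, phrase in enumerate(phrases):
        s = sum(1 for row in proportions if row[j] > 0)
        idf = math.log(d / s) if s else 0.0  # assumed form ln(D / S)
        sums[phrase] = sum(row[j] * idf for row in proportions)
    return sums


def specific_phrases(sums, threshold):
    """Keep phrases whose TFIDF sum is higher than the score threshold."""
    return {p for p, v in sums.items() if v > threshold}


sums = tfidf_sums([[0.2, 0.0], [0.2, 0.1]], ["x", "y"])
print(specific_phrases(sums, threshold=0.05))  # {'y'}
```

Phrase "x" occurs in every text, so its IDF and therefore its sum are zero, while "y" is concentrated in one text and survives the threshold.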
With step 103 alone, such phrases are likely to be lost: the larger the total amount of text, the more easily subdivision domain-specific phrases are dropped. The embodiments of the present application mine and retain such phrases through step 104. Finally, the mined total domain high-frequency phrases and subdivision domain-specific phrases are combined to form the phrase set required for subsequent word segmentation, for example a phrase list or a term phrase library.
Optionally, after the total domain high-frequency phrase and the subdivision domain specific phrase are obtained through excavation, the excavation result can be submitted to manual review, and a preliminary phrase set is formed by combining the manual review, for example, a preliminary phrase list is formed, so that support is provided for subsequent English text word segmentation.
Wherein, the phrases in the phrase set are original parts of speech.
Optionally, referring to fig. 2, the construction process of the replacement vocabulary includes:
Step 201, determining a first sub-similarity between different words or phrases in a total corpus formed by English text corpus objects according to word vectors corresponding to the different words or phrases in the total corpus.
Specifically, but without limitation, Word2Vec word vectors of the words or phrases in the total corpus are used to calculate the similarity between different words/phrases in the total corpus. For example, the Word2Vec word vectors are used to calculate an N×N similarity matrix between the words/phrases in the total corpus (each element of the matrix represents the similarity between the two corresponding words/phrases), and the values of this matrix serve as the first sub-similarity between different words or phrases.
Step 202, determining a second sub-similarity between different words or phrases in the total corpus according to the combination characteristics of letters Ngram corresponding to the different words or phrases in the total corpus.
Specifically, a similarity matrix based on the letter arrangement of words/phrases (letter Ngram combination features) can be calculated for the total corpus (this calculation algorithm is provisionally named WordLetter). Each element of the matrix represents the similarity between the two corresponding words/phrases and is used as the second sub-similarity between them.
Taking vaccination and vaccine as examples, the second sub-similarity is calculated in the following way:
The letter Ngram combination features of the two words are determined respectively. Taking N = 2, the following are obtained:
Feature of vaccination = ['va', 'ac', 'cc', 'ci', 'in', 'na', 'at', 'ti', 'io', 'on']
Feature of vaccine = ['va', 'ac', 'cc', 'ci', 'in', 'ne']
Counting the overall feature arrangement:
['io', 'na', 'cc', 'at', 'ti', 'ne', 'in', 'on', 'ac', 'va', 'ci']
Feature vector digitization:
Feature of vaccination = [1 1 1 1 1 0 1 1 1 1 1]
Feature of vaccine = [0 0 1 0 0 1 1 0 1 1 1]
Calculating the similarity between the two words as a second sub-similarity between the two words:
where v1 and v2 respectively denote the numeric feature vectors corresponding to vaccination and vaccine, α and β denote hyperparameters, and n denotes the number of features common to vaccination and vaccine.
With this calculation, the similarity between the two words, used as the second sub-similarity, is: sim = 0.8227486121839513 (α = 1, β = 5).
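The feature construction in the worked example above can be sketched as follows. This is an illustration only: the helper names are hypothetical, the shared feature list is sorted alphabetically (so its order differs from the order shown above), and plain cosine similarity is used purely as a stand-in comparison. It is not the formula with α, β and n of the present application, which is not reproduced in this text, and it therefore yields a different value from the sim quoted above.

```python
import math

def letter_ngrams(word, n=2):
    # Consecutive letter n-grams, e.g. "vaccine" -> va, ac, cc, ci, in, ne
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def binary_vectors(word_a, word_b, n=2):
    grams_a = set(letter_ngrams(word_a, n))
    grams_b = set(letter_ngrams(word_b, n))
    vocab = sorted(grams_a | grams_b)  # overall feature arrangement (sorted here)
    v1 = [1 if g in grams_a else 0 for g in vocab]
    v2 = [1 if g in grams_b else 0 for g in vocab]
    return vocab, v1, v2

def cosine(v1, v2):
    # Stand-in similarity for illustration; not the application's own formula
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

vocab, v1, v2 = binary_vectors("vaccination", "vaccine")
shared = sum(a & b for a, b in zip(v1, v2))  # number of common features (n in the text)
stand_in_similarity = cosine(v1, v2)
```

Because the features depend only on spelling, such vectors can be computed offline for any vocabulary, independently of corpus size.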
It should be noted that, in this application, step 202 is provided together with step 201 mainly for the following reasons:
The Word2Vec algorithm can calculate the distance between two words, but its result depends on the size of the corpus. The WordLetter algorithm calculates similarity from the letter arrangement of words or phrases, does not depend on the size of the corpus, and can mine similar words or phrases using results computed offline in advance. In the embodiments of the present application, step 202 is therefore provided together with step 201, so that the similarity matrices obtained from Word2Vec and WordLetter are combined in the subsequent steps, with each element taking the maximum of the two values, so that more potentially similar words or phrases are retained.
Step 203, determining the similarity between different words or phrases in the total corpus according to the corresponding first sub-similarity and second sub-similarity.
After the first sub-similarity and the second sub-similarity between different words/phrases in the total corpus are obtained, optionally, the larger of the first sub-similarity and the second sub-similarity between two words/phrases is used as the similarity between those words/phrases.
For example, the N×N similarity matrix between words/phrases in the total corpus calculated from Word2Vec word vectors is combined with the similarity matrix based on the letter arrangement of words/phrases (letter Ngram combination features): for every pair of elements at corresponding positions (that is, the elements of the two matrices that correspond to the same word/phrase pair), the larger value is taken as the similarity of the word/phrase pair at that position, so as to retain more potentially similar words or phrases. Denoting the first and second sub-similarities of a word/phrase pair (a, b) by sim_1 and sim_2:
$\mathrm{sim}(a, b) = \max\big(\mathrm{sim}_1(a, b),\ \mathrm{sim}_2(a, b)\big)$
Step 204, extracting words or phrases meeting the similarity condition from the total corpus according to the similarity between different words or phrases in the total corpus, so as to form a replacement word set based on the extracted words or phrases.
After the similarity between different words/phrases in the total corpus is obtained, the words/phrases reaching the similarity threshold in the total corpus can be extracted according to the set similarity threshold to form a replacement word set.
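Steps 203 and 204 can be sketched together as an element-wise maximum followed by thresholding. The function names and the matrix values below are invented for illustration (they are not results from the application), and the matrices are plain nested lists to keep the sketch dependency-free.

```python
def combine_similarities(sim_w2v, sim_letters):
    # Element-wise maximum of two equally sized similarity matrices
    return [[max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(sim_w2v, sim_letters)]

def similar_pairs(words, sim, threshold):
    # Collect each unordered pair once (upper triangle) if it reaches the threshold
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if sim[i][j] >= threshold:
                pairs.append((words[i], words[j], sim[i][j]))
    return pairs

words = ["vaccine", "vaccination", "algorithm"]
# Illustrative numbers only
sim_w2v = [[1.0, 0.55, 0.10], [0.55, 1.0, 0.12], [0.10, 0.12, 1.0]]
sim_letters = [[1.0, 0.82, 0.05], [0.82, 1.0, 0.20], [0.05, 0.20, 1.0]]

combined = combine_similarities(sim_w2v, sim_letters)
candidates = similar_pairs(words, combined, threshold=0.8)
```

The resulting candidate pairs would then be the input to the optional manual review before a replacement word list is formed.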
Optionally, after the words/phrases reaching the similarity threshold are extracted from the total corpus, the extraction result can be manually reviewed and pruned, and a preliminary replacement word set, for example a preliminary replacement word list, is formed after the review is completed, so as to support subsequent English text word segmentation.
The replacement words in the replacement word set are in their original part of speech, that is, their base (uninflected) form.
In the embodiments of the present application, the phrase set is formed by phrase mining that relies on the context of words in the English text corpus, and the replacement word set is constructed by identifying the similarity between different words in the English text corpus. The phrase set and the replacement word set are thus constructed by automatic mining (optionally combined with manual review), which effectively reduces labor cost, improves efficiency, allows the information sets required for word segmentation, such as the phrase set and the replacement word set, to be generated efficiently and in real time by computer, and ensures the timeliness of these information sets.
Referring to fig. 3, the construction process of the stop word set includes:
Step 301, determining, for the words in the total English text corpus, the proportion of the number of English text corpus objects to which a word belongs in the total number of English text corpus objects contained in the total English text corpus.
Specifically, the text content of the total English text corpus is first roughly segmented, for example directly by spaces without term phrase merging or word replacement, to obtain each word/token in the total text. Then, for each token, the number of text corpus objects in which it appears is counted, and the proportion of that number in the total number of text corpus objects contained in the total corpus is calculated. For example, the number of articles in which each token appears is counted, and the proportion ra of that number in the total number of articles is calculated.
Step 302, determining the word frequency of the words in the total corpus in the text corpus objects to which they belong, and determining the stop word scores corresponding to the words in the total corpus according to the corresponding proportion and word frequency.
Optionally, each word whose proportion of the number of the text corpus objects to the total number of the text corpus objects contained in the total corpus reaches a preset proportion threshold can be screened out, and the word frequency of the screened word in the text corpus objects can be determined.
For example, a proportion threshold r is set, the words with a proportion ra >= r are screened out to form a word sequence, and the word frequency of each word of the word sequence in each article is counted one by one to obtain a word-article word frequency matrix.
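The screening by document proportion and the construction of the word-article frequency matrix can be sketched as below. The function name and the lower-casing step are assumptions of this sketch; rough segmentation is done by whitespace, as in the example given for step 301.

```python
from collections import Counter

def screen_by_document_ratio(articles, ratio_threshold):
    # Rough segmentation by whitespace (lower-casing is an added assumption)
    counts = [Counter(article.lower().split()) for article in articles]
    total = len(articles)
    # Number of articles in which each word appears
    doc_freq = Counter(word for c in counts for word in c)
    ratios = {word: n / total for word, n in doc_freq.items()}  # ra per word
    word_sequence = sorted(w for w, ra in ratios.items() if ra >= ratio_threshold)
    # Word-article matrix: one row of per-article frequencies for each screened word
    matrix = {w: [c[w] for c in counts] for w in word_sequence}
    return ratios, matrix

articles = ["the vaccine works", "the algorithm is fast", "the the trial"]
ratios, matrix = screen_by_document_ratio(articles, ratio_threshold=0.5)
```

Here only "the" reaches the proportion threshold, so the matrix has a single row holding its frequency in each of the three articles.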
Thereafter, the stop word score corresponding to each word in the word sequence is determined according to the proportion and the word frequency corresponding to the word. By way of example and not limitation, the stop word score corresponding to a word is calculated according to the following formula:
where P_i denotes the word frequency of word i in the document.
This formula is designed based primarily on the following considerations:
If the word frequency distribution of a word tends to be uniform across documents, for example "the" appears with high frequency in a large number of documents, the word is more likely to be a meaningless word. The larger the value of Σ(P_i log(P_i)), the more uniform the distribution. If a word appears throughout the whole text, its weight in distinguishing the topics of individual articles is reduced and it carries less meaning, so it is likely to be a stop word; for example, the word "the" is likely to appear in more than 90% of the texts, while the word "algorithm" is not. The larger 1/IDF is, the more widely the word occurs and the more likely it is to be a stop word. The two quantities are therefore multiplied, and the result is used as the stop word weight.
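The design considerations above can be illustrated with the following sketch, which multiplies a Σ P_i log(P_i) term by 1/IDF. Several points are assumptions of the sketch and not statements of the application's formula: P_i is taken as the share of the word's occurrences falling in article i, the sum is kept with its natural (non-positive) sign so that lower scores indicate more stop-word-like behaviour, and the IDF is smoothed as log((D + 1) / S) to avoid division by zero for a word present in every article.

```python
import math

def stop_word_score(freqs_per_article):
    """freqs_per_article: frequency of one word in every article of the corpus."""
    total_docs = len(freqs_per_article)
    containing = sum(1 for f in freqs_per_article if f > 0)
    if containing == 0:
        return 0.0
    total = sum(freqs_per_article)
    # Assumed normalization: P_i = share of the word's occurrences in article i
    probs = [f / total for f in freqs_per_article if f > 0]
    # Non-positive; more negative when the distribution is more uniform
    sum_p_log_p = sum(p * math.log(p) for p in probs)
    # Assumed smoothed IDF so that a word found in every article does not give IDF = 0
    idf = math.log((total_docs + 1) / containing)
    return sum_p_log_p * (1.0 / idf)

score_the = stop_word_score([3, 3, 3, 3])        # evenly spread over all articles
score_algorithm = stop_word_score([0, 7, 0, 0])  # concentrated in one article
```

Under these assumptions an evenly distributed, ubiquitous word receives a strongly negative score and a concentrated word a score near zero, which is compatible with selecting stop words as those whose score falls below a threshold.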
Step 303, identifying the words whose corresponding stop word scores meet the scoring condition as stop words, to obtain the stop word set.
Alternatively, a stop word score threshold may be specifically set, and the words with the corresponding stop word scores less than the threshold are counted as words meeting the scoring condition, and identified as stop words, and a stop word set is formed, for example, a stop word list is formed. Therefore, the real-time generation of the stop word set in the computer can be realized, and the timeliness of the stop word dictionary is ensured.
The constructed reserved word set contains words and/or phrases for remaining in the word segmentation result in the word segmentation process.
The reserved word set is formulated by a user, can be updated according to the word segmentation result as required, and can be repeatedly supplemented and modified according to the word segmentation result, so that the interactivity between the user and the machine in the processing flow is further improved.
Based on the constructed information sets, such as the phrase set, the replacement word set, the stop word set and the reserved word set, and referring to the flowchart of an English text word segmentation method shown in fig. 4, the English text word segmentation method provided in the present application at least includes:
Step 401, obtaining a target text object to be processed, wherein the target text object is an English text object.
Specifically, for example and without limitation, an English article to be segmented is obtained.
Step 402, determining, according to a pre-constructed phrase set, the word groups in the target text object that form phrases, and merging the words contained in the same phrase; the phrase set is a set of phrases obtained by phrase mining that relies at least on the context of words in an English text corpus.
After the target text object, such as an English article to be segmented, is obtained, the constructed phrase set can be matched against the target text object to find each group of words that forms a phrase, and the words contained in the same phrase are merged.
When the words contained in the same phrase are merged, a connector, for example the underscore "_", can be used to join them. For example, the words "covid_19 vaccine" of one phrase are merged into "covid_19_vaccine", so that the words of a merged phrase (such as a term phrase) are not split apart in the subsequent word segmentation operation.
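The merging operation can be sketched with regular expressions as follows. The function name `merge_phrases`, the case-insensitive matching and the longest-phrase-first ordering are choices made for this illustration rather than requirements stated in the application.

```python
import re

def merge_phrases(text, phrase_set):
    # Process phrases with more words first so a longer phrase wins over a phrase it contains
    for phrase in sorted(phrase_set, key=lambda p: len(p.split()), reverse=True):
        words = phrase.split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        merged = "_".join(words)
        # A function replacement avoids backslash-escape processing of the replacement text
        text = re.sub(pattern, lambda m, merged=merged: merged, text, flags=re.IGNORECASE)
    return text

merged_text = merge_phrases("The covid_19 vaccine trial began", {"covid_19 vaccine"})
nested_text = merge_phrases(
    "a randomized clinical trial",
    {"clinical trial", "randomized clinical trial"},
)
```

After merging, splitting on whitespace keeps "covid_19_vaccine" as a single token.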
Step 403, performing word replacement processing on words in the target text object according to the pre-constructed replacement word set; the set of replacement words is a set of replacement words determined at least by identifying similarity between different words in the english text corpus.
Alternatively, the replaced word and the replacement word may be set for the replacement word set based on the requirement of normalization or the like. The replaced words and the replacement words may be words and/or phrases, without limitation.
In this step, the pre-constructed replacement word set, such as a replacement word library, is further used to perform word replacement on the target text object to achieve the required content replacement. The word replacement may be performed on the target text object after the merging processing, or on the target text object obtained in step 401, without limitation. If the replacement content is a phrase, the words contained in the phrase are joined with a connector such as "_" so that the phrase is represented as an independent unit and is not split in the subsequent word segmentation operation.
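A minimal sketch of the replacement step is given below, assuming the replacement word set is held as a mapping from replaced word to replacement word. The function name, the mapping contents and the case-insensitive lookup are hypothetical.

```python
def replace_words(tokens, replacements):
    """tokens: list of words; replacements: mapping from replaced word to replacement."""
    result = []
    for token in tokens:
        target = replacements.get(token.lower(), token)
        # A multi-word replacement is joined with "_" so it is not split later
        result.append("_".join(target.split()))
    return result

replacements = {"covid19": "covid 19", "vaccines": "vaccine"}  # illustrative entries
replaced = replace_words(["covid19", "vaccines", "work"], replacements)
```

Tokens without an entry in the mapping pass through unchanged.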
Step 404, performing stop word filtering processing on the target text object according to the configured filtering conditions corresponding to the stop word set.
The stop words to be filtered, as well as the lengths, parts of speech and regular expressions used for filtering, can be configured in the filtering conditions, and the content of the target text object is filtered according to the configured stop words, lengths, parts of speech, regular expressions and the like. Specifically, the content filtering may be performed on the target text object after the merging processing and the word replacement processing, or on the target text object obtained in step 401, without limitation.
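A configurable filter of this kind can be sketched as below. The class and field names are hypothetical; part-of-speech filtering is omitted because it would require a tagger; and recording each removed token together with the rule that removed it is shown as one possible way of making the filtered content available for later audit.

```python
import re
from dataclasses import dataclass, field

@dataclass
class FilterConfig:
    stop_words: set = field(default_factory=set)
    reserved_words: set = field(default_factory=set)   # always kept
    min_length: int = 1
    drop_patterns: list = field(default_factory=list)  # regular expression strings

def filter_tokens(tokens, config):
    patterns = [re.compile(p) for p in config.drop_patterns]
    kept, filtered = [], []
    for token in tokens:
        if token in config.reserved_words:
            kept.append(token)  # reserved words bypass every filter
        elif token in config.stop_words:
            filtered.append((token, "stop_word"))
        elif len(token) < config.min_length:
            filtered.append((token, "length"))
        elif any(p.search(token) for p in patterns):
            filtered.append((token, "regex"))
        else:
            kept.append(token)
    return kept, filtered

config = FilterConfig(
    stop_words={"the", "is"},
    reserved_words={"is"},
    min_length=2,
    drop_patterns=[r"^\d+$"],
)
kept, filtered = filter_tokens(
    ["the", "covid_19_vaccine", "is", "a", "2023", "safe"], config
)
```

The `filtered` list is the record that a reviewer could inspect to decide whether a stop word, phrase, replacement or reserved word entry needs adjusting.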
After the filtering of the stop words is completed, the target text object can be further segmented through the following corresponding steps. Alternatively, in practical application, word segmentation may be performed first, and the word segmentation result may be filtered according to the filtering condition after the word segmentation is completed.
Step 405, performing word segmentation on the target text object after the merging processing, the word replacement processing and the stop word filtering processing, to obtain a word segmentation result.
In this step, word segmentation is performed on the target text object. Optionally, after the merging processing, the word replacement processing and the stop word filtering processing are completed, the processed target text object is segmented, for example by splitting its content on spaces and other separators into a corresponding series of words and phrases (such as term phrases), so as to support applications such as text feature extraction, text classification, text similarity calculation, text clustering and text topic modeling.
In practical application, as described above, word segmentation may be performed first, and the word segmentation result may be filtered according to the filtering condition after the word segmentation is completed.
Step 406, obtaining audit feedback information on the content filtered out by the stop word filtering processing, and adjusting the corresponding information set based on the audit feedback information.
Optionally, the content filtered out during word segmentation can be submitted for manual audit, the audit feedback information provided by the manual audit of the filtered content is obtained, and the corresponding information sets among the phrase set, the replacement word set, the stop word set and the reserved word set are adjusted as required based on the audit feedback information.
That is, this embodiment further provides an intervention channel for the word segmentation process, which supports user interaction with that process: the content filtered out during word segmentation is manually audited/reviewed, the stop word library, the term phrase library and the replacement vocabulary are optimized according to the audit result, and mandatory reserved words can be added as required, thereby improving the interactivity of the whole word segmentation process.
The phrase list, the stop word list, the replacement word list and the reserved word list can be optimized to the greatest extent through, but not limited to, machine assistance, manual intervention and multiple feedback iterations.
Referring to fig. 5, an exemplary application flow of the English text word segmentation method of the present application is provided.
The flow mainly includes automatically constructing information lists/word lists, such as a phrase list, a stop word list, a replacement word list and a reserved word list, according to the construction process of each information set provided by the present application. These lists are each one implementation form of the corresponding information set (the phrase set, the stop word set, the replacement word set and the reserved word set), and manual screening can be carried out according to actual requirements when they are constructed. Based on the constructed lists, the English text to be processed undergoes the corresponding processing, such as merging processing, general word filtering and word replacement processing; word segmentation is then performed on the processed text, after which the word segmentation result can be exported and the required applications, such as text feature extraction and text classification, can be developed on it. For a more detailed processing flow of this application example, please refer to fig. 5.
Based on the embodiment of the application, the word segmentation process and the filtering content can be made transparent, the interactivity between the user and the machine is greatly improved, and the accuracy of the word segmentation result can be greatly improved correspondingly.
In a word segmentation process, a plurality of filtering conditions are usually applied, such as stop words, parts of speech, lengths and regular expression matching. The prior art records only the word segmentation result obtained by the word segmentation operation and evaluates its quality with a series of technical indicators, while ignoring the content that was filtered out. In the present application, the whole word segmentation process and the filtered content are completely transparent, which greatly improves the interactivity between the user and the machine and significantly improves the accuracy of the word segmentation result.
According to the English text word segmentation method disclosed in the present application, a target text object (an English text object) to be processed is obtained, the word groups in the target text object that form phrases are determined according to a pre-constructed phrase set and the words contained in the same phrase are merged, word replacement is performed on words in the target text object according to a pre-constructed replacement word set, and the target text object after merging and word replacement is segmented, which effectively solves the word segmentation problem for English text. The phrase set is constructed at least by phrase mining based on the context of words in the English text corpus, and the replacement word set is constructed at least by identifying the similarity between different words in the English text corpus, so that the phrase set and the replacement word set can be mined automatically and built quickly and accurately, and efficient and accurate English text word segmentation can in turn be achieved on the basis of them.
Moreover, by constructing the stop word set and performing stop word filtering, stop words in the text can be mined automatically, which makes the stop words domain-specific, avoids word segmentation errors caused by traditional general-purpose stop words, and reduces the cost of manually maintaining the stop word set. In addition, the content filtered out by stop word filtering is recorded, which enables subsequent iterative tuning based on audit feedback on the filtered content, for example optimizing the stop word set, the phrase set, the replacement word set and the reserved word set. The whole word segmentation process and the filtered content are thus completely transparent, the interactivity between the user and the machine is greatly improved, and the accuracy of the word segmentation result is significantly improved.
In an optional embodiment, after the word segmentation result is obtained, the English text word segmentation method provided in the present application may further include the following processing:
converting each word and phrase in the word segmentation result into an original part of speech to obtain a corresponding original part-of-speech word; and performing merging processing on the obtained original part-of-speech words according to the phrase set, performing word replacement processing on the obtained original part-of-speech words according to the replacement word set, and updating the word segmentation result with the results of the merging processing and word replacement processing of the original part-of-speech words.
In this embodiment, each word and phrase in the word segmentation result is converted into its original part of speech (base form), phrase merging (e.g., term phrase merging) and word replacement are performed again on the converted data, and the word segmentation result obtained in step 405 is updated with the results of this merging and replacement, so as to avoid word segmentation errors caused by words that are not in their original part of speech, for example because of part-of-speech variation or singular/plural forms.
For example, suppose the original part-of-speech phrase covid_19_vaccination is known. The objects of operations such as merging are necessarily independent words/phrases in their original part of speech, so an inflected variant of covid_19 vaccination in the original text (for example a plural form) would not match. By converting each word/phrase into its original part of speech, the variant in the original text is replaced by covid_19 vaccination, and the condition for phrase merging is then satisfied. The same applies to word replacement.
This embodiment avoids word segmentation errors caused by words that are not in their original part of speech, for example because of part-of-speech variation or singular/plural forms, and thereby further improves the accuracy of the word segmentation result.
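The effect of converting to the base form before merging again can be sketched as follows. The lemma mapping here is a toy, hand-written dictionary used only for illustration (a real system would use a lemmatizer), and the function names, the `max_len` parameter and the plural example are assumptions of the sketch.

```python
def to_base_form(tokens, lemma_map):
    # Toy base-form conversion via a lookup table (stand-in for a real lemmatizer)
    return [lemma_map.get(token, token) for token in tokens]

def merge_known_phrases(tokens, phrase_set, max_len=3):
    """phrase_set holds underscore-joined base-form phrases, e.g. 'covid_19_vaccination'."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "_".join(tokens[i:i + size])
            if candidate in phrase_set:
                merged.append(candidate)
                i += size
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

phrase_set = {"covid_19_vaccination"}
lemma_map = {"vaccinations": "vaccination"}  # hypothetical entry
tokens = ["covid_19", "vaccinations", "started"]

without_conversion = merge_known_phrases(tokens, phrase_set)
with_conversion = merge_known_phrases(to_base_form(tokens, lemma_map), phrase_set)
```

Without the conversion the plural form does not match the phrase set and the tokens stay separate; after conversion the phrase is merged.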
Corresponding to the above method, the present application further provides an english text word segmentation device, where the device has a composition structure as shown in fig. 6, and includes:
The acquiring module 601 is configured to acquire a target text object to be processed, where the target text object is an english text object;
the merging processing module 602 is configured to determine, according to a pre-constructed phrase set, the word groups in the target text object that form phrases, and to perform merging processing on the words contained in the same phrase; the phrase set is a set of phrases obtained by phrase mining that relies at least on the context of words in an English text corpus;
a word replacement processing module 603, configured to perform word replacement processing on the words in the target text object according to a pre-constructed replacement word set; the replacement word set is a set of replacement words determined at least by identifying the similarity between different words in the English text corpus;
the stop word filtering processing module 604 is configured to perform stop word filtering processing on the target text object after the merging processing and the word replacement processing according to the configured filtering condition corresponding to the stop word set;
the word segmentation processing module 605 is configured to segment the target text object after the merging process, the word replacement process, and the stop word filtering process, to obtain a word segmentation result;
The adjustment processing module 606 is configured to obtain audit feedback information of the filtered content in the filtering process of the stop word, and perform adjustment processing on a corresponding information set based on the audit feedback information;
the device also comprises an information set construction module for constructing the phrase set, the replacement word set and the stop word set;
the information set construction module is specifically configured to, when constructing the phrase set:
acquiring English text corpus objects of each subdivision domain in a preset total domain;
performing rough word segmentation based on a preset rough word segmentation strategy on the total corpus formed by each English text corpus object, and performing Ngram combination on a rough word segmentation result according to a context order to obtain each Ngram phrase;
according to the occurrence frequency of each Ngram phrase in the total corpus, determining a total field high-frequency phrase which meets a first high-frequency condition relative to the total corpus in each Ngram phrase;
determining subdivision domain-specific phrases in each Ngram phrase which do not meet the first high-frequency condition relative to the total corpus and meet the second high-frequency condition relative to the belonging English text corpus object according to the occurrence frequency of each Ngram phrase in the total corpus and the occurrence frequency of each Ngram phrase in the belonging English text corpus object respectively, so as to form the phrase set based on the total domain high-frequency phrases and the subdivision domain-specific phrases; wherein, the phrases in the phrase set are original parts of speech;
The information set construction module is specifically configured to, when constructing the replacement word set:
determining a first sub-similarity between different words or phrases in the total corpus according to word vectors corresponding to the different words or phrases in the total corpus formed by each English text corpus object;
determining second sub-similarity among different words or phrases in the total corpus according to the combination characteristics of letters Ngram corresponding to the different words or phrases in the total corpus;
according to the corresponding first sub-similarity and second sub-similarity, determining the similarity between different words or phrases in the total corpus;
extracting words or phrases meeting similarity conditions from the total corpus according to the similarity between different words or phrases in the total corpus, so as to form the replacement word set based on the extracted words or phrases; wherein, the replacement words in the replacement word set are original word parts;
the information set construction module is specifically configured to, when constructing the stop word set:
determining, for the words in the total English text corpus, the proportion of the number of English text corpus objects to which a word belongs in the total number of English text corpus objects contained in the total English text corpus;
determining the word frequency of the words in the total English text corpus in the English text corpus objects to which they belong, and determining the stop word scores corresponding to the words in the total English text corpus according to the corresponding proportion and word frequency;
Identifying the words with the corresponding stop word scores meeting the score conditions as stop words, and obtaining the stop word set;
the adjustment processing module is specifically configured to, when obtaining audit feedback information on the content filtered out by the stop word filtering processing and adjusting a corresponding information set based on the audit feedback information:
obtaining audit feedback information provided by manually auditing the filtered content of the stop word filtering process;
adjusting at least one of the phrase set, the replacement word set, the stop word set and the reserved word set based on the audit feedback information; the reserved word set comprises words and/or phrases to be retained in the word segmentation result during the word segmentation process, is formulated by a user, and can be updated according to the word segmentation result as required.
In one embodiment, the apparatus further includes a part-of-speech processing module configured to: after the word segmentation result is obtained, convert each word and phrase in the word segmentation result into an original part of speech to obtain a corresponding original part-of-speech word; and perform merging processing on the obtained original part-of-speech words according to the phrase set, perform word replacement processing on the obtained original part-of-speech words according to the replacement word set, and update the word segmentation result with the results of the merging processing and word replacement processing of the original part-of-speech words.
Since the English text word segmentation apparatus provided in the embodiments of the present application corresponds to the English text word segmentation method provided in the above method embodiments, its description is relatively brief; for related details, refer to the description of the above method embodiments, which is not repeated here.
The present application also provides a computer readable medium having stored thereon a computer program comprising program code for performing the English text word segmentation method as provided by any of the method embodiments above.
In the context of this application, a computer-readable medium (machine-readable medium) can be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that the computer readable medium described in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal that propagates in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be embodied in an electronic device; or may exist alone without being assembled into an electronic device.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
For convenience of description, the above system or apparatus is described as being functionally divided into various modules or units. Of course, when the present application is implemented, the functions of the modules or units may be implemented in one or more pieces of software and/or hardware.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the present application, in essence or in the portions thereof that make the inventive contribution, may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and which includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or portions of the embodiments of the present application.
Finally, it is further noted that relational terms such as first, second, third, fourth, and the like are used herein only to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims (5)

1. An English text word segmentation method, characterized by comprising the following steps:
acquiring a target text object to be processed, wherein the target text object is an English text object;
according to a pre-constructed phrase set, determining words used for forming phrases in the target text object, and carrying out merging processing on words contained in the same phrase; the phrase set is a set of phrases obtained by carrying out phrase mining at least depending on the context relation of words in English text corpus;
according to a pre-constructed replacement word set, word replacement processing is carried out on words in the target text object; the replacement word set is a set of replacement words determined at least by identifying the similarity between different words in the English text corpus;
according to the configured filtering conditions corresponding to the stop word set, performing stop word filtering processing on the target text object;
performing word segmentation on the target text object subjected to the merging processing, the word replacement processing and the stop word filtering processing to obtain a word segmentation result;
obtaining audit feedback information on the content filtered out by the stop word filtering processing, and adjusting the corresponding information sets based on the audit feedback information;
the phrase set construction process comprises the following steps:
acquiring English text corpus objects of each subdivision domain in a preset total domain;
performing rough word segmentation based on a preset rough word segmentation strategy on the total corpus formed by each English text corpus object, and performing Ngram combination on a rough word segmentation result according to a context order to obtain each Ngram phrase;
according to the occurrence frequency of each Ngram phrase in the total corpus, determining, among the Ngram phrases, the total-domain high-frequency phrases that meet a first high-frequency condition relative to the total corpus;
according to the occurrence frequency of each Ngram phrase in the total corpus and its occurrence frequency in the English text corpus object to which it belongs, determining, among the Ngram phrases, the subdivision-domain-specific phrases that do not meet the first high-frequency condition relative to the total corpus but meet a second high-frequency condition relative to the English text corpus object to which they belong, so as to form the phrase set based on the total-domain high-frequency phrases and the subdivision-domain-specific phrases; wherein the phrases in the phrase set are original parts of speech;
the construction process of the replacement word set comprises the following steps:
determining a first sub-similarity between different words or phrases in the total corpus according to word vectors corresponding to the different words or phrases in the total corpus formed by each English text corpus object;
determining a second sub-similarity between different words or phrases in the total corpus according to the letter Ngram combination features corresponding to the different words or phrases in the total corpus;
according to the corresponding first sub-similarity and second sub-similarity, determining the similarity between different words or phrases in the total corpus;
extracting words or phrases meeting similarity conditions from the total corpus according to the similarity between different words or phrases in the total corpus, so as to form the replacement word set based on the extracted words or phrases; wherein, the replacement words in the replacement word set are original word parts;
the construction process of the stop word set comprises the following steps:
determining, for each word in the total English text corpus, the ratio of the number of English text corpus objects to which the word belongs to the total number of English text corpus objects contained in the total English text corpus;
determining the word frequency of each word of the total English text corpus in the English text corpus object to which it belongs, and determining the stop word score corresponding to each word in the total English text corpus according to the corresponding ratio and word frequency;
identifying the words with the corresponding stop word scores meeting the score conditions as stop words, and obtaining the stop word set;
the obtaining of audit feedback information on the content filtered out by the stop word filtering processing, and the adjusting of the corresponding information sets based on the audit feedback information, comprise the following steps:
obtaining audit feedback information provided by manually auditing the content filtered out by the stop word filtering processing;
adjusting at least one of the phrase set, the replacement word set, the stop word set, and the reserved word set based on the audit feedback information; the reserved word set comprises words and/or phrases to be retained in the word segmentation result during word segmentation, is defined by a user, and can be updated as required according to the word segmentation result.
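By way of non-limiting illustration, the corpus statistics behind the phrase set and the stop word set of claim 1 might be sketched as follows. This is only one possible reading: the thresholds `total_min` and `domain_min` are hypothetical stand-ins for the first and second high-frequency conditions, the rough segmentation is assumed to be lowercasing plus whitespace splitting, and the stop word score is assumed to be the product of the corpus-object ratio and the mean relative word frequency. None of these specifics are stated in the claims.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return contiguous n-grams of `tokens`, joined in context order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mine_phrases(domain_corpora, n=2, total_min=4, domain_min=2):
    """Split candidate n-grams into total-domain and domain-specific phrases.

    `domain_corpora` maps a subdivision domain name to its list of texts.
    `total_min` / `domain_min` are assumed thresholds, not values from the patent.
    """
    total = Counter()
    per_domain = {}
    for domain, texts in domain_corpora.items():
        counts = Counter()
        for text in texts:
            # Assumed rough segmentation: lowercase, split on whitespace.
            counts.update(ngrams(text.lower().split(), n))
        per_domain[domain] = counts
        total.update(counts)
    # First condition: frequent in the total corpus.
    general = {p for p, c in total.items() if c >= total_min}
    # Second condition: not frequent overall, but frequent in its own domain.
    specific = {
        p
        for counts in per_domain.values()
        for p, c in counts.items()
        if p not in general and c >= domain_min
    }
    return general, specific


def stop_word_scores(domain_corpora):
    """Score = (share of corpus objects containing the word) * (mean relative frequency)."""
    word_counts = {
        domain: Counter(w for text in texts for w in text.lower().split())
        for domain, texts in domain_corpora.items()
    }
    n_objects = len(word_counts)
    scores = {}
    for word in set().union(*word_counts.values()):
        containing = [c for c in word_counts.values() if word in c]
        ratio = len(containing) / n_objects
        mean_tf = sum(c[word] / sum(c.values()) for c in containing) / len(containing)
        scores[word] = ratio * mean_tf
    return scores


corpora = {
    "cardiology": ["clinical trial of heart failure", "heart failure clinical trial"],
    "oncology": ["clinical trial of tumor marker", "tumor marker clinical trial"],
}
general, specific = mine_phrases(corpora)
# general == {"clinical trial"}; specific == {"heart failure", "tumor marker"}
```

Words whose score exceeds a chosen cut-off would then be treated as stop word candidates; the cut-off itself (the "score condition") is left open here, as it is in the claim.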
2. The English text word segmentation method according to claim 1, further comprising, after obtaining the word segmentation result:
converting each word and phrase in the word segmentation result into an original part of speech to obtain a corresponding original part of speech word;
and carrying out merging processing on the obtained original part-of-speech words according to the phrase set, carrying out word replacement processing on the obtained original part-of-speech words according to the replacement word set, and updating the word segmentation result by utilizing the results of the merging processing and the word replacement processing of the original part-of-speech words.
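As a hedged illustration of claim 2, the sketch below converts each token of a segmentation result to a base form and then re-applies phrase merging and word replacement. It assumes that "original part of speech" means the base (lemma) form of a word; `lemma_table` is a toy stand-in for a real lemmatizer, and the greedy longest-match merge and the helper names are assumptions made for the example only.

```python
def merge_phrases(tokens, phrase_set, max_len=3):
    """Greedy longest-match merge: the words of a known phrase become one token."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrase_set:
                out.append(candidate)
                i += n
                break
        else:
            # No phrase starts at position i; keep the single word.
            out.append(tokens[i])
            i += 1
    return out


def replace_words(tokens, replacements):
    """Map each token to its replacement word, if one is configured."""
    return [replacements.get(t, t) for t in tokens]


def refine(segmented, phrase_set, replacements, lemma_table):
    """Convert tokens to base form, then re-merge phrases and re-apply replacements."""
    base_words = []
    for token in segmented:
        # A token may already be a merged phrase, so lemmatize word by word.
        base_words.extend(lemma_table.get(w, w) for w in token.split())
    return replace_words(merge_phrases(base_words, phrase_set), replacements)


lemma_table = {"tumours": "tumour", "failures": "failure"}  # hypothetical lemmas
result = refine(
    ["tumours", "and", "heart", "failures"],
    phrase_set={"heart failure"},
    replacements={"tumour": "tumor"},
    lemma_table=lemma_table,
)
# result == ["tumor", "and", "heart failure"]
```

The second pass matters because the phrase set and the replacement word set are stated to hold original part-of-speech entries, so inflected forms such as "heart failures" only match after conversion.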
3. An English text word segmentation apparatus, comprising:
the acquisition module is used for acquiring a target text object to be processed, wherein the target text object is an English text object;
the merging processing module is used for determining the words used for forming phrases in the target text object according to the phrase set constructed in advance and merging the words contained in the same phrase; the phrase set is a set of phrases obtained by carrying out phrase mining at least depending on the context relation of words in English text corpus;
the word replacement processing module is used for carrying out word replacement processing on the words in the target text object according to a pre-constructed replacement word set; the replacement word set is a set of replacement words determined at least by identifying the similarity between different words in the English text corpus;
the stop word filtering processing module is used for carrying out stop word filtering processing on the target text object according to the configured filtering conditions corresponding to the stop word set;
the word segmentation processing module is used for carrying out word segmentation on the target text object subjected to the merging processing, the word replacement processing and the stop word filtering processing to obtain a word segmentation result;
the adjustment processing module is used for acquiring audit feedback information on the content filtered out by the stop word filtering processing and adjusting the corresponding information set based on the audit feedback information;
the device also comprises an information set construction module for constructing the phrase set, the replacement word set and the stop word set;
the information set construction module is specifically configured to, when constructing the phrase set:
acquiring English text corpus objects of each subdivision domain in a preset total domain;
performing rough word segmentation based on a preset rough word segmentation strategy on the total corpus formed by each English text corpus object, and performing Ngram combination on a rough word segmentation result according to a context order to obtain each Ngram phrase;
according to the occurrence frequency of each Ngram phrase in the total corpus, determining, among the Ngram phrases, the total-domain high-frequency phrases that meet a first high-frequency condition relative to the total corpus;
according to the occurrence frequency of each Ngram phrase in the total corpus and its occurrence frequency in the English text corpus object to which it belongs, determining, among the Ngram phrases, the subdivision-domain-specific phrases that do not meet the first high-frequency condition relative to the total corpus but meet a second high-frequency condition relative to the English text corpus object to which they belong, so as to form the phrase set based on the total-domain high-frequency phrases and the subdivision-domain-specific phrases; wherein the phrases in the phrase set are original parts of speech;
the information set construction module is specifically configured to, when constructing the replacement word set:
determining a first sub-similarity between different words or phrases in the total corpus according to word vectors corresponding to the different words or phrases in the total corpus formed by each English text corpus object;
determining a second sub-similarity between different words or phrases in the total corpus according to the letter Ngram combination features corresponding to the different words or phrases in the total corpus;
according to the corresponding first sub-similarity and second sub-similarity, determining the similarity between different words or phrases in the total corpus;
extracting words or phrases meeting similarity conditions from the total corpus according to the similarity between different words or phrases in the total corpus, so as to form the replacement word set based on the extracted words or phrases; wherein, the replacement words in the replacement word set are original word parts;
the information set construction module is specifically configured to, when constructing the stop word set:
determining, for each word in the total English text corpus, the ratio of the number of English text corpus objects to which the word belongs to the total number of English text corpus objects contained in the total English text corpus;
determining the word frequency of each word of the total English text corpus in the English text corpus object to which it belongs, and determining the stop word score corresponding to each word in the total English text corpus according to the corresponding ratio and word frequency;
identifying the words with the corresponding stop word scores meeting the score conditions as stop words, and obtaining the stop word set;
the adjustment processing module is specifically configured to, when obtaining audit feedback information on the content filtered out by the stop word filtering processing and adjusting the corresponding information set based on the audit feedback information:
obtaining audit feedback information provided by manually auditing the filtered content of the stop word filtering process;
adjusting at least one of the phrase set, the replacement word set, the stop word set, and the reserved word set based on the audit feedback information; the reserved word set comprises words and/or phrases to be retained in the word segmentation result during word segmentation, is defined by a user, and can be updated as required according to the word segmentation result.
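For illustration only, the replacement word set construction recited in claims 1 and 3 could combine the two sub-similarities roughly as below. The sketch assumes cosine similarity over word vectors for the first sub-similarity, Jaccard overlap of padded letter bigrams for the second, and an equal-weight linear combination (`alpha=0.5`) with a hypothetical threshold of 0.75. The claims do not fix any of these choices, and the tiny vectors are invented example data.

```python
import math


def cosine(u, v):
    """First sub-similarity (assumed): cosine of two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def letter_ngrams(word, n=2):
    """Letter n-grams of a word, padded with '#' to mark its boundaries."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def letter_similarity(a, b, n=2):
    """Second sub-similarity (assumed): Jaccard overlap of letter n-grams."""
    ga, gb = letter_ngrams(a, n), letter_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def combined_similarity(a, b, vectors, alpha=0.5):
    """Weighted sum of the two sub-similarities; alpha is a hypothetical weight."""
    return alpha * cosine(vectors[a], vectors[b]) + (1 - alpha) * letter_similarity(a, b)


def similar_pairs(vectors, threshold=0.75):
    """Return word pairs whose combined similarity meets the assumed condition."""
    words = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(words)
        for b in words[i + 1:]
        if combined_similarity(a, b, vectors) >= threshold
    ]


# Invented toy vectors; real ones would come from a trained embedding model.
vectors = {"tumor": [1.0, 0.0], "tumour": [0.9, 0.1], "stent": [0.0, 1.0]}
pairs = similar_pairs(vectors)
# pairs == [("tumor", "tumour")]
```

Using both signals is what lets spelling variants with near-identical contexts pass the condition, while words that are merely distributionally close but spelled differently are weighted down.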
4. The English text word segmentation device of claim 3, further comprising:
the original part-of-speech processing module is used for: after the word segmentation result is obtained, each word and phrase in the word segmentation result are converted into an original part of speech, and a corresponding original part of speech word is obtained; and carrying out combination processing on the obtained original part-of-speech word according to the phrase set, carrying out word replacement processing on the obtained original part-of-speech word according to the replacement word set, and updating the word segmentation result by utilizing the combination processing and word replacement processing results of the original part-of-speech word.
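One way the modules of the claimed device might be wired together is sketched below, purely as an assumption-laden example. The class name `Segmenter`, the bigram-only merge, the whitespace tokenization, and the rule that reserved words are never filtered are all illustrative choices; the original part-of-speech module of claim 4 is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Segmenter:
    """Hypothetical container for the four information sets of the device."""
    phrase_set: set = field(default_factory=set)
    replacements: dict = field(default_factory=dict)
    stop_words: set = field(default_factory=set)
    reserved: set = field(default_factory=set)

    def _merge(self, tokens):
        """Merging module (simplified to two-word phrases)."""
        out, i = [], 0
        while i < len(tokens):
            pair = " ".join(tokens[i:i + 2])
            if i + 1 < len(tokens) and pair in self.phrase_set:
                out.append(pair)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    def segment(self, text):
        """Return (kept tokens, filtered tokens) so the filtered part can be audited."""
        tokens = self._merge(text.lower().split())
        tokens = [self.replacements.get(t, t) for t in tokens]
        kept, filtered = [], []
        for token in tokens:
            # Assumption: reserved words always survive stop word filtering.
            if token in self.stop_words and token not in self.reserved:
                filtered.append(token)
            else:
                kept.append(token)
        return kept, filtered

    def apply_feedback(self, wrongly_filtered):
        """Adjustment module: un-stop and reserve words a reviewer flagged."""
        for word in wrongly_filtered:
            self.stop_words.discard(word)
            self.reserved.add(word)


seg = Segmenter(phrase_set={"heart failure"}, stop_words={"the", "of", "a"})
kept, filtered = seg.segment("The risk of heart failure")
# kept == ["risk", "heart failure"]; filtered == ["the", "of"]
seg.apply_feedback(["a"])  # e.g. a reviewer notes that "a" in "vitamin a" was lost
```

Returning the filtered tokens alongside the result is a design choice of this sketch: it gives the manual audit step something concrete to review before any information set is adjusted.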
5. A computer readable medium, having stored thereon a computer program comprising program code for performing the English text word segmentation method according to any one of claims 1-2.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311292401.4A CN117034917B (en) 2023-10-08 2023-10-08 English text word segmentation method, device and computer readable medium

Publications (2)

Publication Number Publication Date
CN117034917A (en) 2023-11-10
CN117034917B (en) 2023-12-22

Family

ID=88602710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311292401.4A Active CN117034917B (en) 2023-10-08 2023-10-08 English text word segmentation method, device and computer readable medium

Country Status (1)

Country Link
CN (1) CN117034917B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426539A (en) * 2015-12-23 2016-03-23 成都电科心通捷信科技有限公司 Dictionary-based lucene Chinese word segmentation method
CN106066866A (en) * 2016-05-26 2016-11-02 同方知网(北京)技术有限公司 A kind of automatic abstracting method of english literature key phrase and system
CN108694167A (en) * 2018-04-11 2018-10-23 广州视源电子科技股份有限公司 Candidate word appraisal procedure, candidate word sort method and device
CN110059185A (en) * 2019-04-03 2019-07-26 天津科技大学 A kind of medical files specialized vocabulary automation mask method
CN111581952A (en) * 2020-05-20 2020-08-25 长沙理工大学 Large-scale replaceable word bank construction method for natural language information hiding
CN111814474A (en) * 2020-09-14 2020-10-23 智者四海(北京)技术有限公司 Domain phrase mining method and device
CN115587590A (en) * 2022-10-13 2023-01-10 北京金山数字娱乐科技有限公司 Training corpus construction method, translation model training method and translation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020422B (en) * 2018-11-26 2020-08-04 阿里巴巴集团控股有限公司 Feature word determining method and device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
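Comparing Neural- and N-Gram-Based Language Models for Word Segmentation; Yerai Doval et al.; Journal of the Association for Information Science and Technology; full text *
Short text semantic similarity measurement algorithm based on a hybrid machine learning model; Han Kaixu et al.; Journal of Jilin University (Science Edition); Vol. 61 (No. 4); full text *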

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant