CN111104801B - Text word segmentation method, system, equipment and medium based on website domain name

Info

Publication number
CN111104801B
Authority
CN
China
Prior art keywords
word
domain name
website domain
result
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911367979.5A
Other languages
Chinese (zh)
Other versions
CN111104801A (en)
Inventor
杜韬
李依谦
曲守宁
朱连江
王信堂
王希普
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan
Priority to CN201911367979.5A
Publication of CN111104801A
Application granted
Publication of CN111104801B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Abstract

The application discloses a text word segmentation method, system, device and medium based on website domain names. The method comprises: data acquisition, namely collecting a plurality of website domain names; performing word segmentation on each website domain name; performing text formatting on the segmented words; analyzing the part of speech of each word obtained after text formatting; performing morphological reduction (lemmatization) according to the part of speech; storing the lemmatized results in a word library; matching the website domain name to be segmented against the word library using a bidirectional maximum matching algorithm and, if the matching succeeds, obtaining a text vectorization result; if the matching fails, cleaning the website domain name to be segmented and matching the cleaned result against the word library again using the bidirectional maximum matching algorithm.

Description

Text word segmentation method, system, equipment and medium based on website domain name
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a method, a system, a device, and a medium for text word segmentation based on web site domain names.
Background
The statements in this section merely mention background art related to the present disclosure and do not necessarily constitute prior art. The present disclosure is premised on not tracking user behavior and not obtaining user privacy.
In recent years, the internet has become one of the most important infrastructures of human society and increasingly influences people's economic and social activities. For a user, jumps between websites can be regarded as a behavior track. Within the huge volume of browsing data these jumps generate, the website domain name is the most representative element: it carries the name and nature of the webpage the user browsed, and can reflect the user's preferences among websites and the relevance between the corresponding websites.
A website domain name consists mainly of English letters, Arabic numerals and special characters such as "_", "@" and "/", and is intended to make the addresses of a group of servers (website, email, FTP, and the like) easier to remember and communicate.
In the process of implementing the present disclosure, the inventor finds that the following technical problems exist in the prior art:
First: website domain names are extremely short, so existing word segmentation techniques cannot effectively extract keywords from them.
Second: website domain names are irregular, unstructured text, which makes it difficult to extract refined, understandable knowledge from them and to vectorize the text afterwards.
Third: when companies, organizations or individuals set up their own website domain names, they name them according to personal habit, so abbreviations, misspellings, mixed languages and the like often occur in domain names.
Fourth: web mining performed directly on existing website domain names has excessive time and space complexity and easily leads to the curse of dimensionality.
These problems can cause the data analyst to be unable to quickly obtain the property information of the web page from the web site domain name, thereby affecting the accuracy and efficiency of analyzing the user's surfing behavior.
Disclosure of Invention
To remedy the deficiencies of the prior art, the present disclosure provides a text word segmentation method, system, device and medium based on website domain names; the method can perform text analysis on any existing website domain name and extract the keywords in the website domain name with higher accuracy.
In a first aspect, the present disclosure provides a text word segmentation method based on a website domain name;
the text word segmentation method based on the website domain name comprises the following steps:
data acquisition, namely acquiring a plurality of website domain names; word segmentation processing is carried out on each website domain name;
carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process;
performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library;
matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm.
In a second aspect, the present disclosure further provides a text word segmentation system based on a website domain name;
a text word segmentation system based on web site domain name, comprising:
a data acquisition module configured to: collecting a plurality of website domain names; word segmentation processing is carried out on each website domain name;
a text formatting module configured to: carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process;
a lexical reduction module configured to: performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library;
a match output module configured to: matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
Compared with the prior art, the beneficial effects of the present disclosure are:
The method more quickly removes redundant domain names, meaningless identifiers and similar information that companies, organizations or individuals introduce when naming their websites; it corrects misspellings in domain names with higher accuracy; and, by combining the personalized word stock with the official dictionary, it segments the main information in a domain name more efficiently and in a more targeted way. It also provides reliable preparation for vectorizing website domain names in subsequent online behavior analysis. Where rules must be derived from the behavior tracks of a huge number of users, it improves on the traditional approach, in which records of user browsing behavior had to be loaded one by one and then manually classified according to webpage properties.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application.
FIG. 1 is a flow chart of a method of a first embodiment;
FIG. 2 is a random piece of original data after data acquisition according to the first embodiment;
FIG. 3 is a piece of data processed by the word segmentation technique for very short text based on website domain names according to the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiment 1 provides a text word segmentation method based on a website domain name;
As shown in FIG. 1, the text word segmentation method based on the website domain name comprises the following steps:
S1: data acquisition, namely acquiring a plurality of website domain names; word segmentation processing is carried out on each website domain name;
S2: carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process;
S3: performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library;
S4: matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm.
As one or more embodiments, in the step S1, data is collected, and a plurality of website domain names are collected; the method comprises the following specific steps:
A plurality of website domain names are collected, the set sensitive words are removed from each website domain name, and the website domain names with the sensitive words removed are stored by time unit in a data set S.
As one or more embodiments, after the step of collecting the plurality of website domain names, before the step of word segmentation processing is performed on each website domain name, the method further includes: a data preprocessing step; the data preprocessing step comprises the following steps:
S101: deleting the missing value or complementing the missing value of each website domain name in the data set S;
S102: extracting the website domain name to a column vector by taking the user as a unit.
It should be understood that after the step of collecting a plurality of website domain names, before the step of word segmentation processing is performed on each website domain name, the method further includes: a data preprocessing step; the data preprocessing step comprises the following steps:
Data preprocessing and denoising are performed on the data set S: if an attribute contains only a very small number of missing values, the records with missing values are deleted; if an attribute contains a moderate proportion of missing values, they can be filled in by mean interpolation over records of the same type.
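The delete-or-fill rule can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `group` attribute used as the "same type", and the `drop_ratio` threshold separating "very few" from "some" missing values are all hypothetical and do not come from the patent.

```python
from collections import defaultdict
from statistics import mean

def handle_missing(records, attr, group, drop_ratio=0.05):
    """Drop or fill records whose `attr` is None.

    drop_ratio is a hypothetical threshold: at or below it, records with a
    missing value are deleted; above it, each gap is filled with the mean
    of the observed values that share the same `group` value.
    """
    missing = sum(1 for r in records if r.get(attr) is None)
    if missing == 0:
        return list(records)
    if missing / len(records) <= drop_ratio:
        return [r for r in records if r.get(attr) is not None]
    observed = defaultdict(list)
    for r in records:
        if r.get(attr) is not None:
            observed[r[group]].append(r[attr])
    filled = []
    for r in records:
        # A gap stays unfilled when its group has no observed value at all.
        if r.get(attr) is None and observed[r[group]]:
            r = {**r, attr: mean(observed[r[group]])}
        filled.append(r)
    return filled
```

The input list is never mutated; a filled record is a shallow copy of the original one.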
In the text segmentation operation, the original data is as shown in FIG. 2 and contains information such as the server and the user terminal. To analyze user browsing behavior, the text must be split on certain marks, and the domain names of the browsed websites are extracted, with each user as a unit, into a column vector L1.
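A minimal sketch of the per-user extraction, assuming the raw log has already been reduced to hypothetical `(user_id, url)` pairs (the actual field layout of FIG. 2 is not reproduced here):

```python
from collections import defaultdict
from urllib.parse import urlsplit

def domains_per_user(log_rows):
    """Return {user_id: [domain, ...]}, keeping each user's visit order."""
    column = defaultdict(list)
    for user_id, url in log_rows:
        # urlsplit only fills .hostname when a scheme or a leading "//" is present.
        host = urlsplit(url if "//" in url else "//" + url).hostname
        if host:
            column[user_id].append(host)
    return dict(column)
```

Each value of the returned mapping plays the role of one user's slice of the column vector L1.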
As one or more embodiments, in the S1, word segmentation is performed on each website domain name; the method comprises the following specific steps:
Word segmentation is performed on each website domain name using the jieba word segmentation tool.
It should be understood that in the step S1, word segmentation is performed on each website domain name; the method comprises the following specific steps:
Efficient word graph scanning is implemented based on a Trie structure, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese and English text in a sentence; dynamic programming is used to search for the maximum-probability path and find the best segmentation combination based on word frequency. The website domain name column vector L1 is input into the jieba full-mode word segmentation model, symbols are eliminated, all character strings in each record that can be regarded as words are scanned out, and these strings are stored in a column vector L2.
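In jieba itself, full mode is selected with `jieba.cut(text, cut_all=True)`. The idea behind the word graph and the maximum-probability path can be sketched without the library as below. This is a simplified illustration, not jieba's implementation: a plain dictionary lookup stands in for the Trie, and the word-frequency table `freq` is a hypothetical toy input.

```python
import math

def build_dag(text, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in freq]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def full_mode(text, freq):
    """List every dictionary word found anywhere in the text."""
    dag = build_dag(text, freq)
    return [text[i:j] for i in range(len(text)) for j in dag[i] if text[i:j] in freq]

def best_path(text, freq):
    """Dynamic programming from right to left over the DAG:
    keep, for each index, the best log-probability and the next cut point."""
    dag = build_dag(text, freq)
    total = sum(freq.values())
    n = len(text)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(text[i:j], 1)) - math.log(total) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j])
        i = j
    return words
```

Full mode returns overlapping candidates, which is why the later matching step against the word stock is still needed to pick one segmentation.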
As one or more embodiments, in S2, text formatting is performed on the word after the word segmentation; the method comprises the following specific steps:
and carrying out text formatting processing on the word subjected to word segmentation processing, and deleting the mark symbol and the set useless character.
It should be understood that, in the step S2, text formatting is performed on the word after word segmentation; the method comprises the following specific steps:
A text formatting operation is performed on the column vector L2: mark symbols and useless characters are thoroughly deleted, each website domain name is recorded as a unit, and the word strings it contains are stored as sub-records in a data set S1.
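A minimal sketch of the formatting step. Which characters count as "useless" is not specified in the text; the rule used here (keep letters only, lowercase, drop tokens shorter than two letters) is an assumption made for illustration.

```python
import re

def format_tokens(tokens):
    """Strip non-letters from each token, lowercase it, and drop very short leftovers."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(r"[^A-Za-z]", "", tok).lower()
        if len(tok) >= 2:
            cleaned.append(tok)
    return cleaned

def build_s1(l2_by_domain):
    """One record per domain name, with its cleaned word strings as sub-records."""
    return {domain: format_tokens(tokens) for domain, tokens in l2_by_domain.items()}
```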
As one or more embodiments, in S2, analyzing word parts of speech of the word obtained after the text formatting process; the method comprises the following specific steps:
the part of speech of the current word is obtained based on the suffix information in the word.
It should be understood that in the step S2, word parts of speech of the word are obtained after the text formatting process is analyzed; the method comprises the following specific steps:
A regular expression tagger is adopted: a tagset is formulated and converted into unified symbols, and information such as suffixes in English words is used to infer the part of speech of a word. The sub-records in data set S1 are matched in sequence, and a sub-record that matches none of the patterns is tagged with the most probable part of speech. Finally, each website domain name is recorded as a unit, with each English word and its corresponding part of speech as sub-records, and stored in a data set S2.
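NLTK provides a `RegexpTagger` class that works along these lines. The stdlib sketch below shows the mechanism only; the suffix patterns, the choice of noun as the most probable default tag, and the mapping to unified symbols (`n`, `v`, `a`, `r`) are all assumptions for illustration, not the patent's actual tagset.

```python
import re

# (suffix pattern, Penn-style tag); order matters, the first match wins.
SUFFIX_PATTERNS = [
    (r"ing$", "VBG"),
    (r"ed$", "VBD"),
    (r"ly$", "RB"),
    (r"(ous|ful|able)$", "JJ"),
    (r"s$", "NNS"),
]
# Unified symbols: noun, verb, adjective, adverb.
UNIFIED = {"VBG": "v", "VBD": "v", "RB": "r", "JJ": "a", "NNS": "n", "NN": "n"}
DEFAULT_TAG = "NN"  # assumed to be the most probable tag

def tag(words):
    """Return (word, unified_tag) pairs using suffix patterns with a default."""
    tagged = []
    for word in words:
        for pattern, label in SUFFIX_PATTERNS:
            if re.search(pattern, word):
                break
        else:
            label = DEFAULT_TAG  # no pattern matched
        tagged.append((word, UNIFIED[label]))
    return tagged
```

Suffix rules are heuristic: a word such as "news" would be tagged as a plural noun here, which is one reason a user-maintained word stock is useful later in the pipeline.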
As one or more embodiments, in S3, performing morphological reduction according to word parts of speech; the method comprises the following specific steps:
According to the part of speech of each word, the WordNet function is called to perform morphological reduction (lemmatization), reducing the various inflected forms of a word to the same form and generating a dictionary D1.
It should be understood that, in the step S3, morphological reduction is performed according to word parts of speech; the method comprises the following specific steps:
The English words and their corresponding parts of speech in each sub-record of data set S2 are extracted, the WordNet function is called to perform morphological reduction, the various inflected forms of each word are normalized into one form, and the results are recorded with each website domain name as a unit and stored in a data set S3.
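With NLTK, this step corresponds to `WordNetLemmatizer().lemmatize(word, pos)`, which needs the WordNet corpus to be installed. The sketch below swaps in a deliberately small rule-based stand-in so that it runs without any corpus; it only illustrates how the part of speech selects the reduction rule, and it will mishandle many real words.

```python
def lemmatize(word, pos):
    """Toy POS-dependent suffix stripping; a stand-in for a WordNet lookup."""
    if pos == "n":
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "y"          # categories -> category
        if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
            return word[:-1]                # games -> game
    if pos == "v":
        for suffix in ("ing", "ed"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                stem = word[: -len(suffix)]
                # Undo consonant doubling: shopping -> shopp -> shop.
                if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "lsz":
                    stem = stem[:-1]
                return stem
    return word

def build_dictionary(tagged_records):
    """Collect the lemmas of all (word, pos) sub-records into one set."""
    return {
        lemmatize(word, pos)
        for sub_records in tagged_records.values()
        for word, pos in sub_records
    }
```

The returned set plays the role of dictionary D1; merging it with a user word stock is then a plain set union.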
As one or more embodiments, in S3, storing the shape-reduced result in a word stock; the method comprises the following specific steps:
The user builds a personalized word stock D2, and the construction of word stock D2 is completed in NLTK by using the StanfordNLP toolkit; the union of the personalized word stock D2 and the dictionary D1 is taken to generate word stock D3, D3 = D1 ∪ D2.
As one or more embodiments, in S4, matching the website domain name of the word to be segmented with the word library by using a bidirectional maximum matching algorithm; the method comprises the following specific steps:
The website domain name to be segmented is matched with the word stock D3 using a forward maximum matching algorithm, and the matching result R1 is recorded;
The website domain name to be segmented is matched with the word stock D3 using a reverse maximum matching algorithm, and the matching result R2 is recorded;
If the matching result R1 is equal to the matching result R2, the matching result R1 is selected as the final word segmentation result of the website domain name to be segmented.
Further, if the matching result R1 is not equal to the matching result R2, then, of the forward maximum matching result R1 and the reverse maximum matching result R2 of the website domain name, the result with more Chinese and single English words is taken as the final result R3 of the bidirectional maximum matching algorithm for the website domain name to be matched.
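The forward, reverse and combined matching steps can be sketched as follows. The lexicon is a hypothetical toy word stock. The rule used when R1 and R2 differ is an assumption: the sketch applies the commonly used heuristic (fewer tokens, then fewer single-character tokens, forward result on a tie), which is not necessarily the selection criterion stated above.

```python
def forward_max_match(text, lexicon):
    """Scan left to right, always taking the longest lexicon word available."""
    longest = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # a lone character is the fallback
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, lexicon):
    """Scan right to left, always taking the longest lexicon word available."""
    longest = max((len(w) for w in lexicon), default=1)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(longest, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def bidirectional_max_match(text, lexicon):
    r1 = forward_max_match(text, lexicon)
    r2 = reverse_max_match(text, lexicon)
    if r1 == r2:
        return r1
    # Assumed heuristic: prefer fewer tokens, then fewer single characters.
    return min(r1, r2, key=lambda r: (len(r), sum(len(w) == 1 for w in r)))
```

For example, with the toy lexicon `{"game", "games", "site", "it"}`, the string "gamesite" is split differently by the two directions, and the combined step keeps the reverse result.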
It should be understood that in the step S4, matching the website domain name to be segmented with the word stock by adopting a bidirectional maximum matching algorithm; the method comprises the following specific steps:
First, the forward maximum matching algorithm is applied to the website domain name, which is compared with the word stock D3:
if the candidate is an English word, it is recorded; otherwise a single character is added and the comparison continues from left to right, ending when a single character is left;
if a character string cannot be segmented, it is treated as unregistered, and the processed website domain name is matched as a unit against the word stock D3 again; if the match is correct, the result R1 of the forward maximum matching algorithm for the website domain name is recorded.
Then the reverse maximum matching algorithm is applied to S3 and compared with the word stock D3:
if the candidate is an English word, it is recorded; otherwise a single character is removed and the comparison continues from right to left until a single character is left;
if a character string cannot be segmented, it is treated as unregistered, and the processed website domain name is matched as a unit against the word stock D3 again; if the match is correct, the result R2 of the reverse maximum matching algorithm for the website domain name is recorded.
If R1 is equal to R2, the result R1 of the forward maximum matching algorithm of the website domain name is selected as the final result R3 of the bidirectional maximum matching algorithm for the recorded website domain name.
If the matching result R1 is not equal to the matching result R2, then, of the forward maximum matching result R1 and the reverse maximum matching result R2 of the website domain name, the result with more Chinese and single English words is taken as the final result R3 of the bidirectional maximum matching algorithm for the website domain name to be matched.
The final result R3 is stored in a data set S4.
As one or more embodiments, in S4, if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm, which specifically includes the following steps:
If the website domain name to be segmented cannot be matched correctly, redundant character strings are cleaned out and the bidirectional maximum matching algorithm is run again, until all character strings of the website domain name to be segmented are correctly matched against the word stock D3 and stored in the data set S4, at which point the operation terminates; the resulting data set S4 is the word segmentation result of the website domain name to be segmented.
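The match, clean and re-match loop can be sketched as below. Several simplifications are assumptions made to keep the example short: a forward-only matcher stands in for the bidirectional one, cleaning is a single pass that removes long hexadecimal identifiers, digits and separators, and whatever still fails to match after cleaning is dropped. The lexicon is a hypothetical toy word stock.

```python
import re

def max_match(text, lexicon):
    """Forward maximum matching; returns (matched words, unmatched characters)."""
    longest = max((len(w) for w in lexicon), default=1)
    words, unmatched, i = [], [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                words.append(text[i:i + size])
                i += size
                break
        else:
            unmatched.append(text[i])  # no lexicon word starts here
            i += 1
    return words, unmatched

def clean(name):
    """Remove long hex identifiers, then split on anything that is not a letter."""
    name = re.sub(r"[0-9a-f]{16,}", " ", name.lower())
    return re.sub(r"[^a-z]+", " ", name).split()

def segment_with_cleaning(name, lexicon):
    words, unmatched = max_match(name.lower(), lexicon)
    if not unmatched:
        return words  # first attempt matched everything
    result = []
    for part in clean(name):
        part_words, _ = max_match(part, lexicon)
        result.extend(part_words)
    return result
```

With a lexicon containing "template", "head" and "zip", an identifier-laden name such as `wrd_template_head_06281609` reduces to `["template", "head"]`.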
As can be seen from FIG. 2, several kinds of interference may occur in website domain names. Some items, such as dldir1, have no practical meaning and need to be cleaned out. Some names concatenate word combinations: for samples in which several words are written together and which contain abbreviations and misspellings, the useful words must be selected, the meaningless words removed, and the abbreviated or misspelled words restored with maximum probability;
Some names mix in character identifiers, such as 80002486_fa55fa1d3a4b43bab792c6a8ff463f72.Zip and wrd _template_head_06281609: for such samples, the identifiers must be deleted and the meaningful words extracted, transformations of the words such as tense and passive voice must be restored, and the file suffix must be given a higher weight because it is highly discriminative of the file's nature.
FIG. 3 is a piece of data processed by a word segmentation technique based on very small text of a web site domain name.
TABLE 1 case 1
Table 2 case 2
TABLE 3 case 3
TABLE 4 case 4
TABLE 5 case 5
Embodiment 2 provides a text word segmentation system based on a website domain name;
a text word segmentation system based on web site domain name, comprising:
a data acquisition module configured to: collecting a plurality of website domain names; word segmentation processing is carried out on each website domain name;
a text formatting module configured to: carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process;
a lexical reduction module configured to: performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library;
a match output module configured to: matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm.
Embodiment 3 provides an electronic device including a memory, a processor, and computer instructions stored on the memory and run on the processor, which, when executed by the processor, perform the steps of the method of the first embodiment.
Embodiment 4 provides a computer readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the method of the first embodiment.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (7)

1. The text word segmentation method based on the website domain name is characterized by comprising the following steps:
data acquisition, namely acquiring a plurality of website domain names; word segmentation processing is carried out on each website domain name, wherein the website domain name is extracted and browsed to a column vector L1 according to each user as a unit; based on the Trie structure, realizing efficient word graph scanning, generating a directed acyclic graph formed by all word forming conditions of Chinese and English in sentences, adopting dynamic programming to search a maximum probability path, finding out a maximum segmentation combination based on word frequency, inputting a website domain name column vector L1 into a jieba word segmentation full-mode model, eliminating symbols, scanning out all character strings which are regarded as words and are contained in each record, and storing the character strings into a column vector L2;
carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process, wherein text formatting operation is carried out on the column vector L2, sign symbols and useless characters are thoroughly deleted, a website domain name is used as a unit for recording, and a plurality of word character strings contained in the character strings are used as sub-records and stored in a data set S1;
adopting a regular expression tagger, formulating a tagset and converting it into unified symbols, using suffix information in English words to infer the part of speech of a word, matching the sub-records in data set S1 in sequence, marking a sub-record as the part of speech with the highest probability when all the sub-records are not matched, finally recording according to a website domain name as a unit, taking each English word and its corresponding part of speech as sub-records, and storing them in a data set S2;
performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library, specifically: according to the word part of speech, calling the WordNet function to perform the morphological reduction operation, thereby reducing the various inflections of words into the same form and generating a dictionary D1;
building, by the user, a personalized word stock D2, the construction of word stock D2 being completed in NLTK by using the StanfordNLP toolkit; taking the union of the personalized word stock D2 and the dictionary D1 to generate word stock D3, D3 = D1 ∪ D2;
matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm, wherein the method specifically comprises the following steps: matching the website domain name to be segmented with the word stock D3 by adopting a forward maximum matching algorithm, and recording the matching result R1; matching the website domain name to be segmented with the word stock D3 by adopting a reverse maximum matching algorithm, and recording the matching result R2;
if the matching result R1 is equal to the matching result R2, selecting the matching result R1 as the final word segmentation result of the website domain name to be segmented;
if the matching result R1 is not equal to the matching result R2, selecting, from the result R1 of the forward maximum matching algorithm of the website domain name and the result R2 of the reverse maximum matching algorithm of the website domain name, the result with more Chinese and single English words as the final result R3 of the bidirectional maximum matching algorithm of the website domain name to be matched.
2. The method of claim 1, wherein data collection is performed to collect a plurality of web site names; the method comprises the following specific steps:
and collecting a plurality of website domain names, removing set sensitive words from each website domain name, and storing the website domain names with the sensitive words removed according to time units into a data set S.
3. The method of claim 1, wherein after the step of collecting a plurality of web site domain names, before the step of word segmentation for each web site domain name, further comprises: a data preprocessing step; the data preprocessing step comprises the following steps:
S101: deleting the missing value or complementing the missing value of each website domain name in the data set S;
S102: extracting the website domain name to a column vector by taking the user as a unit.
4. The method of claim 1, wherein each web site domain name is subjected to word segmentation; the method comprises the following specific steps: and performing word segmentation processing on each website domain name by utilizing a jieba word segmentation tool.
5. The text word segmentation system based on the website domain name is characterized by comprising the following components:
a data acquisition module configured to: collecting a plurality of website domain names; word segmentation processing is carried out on each website domain name, wherein the website domain name is extracted and browsed to a column vector L1 according to each user as a unit; based on the Trie structure, realizing efficient word graph scanning, generating a directed acyclic graph formed by all word forming conditions of Chinese and English in sentences, adopting dynamic programming to search a maximum probability path, finding out a maximum segmentation combination based on word frequency, inputting a website domain name column vector L1 into a jieba word segmentation full-mode model, eliminating symbols, scanning out all character strings which are regarded as words and are contained in each record, and storing the character strings into a column vector L2;
a text formatting module configured to: perform text formatting on the words obtained from word segmentation; and analyze the parts of speech of the words obtained after text formatting, wherein a text formatting operation is performed on the column vector L2 to thoroughly delete symbols and useless characters, each website domain name is taken as one record, and the word character strings it contains are taken as sub-records and stored in a data set S1;
a regular expression tagger is adopted, the tagset is formulated and converted into unified symbols, and suffix information in English words is used to infer the part of speech of each word; the sub-records in the data set S1 are matched in sequence, and a sub-record that matches none of the expressions is labeled with the part of speech of highest probability; finally, with each website domain name as one record, each English word and its corresponding part of speech are taken as sub-records and stored in a data set S2;
a lexical reduction module configured to: perform morphological reduction according to the parts of speech of the words; and store the results of the morphological reduction in a word stock, specifically: according to the part of speech of each word, a WordNet function is called to perform the morphological reduction operation, so that the various inflected forms of a word are reduced to the same form, generating a dictionary D1;
a user builds a personalized word stock D2, the construction of the word stock D2 being completed in NLTK using the StanfordNLP toolkit; and the union of the personalized word stock D2 and the dictionary D1 is taken to generate the word stock;
a matching output module configured to: match the website domain name to be segmented against the word stock using a bidirectional maximum matching algorithm, and, if the matching succeeds, obtain a text vectorization result; if the matching fails, clean the website domain name to be segmented and match the cleaned result against the word stock again using the bidirectional maximum matching algorithm, specifically: the website domain name to be segmented is matched against the word stock using a forward maximum matching algorithm, and the matching result R1 is recorded; the website domain name to be segmented is matched against the word stock using a reverse maximum matching algorithm, and the matching result R2 is recorded;
if the matching result R1 is equal to the matching result R2, that matching result is selected as the final word segmentation result of the website domain name to be segmented;
if the matching result R1 is not equal to the matching result R2, then of the result R1 of the forward maximum matching algorithm and the result R2 of the reverse maximum matching algorithm for the website domain name, the one containing more Chinese words and single English words is selected as the final result R3 of the bidirectional maximum matching algorithm for the website domain name to be matched.
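The suffix-based part-of-speech inference described for the text formatting module can be illustrated with a small regular expression tagger. The sketch below uses only Python's `re` module as a stand-in; the pattern list, the Penn Treebank style tags, and the choice of `NN` as the highest-probability fallback tag are assumptions, not details taken from the claims.

```python
import re

# Ordered (pattern, tag) pairs; the first pattern that matches wins.
# The patterns and tags are illustrative assumptions.
SUFFIX_PATTERNS = [
    (re.compile(r".*ing$"), "VBG"),   # gerunds, e.g. "shopping"
    (re.compile(r".*ed$"), "VBD"),    # simple past, e.g. "linked"
    (re.compile(r".*ould$"), "MD"),   # modals, e.g. "would"
    (re.compile(r".*s$"), "NNS"),     # plural nouns, e.g. "games"
    (re.compile(r"^\d+$"), "CD"),     # cardinal numbers, e.g. "360"
]

# Assumed fallback: nouns are treated as the most probable part of speech.
DEFAULT_TAG = "NN"


def tag_word(word):
    """Return the tag of the first matching suffix pattern, else the default."""
    for pattern, tag in SUFFIX_PATTERNS:
        if pattern.match(word):
            return tag
    return DEFAULT_TAG


def tag_words(words):
    """Tag each word of one domain-name record; returns (word, tag) sub-records."""
    return [(word, tag_word(word)) for word in words]


tagged = tag_words(["shopping", "linked", "games", "360", "mail"])
print(tagged)
```

Because the patterns are tried in order, more specific suffixes should be listed before general ones such as the plural `s`.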
6. An electronic device comprising a memory, a processor, and computer instructions stored on the memory and run on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-4.
7. A computer readable storage medium storing computer instructions which, when executed by a processor, cause the processor to perform the steps of the method of any one of claims 1-4.
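The bidirectional maximum matching step recited in claim 5 can be sketched as below. This is a hedged illustration: the function names and the toy `lexicon` are hypothetical, and the selection rule used when the two results differ (prefer the result with more tokens found in the word stock) is only one reading of the claim's criterion, which is ambiguous in translation.

```python
def forward_max_match(text, lexicon, max_len):
    """Scan left to right, taking the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len):
    """Scan right to left, taking the longest lexicon entry ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text, lexicon):
    """Return R1 if R1 == R2, else the result with more lexicon tokens (assumed rule)."""
    max_len = max((len(word) for word in lexicon), default=1)
    r1 = forward_max_match(text, lexicon, max_len)
    r2 = backward_max_match(text, lexicon, max_len)
    if r1 == r2:
        return r1

    def recognized(tokens):
        return sum(token in lexicon for token in tokens)

    # Ties go to the forward result; this tie-break is also an assumption.
    return r1 if recognized(r1) >= recognized(r2) else r2


lexicon = {"new", "news", "site", "now", "here"}
print(bidirectional_max_match("newsite", lexicon))
print(bidirectional_max_match("nowhere", lexicon))
```

For "newsite" the forward scan greedily takes "news" and strands the remaining letters, while the reverse scan recovers "new" and "site", so the reverse result is selected under the assumed rule.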
CN201911367979.5A 2019-12-26 2019-12-26 Text word segmentation method, system, equipment and medium based on website domain name Active CN111104801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911367979.5A CN111104801B (en) 2019-12-26 2019-12-26 Text word segmentation method, system, equipment and medium based on website domain name

Publications (2)

Publication Number Publication Date
CN111104801A (en) 2020-05-05
CN111104801B (en) 2023-09-26

Family

ID=70424414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911367979.5A Active CN111104801B (en) 2019-12-26 2019-12-26 Text word segmentation method, system, equipment and medium based on website domain name

Country Status (1)

Country Link
CN (1) CN111104801B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992376A (en) * 2021-03-04 2021-06-18 山东大学 Disease name matching method and system based on weight adjustment
CN113095050A (en) * 2021-04-19 2021-07-09 广东电网有限责任公司 Intelligent ticketing method, system, equipment and storage medium
CN113645240B (en) * 2021-08-11 2023-05-23 积至(海南)信息技术有限公司 Malicious domain name community mining method based on graph structure
CN113806477A (en) * 2021-08-26 2021-12-17 广东广信通信服务有限公司 Automatic text labeling method, device, terminal and storage medium
CN116579344B (en) * 2023-07-12 2023-10-20 吉奥时空信息技术股份有限公司 Case main body extraction method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901249A (en) * 2009-05-26 2010-12-01 复旦大学 Text-based query expansion and sort method in image retrieval
CN105975454A (en) * 2016-04-21 2016-09-28 广州精点计算机科技有限公司 Chinese word segmentation method and device of webpage text
CN108228710A (en) * 2017-11-30 2018-06-29 中国科学院信息工程研究所 A kind of segmenting method and device for URL
CN108509419A (en) * 2018-03-21 2018-09-07 山东中医药大学 Ancient TCM books document participle and part of speech indexing method and system
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method
CN109344263A (en) * 2018-08-01 2019-02-15 昆明理工大学 A kind of address matching method
CN110457466A (en) * 2019-06-28 2019-11-15 谭浩 Generate method, computer readable storage medium and the terminal device of interview report

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
党倩娜.数据预处理与文本分词.《新兴技术弱信号监测机制研究》.2018,第89-92页. *

Also Published As

Publication number Publication date
CN111104801A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN111104801B (en) Text word segmentation method, system, equipment and medium based on website domain name
Nayak et al. Survey on pre-processing techniques for text mining
US7461056B2 (en) Text mining apparatus and associated methods
Huston et al. Evaluating verbose query processing techniques
US7424421B2 (en) Word collection method and system for use in word-breaking
Ladani et al. Stopword identification and removal techniques on tc and ir applications: A survey
US20050251384A1 (en) Word extraction method and system for use in word-breaking
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN103678412A (en) Document retrieval method and device
TWI656450B (en) Method and system for extracting knowledge from Chinese corpus
Albishre et al. Effective 20 newsgroups dataset cleaning
CN104346382B (en) Use the text analysis system and method for language inquiry
Jia et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth
Osman et al. Stemming Tigrinya words for information retrieval
Govilkar et al. Extraction of root words using morphological analyzer for devanagari script
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Elakiya et al. Designing preprocessing framework (ERT) for text mining application
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
Al-Sultany et al. Enriching tweets for topic modeling via linking to the wikipedia
Husain et al. A language Independent Approach to develop Urdu stemmer
Patil et al. Inflectional and derivational hybrid stemmer for sentiment analysis: a case study with Marathi tweets
Hajjem et al. Building comparable corpora from social networks
JP4148247B2 (en) Vocabulary acquisition method and apparatus, program, and computer-readable recording medium
TWI534640B (en) Chinese network information monitoring and analysis system and its method
Simo et al. Regrets: A new corpus of regrettable (self-) disclosures on social media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant