CN111104801B - Text word segmentation method, system, equipment and medium based on website domain name

Info

Publication number: CN111104801B
Application number: CN201911367979.5A
Authority: CN
Inventors: 杜韬; 李依谦; 曲守宁; 朱连江; 王信堂; 王希普
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2023-09-26
Anticipated expiration: 2039-12-26
Also published as: CN111104801A

Abstract

The application discloses a text word segmentation method, system, device and medium based on website domain names. The method comprises: data acquisition, in which a plurality of website domain names are collected; performing word segmentation on each website domain name; performing text formatting on the segmented words; analyzing the part of speech of the words obtained after text formatting; performing morphological reduction (lemmatization) according to part of speech; storing the morphologically reduced results in a word library; matching the website domain name to be segmented against the word library using a bidirectional maximum matching algorithm and, if the matching succeeds, obtaining a text vectorization result; if the matching fails, cleaning the website domain name to be segmented and matching the cleaned result against the word library again using the bidirectional maximum matching algorithm.

Description

Text word segmentation method, system, equipment and medium based on website domain name

Technical Field

The present disclosure relates to the field of natural language processing technologies, and in particular, to a method, a system, a device, and a medium for text word segmentation based on web site domain names.

Background

The statements in this section merely mention background art related to the present disclosure and do not necessarily constitute prior art. The present disclosure is premised on not tracking user behavior and not obtaining user privacy.

In recent years, the Internet has become one of the most important infrastructures of human society, and it influences people's economic and social activities ever more widely and deeply. A user's jumps between different websites can be regarded as that user's behavior track. Among the huge volume of online behavior data these jumps generate, the website domain name is the most representative item: it carries the name and nature of the web page the user browsed, and it reflects both the user's preferences among websites and the relevance between the corresponding websites.

The website domain name mainly consists of English letters, Arabic numerals and special characters such as "_", "@" and "/", and is intended to make the addresses of a group of servers (website, email, FTP, and the like) easier to remember and communicate.

In the process of implementing the present disclosure, the inventor finds that the following technical problems exist in the prior art:

First: the website domain name is extremely short, so existing word segmentation techniques cannot effectively extract keywords from it.

Second: website domain names are irregular, unstructured text, which makes it difficult to extract refined, understandable knowledge from them and to vectorize the text later.

Third: companies, organizations and individuals name their website domain names according to personal habit, so abbreviations, misspellings, mixed languages and the like often occur.

Fourth: web mining on existing website domain names has excessively high time and space complexity and easily leads to the curse of dimensionality.

These problems prevent data analysts from quickly obtaining the nature of a web page from its website domain name, which reduces the accuracy and efficiency of analyzing users' online behavior.

Disclosure of Invention

To remedy the deficiencies of the prior art, the present disclosure provides a text word segmentation method, system, device and medium based on website domain names. The method can perform text analysis on any existing website domain name and can extract the keywords in the website domain name with higher accuracy.

In a first aspect, the present disclosure provides a text word segmentation method based on a website domain name;

the text word segmentation method based on the website domain name comprises the following steps:

data acquisition, namely acquiring a plurality of website domain names; word segmentation processing is carried out on each website domain name;

carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process;

performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library;

matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm.

In a second aspect, the present disclosure further provides a text word segmentation system based on a website domain name;

a text word segmentation system based on web site domain name, comprising:

a data acquisition module configured to: collecting a plurality of website domain names; word segmentation processing is carried out on each website domain name;

a text formatting module configured to: carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process;

a lexical reduction module configured to: performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library;

a match output module configured to: matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm.

In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.

In a fourth aspect, the present disclosure also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.

Compared with the prior art, the beneficial effects of the present disclosure are:

The method can more quickly remove redundant domain-name parts, meaningless identifiers and similar information that companies, organizations or individuals introduce when naming their websites; it can correct misspellings in domain names with higher accuracy; and, by combining a personalized word stock with an official dictionary, it can segment the main information in a domain name more efficiently and in a more targeted way. This provides reliable preparation for vectorizing website domain names in subsequent online behavior analysis. Where rules must be derived from the behavior tracks of a huge number of users, the method improves on the traditional approach, in which users' online behavior had to be recorded and loaded one by one and then classified manually according to the nature of the web pages.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application.

FIG. 1 is a flow chart of a method of a first embodiment;

FIG. 2 is a random piece of original data after data acquisition according to the first embodiment;

FIG. 3 is a piece of data processed by the word segmentation technique for extremely short text based on website domain names according to the first embodiment.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Embodiment 1 provides a text word segmentation method based on a website domain name;

As shown in FIG. 1, the text word segmentation method based on the website domain name comprises the following steps:

S1: data acquisition, namely acquiring a plurality of website domain names; word segmentation processing is carried out on each website domain name;

S2: carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process;

S3: performing morphological reduction according to word parts of speech; storing the word shape restored result into a word library;

S4: matching the website domain name to be segmented with a word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm.

As one or more embodiments, in the step S1, data is collected, and a plurality of website domain names are collected; the method comprises the following specific steps:

collecting a plurality of website domain names, removing preset sensitive words from each website domain name, and storing the website domain names with the sensitive words removed into a data set S by time unit.

As one or more embodiments, after the step of collecting the plurality of website domain names, before the step of word segmentation processing is performed on each website domain name, the method further includes: a data preprocessing step; the data preprocessing step comprises the following steps:

s101: deleting the missing value or complementing the missing value of each website domain name in the data set S;

s102: and extracting the website domain name to a column vector by taking the user as a unit.

It should be understood that after the step of collecting a plurality of website domain names, before the step of word segmentation processing is performed on each website domain name, the method further includes: a data preprocessing step; the data preprocessing step comprises the following steps:

Data preprocessing and denoising are performed on the data set S: if an attribute contains only a very small number of missing values, the records with missing values are deleted; if an attribute contains a larger share of missing values, they can be filled in by same-class mean imputation.
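The missing-value rule above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `handle_missing`, the record layout (a list of dicts), the `group_key` that defines the "same class" for the mean, and the 5% threshold separating "very few" from "more" missing values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def handle_missing(rows, attr, group_key, drop_ratio=0.05):
    """Drop rows missing `attr` when few are missing, else fill with the class mean.

    `drop_ratio` is an assumed threshold; the source does not give one.
    """
    missing = [r for r in rows if r.get(attr) is None]
    if not missing:
        return [dict(r) for r in rows]
    if len(missing) / len(rows) <= drop_ratio:
        # Very few missing values: delete the affected records.
        return [dict(r) for r in rows if r.get(attr) is not None]

    # Otherwise collect the known values per class and impute the class mean.
    known = defaultdict(list)
    for r in rows:
        if r.get(attr) is not None:
            known[r[group_key]].append(r[attr])

    filled = []
    for r in rows:
        r = dict(r)
        if r.get(attr) is None:
            values = known.get(r[group_key])
            if not values:
                continue  # no same-class value to average, so drop the record
            r[attr] = mean(values)
        filled.append(r)
    return filled
```

Raising `drop_ratio` makes the function prefer deletion over imputation; the right value depends on the data set and is not stated in the source.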

When the text of the data is segmented, the raw data is as shown in FIG. 2; it contains information such as the server and the user terminal. To analyze users' online behavior, the text needs to be split on certain marks, and the domain names of the browsed websites are extracted into a column vector L₁ with each user as a unit.
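A minimal sketch of the per-user extraction step is given below. The record layout (pairs of user id and visited URL) and the function name `domains_by_user` are assumptions made for illustration; the actual log format of FIG. 2 is not reproduced in the text.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domains_by_user(records):
    """Group the host part of each visited URL by user id (one list per user)."""
    column = defaultdict(list)
    for user_id, url in records:
        if not url:
            continue  # missing value: nothing to extract
        # urlsplit only fills the host when a scheme or a leading "//" is present.
        host = urlsplit(url if "//" in url else "//" + url).hostname
        if host:
            column[user_id].append(host)
    return dict(column)
```

Each value of the returned dict plays the role of one user's slice of the column vector L₁.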

As one or more embodiments, in the S1, word segmentation is performed on each website domain name; the method comprises the following specific steps:

and performing word segmentation processing on each website domain name by utilizing a jieba word segmentation tool.

It should be understood that in the step S1, word segmentation is performed on each website domain name; the method comprises the following specific steps:

Efficient word-graph scanning is implemented on a Trie structure, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese and English text in a sentence; dynamic programming is used to search for the maximum-probability path and find the best segmentation combination based on word frequency. The website domain name column vector L₁ is input into the jieba full-mode segmenter, symbols are removed, all character strings in each record that can be regarded as words are scanned out, and they are stored in a column vector L₂.
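The DAG construction and the dynamic-programming search described above can be illustrated with a small pure-Python sketch. It is not jieba's code: a plain dict lookup stands in for the Trie, and the names `build_dag`, `full_mode`, `best_path` and the toy frequency table are hypothetical. In jieba itself, full mode is requested with `jieba.cut(text, cut_all=True)`.

```python
import math


def build_dag(text, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in freq]
        dag[i] = ends or [i]  # fall back to the single character
    return dag


def full_mode(text, freq):
    """Scan out every substring that is a dictionary word (full-mode style)."""
    dag = build_dag(text, freq)
    return [text[i:j + 1] for i in range(len(text)) for j in dag[i]
            if text[i:j + 1] in freq]


def best_path(text, freq):
    """Pick the segmentation with the highest summed log word probability."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(text, freq)
    n = len(text)
    route = {n: (0.0, 0)}
    # Work backwards so route[j + 1] is known when route[i] is computed.
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(text[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1] + 1
        words.append(text[i:j])
        i = j
    return words
```

With a frequency table in which "facebook" is far more frequent than "face" and "book", `best_path` keeps the domain label whole, while `full_mode` still reports all three candidates for the later steps.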

As one or more embodiments, in S2, text formatting is performed on the word after the word segmentation; the method comprises the following specific steps:

and carrying out text formatting processing on the word subjected to word segmentation processing, and deleting the mark symbol and the set useless character.

It should be understood that, in the step S2, text formatting is performed on the word after word segmentation; the method comprises the following specific steps:

A text formatting operation is performed on column vector L₂ to thoroughly delete marker symbols and useless characters. Each website domain name is recorded as a unit, the word character strings it contains are taken as sub-records, and the result is stored in data set S₁.
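One way to picture the formatting step is the sketch below. The separator pattern, the set of "useless" strings and the function name `format_tokens` are assumptions chosen for illustration; the source does not list which symbols or characters are removed.

```python
import re

# Anything that is not a letter, digit or CJK character is treated as a marker symbol.
_SEPARATORS = re.compile(r"[^A-Za-z0-9\u4e00-\u9fff]+")


def format_tokens(tokens, useless=frozenset({"www", "http", "https", "com"})):
    """Lower-case tokens, split them on marker symbols and drop useless strings."""
    cleaned = []
    for token in tokens:
        for part in _SEPARATORS.split(token.lower()):
            if part and part not in useless:
                cleaned.append(part)
    return cleaned


def build_s1(l2_by_domain):
    """Store the cleaned word strings as sub-records keyed by website domain name."""
    return {domain: format_tokens(tokens) for domain, tokens in l2_by_domain.items()}
```

The dict returned by `build_s1` corresponds to data set S₁: one record per domain name, with its word strings as sub-records.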

As one or more embodiments, in S2, analyzing word parts of speech of the word obtained after the text formatting process; the method comprises the following specific steps:

the part of speech of the current word is obtained based on the suffix information in the word.

It should be understood that in the step S2, word parts of speech of the word are obtained after the text formatting process is analyzed; the method comprises the following specific steps:

A regular expression tagger is used: a tagset is defined and converted into unified symbols, and information such as suffixes in English words is used to infer the part of speech of a word. The sub-records in data set S₁ are matched against the patterns in sequence, and when no pattern matches, the word is tagged with the most probable part of speech. Finally, each website domain name is recorded as a unit, each English word and its corresponding part of speech are taken as a sub-record, and the result is stored in data set S₂.
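The suffix-based tagging can be sketched in a few lines. NLTK provides a `RegexpTagger` class of this kind; the version below avoids the dependency so that it runs as written. The pattern list, the Penn-Treebank-style tag names and the default tag `NN` are assumptions, since the source does not give its tagset.

```python
import re

# Patterns are tried in order; the first match wins.
PATTERNS = [
    (re.compile(r"^\d+$"), "CD"),                # cardinal number
    (re.compile(r".*ing$"), "VBG"),              # gerund
    (re.compile(r".*ed$"), "VBD"),               # past tense
    (re.compile(r".*ly$"), "RB"),                # adverb
    (re.compile(r".*(ness|ment|tion)$"), "NN"),  # noun-forming suffixes
    (re.compile(r".*s$"), "NNS"),                # plural noun
]
DEFAULT_TAG = "NN"  # assumed most probable tag when nothing matches


def tag_word(word):
    """Return the tag of the first matching suffix pattern, else the default tag."""
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return DEFAULT_TAG


def build_s2(s1):
    """Pair every word of every domain record with its inferred part of speech."""
    return {domain: [(w, tag_word(w)) for w in words] for domain, words in s1.items()}
```

Pattern order matters: "business" ends in "s" but is caught by the earlier "ness" rule, so it is tagged as a singular noun.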

As one or more embodiments, in S3, performing morphological reduction according to word parts of speech; the method comprises the following specific steps:

According to the part of speech of each word, the WordNet function is called to perform morphological reduction (lemmatization), so that the various inflected forms of a word are reduced to the same form, generating dictionary D₁.

It should be understood that, in the step S3, morphological reduction is performed according to word parts of speech; the method comprises the following specific steps:

The English words and their corresponding parts of speech are extracted from each sub-record of data set S₂, the WordNet function is called to perform the morphological reduction operation so that the inflected forms of each word are normalized to one form, and the results are recorded with each website domain name as a unit and stored in data set S₃.
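The part-of-speech-driven reduction can be illustrated without WordNet data by a small stand-in that works the same general way as WordNet's morphological lookup: try suffix detachment rules for the given part of speech and accept a candidate only if it is a known lemma. Everything here (`RULES`, `TAG_TO_POS`, the `lexicon` argument, the exception table) is a hypothetical toy; in NLTK the corresponding call is `WordNetLemmatizer().lemmatize(word, pos)`.

```python
# Suffix detachment rules per part of speech: (suffix, replacement).
RULES = {
    "n": [("ies", "y"), ("ses", "s"), ("xes", "x"), ("ches", "ch"),
          ("shes", "sh"), ("s", "")],
    "v": [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"), ("ed", ""),
          ("ing", "e"), ("ing", ""), ("s", "")],
}
# Assumed mapping from tagger output to a coarse part of speech.
TAG_TO_POS = {"NN": "n", "NNS": "n", "VBG": "v", "VBD": "v"}


def lemmatize(word, tag, lexicon, exceptions=None):
    """Reduce `word` to a lemma found in `lexicon`, or return it unchanged."""
    pos = TAG_TO_POS.get(tag)
    if pos is None:
        return word
    if exceptions and (word, pos) in exceptions:
        return exceptions[(word, pos)]  # irregular forms are looked up directly
    if word in lexicon:
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in lexicon:
                return candidate
    return word
```

Checking each candidate against the lexicon is what separates this from blind suffix stripping: "playing" becomes "play" only because "play" is a known lemma.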

As one or more embodiments, in S3, storing the shape-reduced result in a word stock; the method comprises the following specific steps:

The user builds a personalized word stock D₂, and the operations on word stock D₂ are completed in NLTK using the StanfordNLP toolkit; the union of the personalized word stock D₂ and the dictionary D₁ is taken to generate word stock D₃, D₃ = D₁ ∪ D₂.
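The union D₃ = D₁ ∪ D₂ maps directly onto a set operation. The sketch below assumes that D₁ is derived from the lemmatized records and that D₂ arrives as a plain list of user-supplied strings; the function name `build_word_stock` is hypothetical.

```python
def build_word_stock(s3, custom_words):
    """Return D3 as the union of dictionary D1 (from S3) and personalized stock D2."""
    d1 = {word for words in s3.values() for word in words}
    # Normalize user entries so that "QQ " and "qq" count as the same word.
    d2 = {w.strip().lower() for w in custom_words if w.strip()}
    return d1 | d2
```

Using a set keeps later membership tests in the matching step constant time on average.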

As one or more embodiments, in S4, matching the website domain name of the word to be segmented with the word library by using a bidirectional maximum matching algorithm; the method comprises the following specific steps:

matching the website domain name to be segmented against word stock D₃ using a forward maximum matching algorithm, and recording the matching result R₁;

matching the website domain name to be segmented against word stock D₃ using a reverse maximum matching algorithm, and recording the matching result R₂;

if the matching result R₁ is equal to the matching result R₂, selecting the matching result R₁ as the final word segmentation result of the website domain name to be segmented.

Further, if the matching result R₁ is not equal to the matching result R₂, whichever of the forward maximum matching result R₁ and the reverse maximum matching result R₂ has more Chinese and single English words is taken as the final result R₃ of the bidirectional maximum matching algorithm for the website domain name to be matched.
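The two passes that produce R₁ and R₂ can be sketched as follows. This is the textbook form of forward and reverse maximum matching over a set-based word stock, with unmatched characters emitted one at a time; the function names and the `max_len` parameter are illustrative assumptions.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Scan left to right, always taking the longest lexicon word that fits."""
    max_len = max_len or max(map(len, lexicon), default=1)
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always pass
                result.append(piece)
                i += size
                break
    return result


def reverse_max_match(text, lexicon, max_len=None):
    """Scan right to left, always taking the longest lexicon word that fits."""
    max_len = max_len or max(map(len, lexicon), default=1)
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                result.append(piece)
                j -= size
                break
    result.reverse()  # pieces were collected from the end of the string
    return result
```

For the string "therein" and the word stock {"the", "there", "rein", "in"}, the forward pass yields "there" + "in" while the reverse pass yields "the" + "rein", which is exactly the disagreement the selection rule has to resolve.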

It should be understood that in the step S4, matching the website domain name to be segmented with the word stock by adopting a bidirectional maximum matching algorithm; the method comprises the following specific steps:

First, the forward maximum matching algorithm for website domain names is applied, and the website domain name is compared with word stock D₃:

if the candidate is an English word, it is recorded; otherwise, a single character is added and the comparison continues from left to right until a single character is left, at which point the pass ends;

if the character string cannot be segmented, it is treated as unregistered, and the processed website domain name is matched as a unit against word stock D₃ again; if the records match correctly, the result R₁ of the forward maximum matching algorithm for the website domain name is recorded;

Then, for S₃, the reverse maximum matching algorithm for website domain names is applied, and the website domain name is compared with word stock D₃:

if the candidate is an English word, it is recorded; otherwise, a single character is removed and the comparison continues from right to left until a single character is left;

if the character string cannot be segmented, it is treated as unregistered, and the processed website domain name is matched as a unit against word stock D₃ again; if the records match correctly, the result R₂ of the reverse maximum matching algorithm for the website domain name is recorded.

If R₁ is equal to R₂, the result R₁ of the forward maximum matching algorithm can be selected as the final result R₃ of the bidirectional maximum matching algorithm for the recorded website domain name;

If the matching result R₁ is not equal to the matching result R₂, whichever of the forward maximum matching result R₁ and the reverse maximum matching result R₂ has more Chinese and single English words is taken as the final result R₃ of the bidirectional maximum matching algorithm for the website domain name to be matched;

The final result R₃ is stored in data set S₄.
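The choice between R₁ and R₂ can be sketched as a scoring function. The translated criterion in the text is hard to interpret, so the sketch below substitutes the commonly used bidirectional-matching heuristic (fewer pieces outside the word stock, then fewer pieces overall, then fewer single-character pieces). This is an assumption and may differ from the rule the patent intends; `choose_segmentation` is a hypothetical name.

```python
def choose_segmentation(r1, r2, lexicon):
    """Return R3: R1 if both passes agree, else the better-scoring result."""
    if r1 == r2:
        return r1

    def score(result):
        unknown = sum(1 for w in result if w not in lexicon)
        singles = sum(1 for w in result if len(w) == 1)
        # Lower is better on every component (assumed heuristic, see lead-in).
        return (unknown, len(result), singles)

    # min() keeps the first argument on a tie, so ties fall back to the forward result.
    return min((r1, r2), key=score)
```

If the patent's own criterion is preferred, only the `score` function needs to change; the surrounding control flow stays the same.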

As one or more embodiments, in S4, if the matching fails, cleaning the website domain name of the word to be segmented, and matching the cleaned result with the word library again by adopting a bidirectional maximum matching algorithm, which specifically includes the following steps:

If the website domain name to be segmented cannot be matched correctly, the redundant character strings are cleaned and the bidirectional maximum matching algorithm is applied again, until all character strings of the website domain name to be segmented are correctly matched against word stock D₃ and stored in data set S₄, at which point the operation terminates; the resulting data set S₄ is the word segmentation result of the website domain name to be segmented.
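The match, clean and re-match loop can be sketched as below. For brevity the sketch uses a forward pass only in place of the full bidirectional matcher, and it takes "cleaning" to mean splitting on non-letters and discarding pieces absent from the word stock; both are simplifying assumptions, as are the function names.

```python
import re


def forward_max_match(text, lexicon):
    """Longest-match-first scan; unmatched characters come out one at a time."""
    longest = max(map(len, lexicon), default=1)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in lexicon:
                out.append(text[i:i + size])
                i += size
                break
    return out


def segment_with_cleaning(name, lexicon, max_rounds=5):
    """Re-run matching on a cleaned string until every piece is in the word stock."""
    chunks = [name.lower()]
    words = []
    for _ in range(max_rounds):
        words = [w for chunk in chunks for w in forward_max_match(chunk, lexicon)]
        if all(w in lexicon for w in words):
            return words  # everything matched: this is the entry stored in S4
        # Cleaning: split on non-letters and keep only pieces found in the word stock.
        chunks = [p for w in words for p in re.split(r"[^a-z]+", w) if p in lexicon]
    return words
```

On an input such as "dldir1_game-download", the first round leaves stray characters from "dldir1" and the separators; after cleaning, the second round returns only the words found in the word stock.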

As can be seen from FIG. 2, several kinds of interference can occur in website domain names. There are interference terms, such as dldir1, which have no practical meaning and need to be cleaned away. There are concatenated word combinations: for samples in which several words are written together and which contain abbreviations and misspellings, the useful words must be selected, the meaningless words removed, and the abbreviated and misspelled words restored with maximum probability;

There is also mixed naming with character identifiers, such as: 80002486_fa55fa1d3a4b43bab792c6a8ff463f72.Zip, wrd _template_head_06281609. For such samples, the identifiers must be deleted and the meaningful words in the sample extracted, inflections of the words such as tense and passive voice must be restored, and the file suffix must be given a higher weight because it is highly discriminative with respect to the nature of the resource.
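The handling of identifier-laden names and misspellings can be illustrated with the standard library alone. The rule for what counts as an identifier (pure digits, or a hex-like run of six or more characters containing a digit), the weight value 2.0, the similarity cutoff 0.8 and both function names are assumptions made for this sketch; the source gives no concrete values.

```python
import difflib
import re

# A run of 6+ hex characters that contains at least one digit is treated as an identifier.
HEX_ID = re.compile(r"^(?=.*\d)[0-9a-f]{6,}$")


def clean_mixed_name(name, ext_weight=2.0):
    """Drop identifiers, keep candidate words, and up-weight the file suffix."""
    stem, dot, ext = name.lower().rpartition(".")
    if not dot:  # no "." in the name, so there is no file suffix
        stem, ext = ext, ""
    weights = {}
    for part in re.split(r"[^a-z0-9]+", stem):
        if not part or part.isdigit() or HEX_ID.match(part):
            continue
        weights[part] = 1.0
    if ext.isalpha():
        weights[ext] = ext_weight  # the suffix is assumed to be more discriminative
    return weights


def restore_spelling(word, lexicon, cutoff=0.8):
    """Replace a word by its closest lexicon entry, if one is similar enough."""
    matches = difflib.get_close_matches(word, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

Words that survive `clean_mixed_name` but are not real words, such as "wrd", are left for the word-stock matching step to discard.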

FIG. 3 is a piece of data processed by the word segmentation technique for extremely short text based on website domain names.

Table 1: Case 1

Table 2: Case 2

Table 3: Case 3

Table 4: Case 4

Table 5: Case 5

The second embodiment also provides a text word segmentation system based on the website domain name;

a text word segmentation system based on web site domain name, comprising:

In a third embodiment, the present embodiment further provides an electronic device including a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first embodiment.

In a fourth embodiment, the present embodiment further provides a computer readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the method of the first embodiment.

The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. The text word segmentation method based on the website domain name is characterized by comprising the following steps:

data acquisition, namely acquiring a plurality of website domain names; word segmentation processing is carried out on each website domain name, wherein the website domain name is extracted and browsed to a column vector L1 according to each user as a unit; based on the Trie structure, realizing efficient word graph scanning, generating a directed acyclic graph formed by all word forming conditions of Chinese and English in sentences, adopting dynamic programming to search a maximum probability path, finding out a maximum segmentation combination based on word frequency, inputting a website domain name column vector L1 into a jieba word segmentation full-mode model, eliminating symbols, scanning out all character strings which are regarded as words and are contained in each record, and storing the character strings into a column vector L2;

carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process, wherein text formatting operation is carried out on the column vector L2, sign symbols and useless characters are thoroughly deleted, a website domain name is used as a unit for recording, and a plurality of word character strings contained in the character strings are used as sub-records and stored in a data set S1;

adopting a regular expression tagger, defining a tagset and converting it into unified symbols, using suffix information in English words to infer the part of speech of a word, matching the sub-records in data set S₁ in sequence, tagging a word with the most probable part of speech when no pattern matches, and finally recording each website domain name as a unit, with each English word and its corresponding part of speech as a sub-record, stored in data set S₂;

performing morphological reduction according to word parts of speech; storing the morphologically reduced result in a word library, specifically: according to the part of speech of each word, calling the WordNet function to perform the morphological reduction operation, so that the various inflected forms of a word are reduced to the same form, generating dictionary D₁;

the user builds a personalized word stock D₂, and the operations on word stock D₂ are completed in NLTK using the StanfordNLP toolkit; the union of the personalized word stock D₂ and the dictionary D₁ is taken to generate word stock D₃, D₃ = D₁ ∪ D₂;

matching the website domain name to be segmented with the word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name to be segmented, and matching the cleaned result with the word library again by adopting the bidirectional maximum matching algorithm, which specifically comprises: matching the website domain name to be segmented against word stock D₃ using a forward maximum matching algorithm, and recording the matching result R₁; matching the website domain name to be segmented against word stock D₃ using a reverse maximum matching algorithm, and recording the matching result R₂;

if the matching result R₁ is equal to the matching result R₂, selecting the matching result R₁ as the final word segmentation result of the website domain name to be segmented;

if the matching result R₁ is not equal to the matching result R₂, whichever of the forward maximum matching result R₁ and the reverse maximum matching result R₂ has more Chinese and single English words is taken as the final result R₃ of the bidirectional maximum matching algorithm for the website domain name to be matched.

2. The method of claim 1, wherein data collection is performed to collect a plurality of web site names; the method comprises the following specific steps:

3. The method of claim 1, wherein after the step of collecting a plurality of web site domain names, before the step of word segmentation for each web site domain name, further comprises: a data preprocessing step; the data preprocessing step comprises the following steps:

4. The method of claim 1, wherein each web site domain name is subjected to word segmentation; the method comprises the following specific steps: and performing word segmentation processing on each website domain name by utilizing a jieba word segmentation tool.

5. The text word segmentation system based on the website domain name is characterized by comprising the following components:

a data acquisition module configured to: collecting a plurality of website domain names; word segmentation processing is carried out on each website domain name, wherein the website domain name is extracted and browsed to a column vector L1 according to each user as a unit; based on the Trie structure, realizing efficient word graph scanning, generating a directed acyclic graph formed by all word forming conditions of Chinese and English in sentences, adopting dynamic programming to search a maximum probability path, finding out a maximum segmentation combination based on word frequency, inputting a website domain name column vector L1 into a jieba word segmentation full-mode model, eliminating symbols, scanning out all character strings which are regarded as words and are contained in each record, and storing the character strings into a column vector L2;

a text formatting module configured to: carrying out text formatting processing on the word subjected to word segmentation processing; analyzing word parts of speech of the word obtained after the text formatting process, wherein text formatting operation is carried out on the column vector L2, sign symbols and useless characters are thoroughly deleted, a website domain name is used as a unit for recording, and a plurality of word character strings contained in the character strings are used as sub-records and stored in a data set S1;

adopting a regular expression tagger, defining a tagset and converting it into unified symbols, using suffix information in English words to infer the part of speech of a word, matching the sub-records in data set S₁ in sequence, tagging a word with the most probable part of speech when no pattern matches, and finally recording each website domain name as a unit, with each English word and its corresponding part of speech as a sub-record, stored in data set S₂;

a lexical reduction module configured to: performing morphological reduction according to word parts of speech; storing the morphologically reduced result in a word library, specifically: according to the part of speech of each word, calling the WordNet function to perform the morphological reduction operation, so that the various inflected forms of a word are reduced to the same form, generating dictionary D₁;

a match output module configured to: matching the website domain name to be segmented with the word library by adopting a bidirectional maximum matching algorithm, and if the matching is successful, obtaining a text vectorization result; if the matching fails, cleaning the website domain name to be segmented, and matching the cleaned result with the word library again by adopting the bidirectional maximum matching algorithm, which specifically comprises: matching the website domain name to be segmented against word stock D₃ using a forward maximum matching algorithm, and recording the matching result R₁; matching the website domain name to be segmented against word stock D₃ using a reverse maximum matching algorithm, and recording the matching result R₂;

6. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of any of claims 1-4.

7. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1-4.