CN101149739A

CN101149739A - Internet-oriented meaningful string mining method and system

Info

Publication number: CN101149739A
Application number: CNA2007101207555A
Authority: CN
Inventors: 张华平; 贺敏; 黄玉兰; 龚才春
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2007-08-24
Filing date: 2007-08-24
Publication date: 2008-03-26

Abstract

The invention discloses an Internet-oriented meaningful string mining method and system. The method includes the following steps: step A, repeated string discovery; step B, filtering the character strings through context adjacency analysis; step C, analyzing and filtering the character strings through a language model. It can effectively extract meaningful strings from web pages or large-scale text data.

Description

Internet-oriented meaningful string mining method and system

Technical Field

The invention relates to the field of information retrieval and the field of operating systems, in particular to a method and a system for mining a meaningful string oriented to the Internet.

Background

Web users find it difficult to obtain useful information from the enormous volume of information on the Internet. Faced with an ocean of information that is updated day and night, users often do not know how to find the information they actually want, how to extract the key information from the mass, or how to keep up with currently important information in time. Moreover, no one can attend to everything at once while new information is constantly emerging. Strong support from natural language processing technology is therefore urgently needed to cope with the increasingly serious problem of information overload.

Extracting useful key information from massive network information is a difficult problem and an urgent need in the era of the network information explosion. A solution also has broad application prospects. For individuals, such a system makes it easier to find and organize currently important information and can serve as an entry point for handling mass information. For enterprises, it allows the latest developments in their fields, the direction of strategic partners and the latest actions of competitors to be followed in time, providing information support for strategic decisions. For the country, it can be used to learn about current important social events, popular trends, the direction of public opinion and the like, serving as an information window on social conditions and supporting related decisions.

Against this background, extracting useful information from web text has become increasingly important and is a direction worthy of intensive research.

Disclosure of Invention

The invention aims to provide an Internet-oriented method and system for mining meaningful strings, which can effectively extract the meaningful strings in web pages or large-scale text data.

The invention provides an Internet-oriented meaningful string mining method, which comprises the following steps:

step A, repeated string discovery;

step B, filtering the character string through context adjacency analysis;

step C, analyzing and filtering the character strings through a language model.

The step A comprises the following steps:

step A1, processing the webpage linguistic data to obtain formatted plain text files, classifying the text files, recording character strings which repeatedly appear in the text and the occurrence frequency of the character strings, and filtering the character strings of which the occurrence frequency is less than a certain threshold value.

The step B comprises the following steps:

step B1, calculating context adjacency characteristic quantities of each repeated string, judging whether the characteristic quantities reach a set threshold value, and filtering out text strings which do not reach the threshold value according to the judgment result.

The step C comprises the following steps:

and step C1, scanning adjacent character pairs of the text string character by character, searching the coupling degree of the adjacent character pairs, filtering the text string according to the coupling degree, and further filtering according to the position word forming probability of the text string to obtain the meaningful string.

The step A1 comprises the following steps:

step A11, processing the webpage corpus to obtain a formatted plain text file, and then converting the Chinese characters into corresponding IDs;

step A12, establishing indexes for the processed ID sequences, starting to expand from the information of each single character index to obtain all repeated strings, continuously expanding to obtain long strings after the newly generated repeated strings are written into a file, repeatedly iterating until interval symbols appear or the length reaches a specified threshold value, and stopping expansion;

and step A13, recording the adjacent word information and the document information of each string, and independently storing each type of information in a file.

The step B1 comprises the following steps:

step B11, calculating context adjacent characteristic quantities of each repeated string, and judging whether the characteristic quantities reach a set threshold value;

step B12, if the threshold value is reached, the step C is carried out;

and step B13, if the characteristic quantity does not reach the threshold value, filtering out the repeated string.

The step C1 comprises the following steps:

step C11, labeling a part of the training corpus to generate a coupling degree dictionary of adjacent words and a word position word forming probability dictionary;

step C12, scanning adjacent word pairs word by word, and searching the coupling degree of the adjacent word pairs;

step C13, when the coupling degree of an adjacent character pair is smaller than a set threshold value, judging that the pair does not form part of a word, and filtering out the string as a garbage string;

step C14, for the strings not filtered out in the adjacent character pair scan, searching the position word forming probability of single characters, and judging whether the head and the tail of the string contain common functional characters;

step C15, if the head or the tail is a functional character, filtering out the string;

and step C16, determining the strings which are not filtered out as meaningful strings.

The invention also provides an Internet-oriented meaningful string mining system, which comprises:

the repeated string finding module is used for processing the webpage linguistic data to obtain formatted plain text files, classifying the text files, recording character strings which repeatedly appear in the text and the appearance frequency of the character strings, and filtering out the character strings of which the appearance frequency is less than a certain threshold value;

the context adjacency analysis module is used for calculating the context adjacency characteristic quantity of each repeated string, judging whether the characteristic quantity reaches a set threshold value or not, and filtering out the text strings which do not reach the threshold value according to the judgment result;

and the statistical language model analysis module is used for scanning adjacent character pairs of the text string character by character, searching the coupling degree of the adjacent character pairs, and filtering the text string according to the coupling degree to obtain the meaningful string.

And the statistical language model analysis module is also used for further filtering the character strings to obtain the meaningful strings according to the position word forming probability of the text strings after scanning the adjacent character pairs.

The context adjacency characteristic quantity is one or more of adjacency set, adjacency type, adjacency entropy, adjacency pair set, adjacency pair type and adjacency pair entropy.

The repeated character strings in the recorded text and the occurrence frequency thereof are obtained by finding the repeated strings through a suffix tree algorithm, a sequitur algorithm, an n-element incremental distribution algorithm or an improved n-element incremental distribution algorithm.

The beneficial effects of the invention are as follows. The method and system mine meaningful strings from the text to be recognized in three stages: repeated string discovery, context adjacency analysis and statistical language model analysis. Word segmentation is performed during preprocessing, which further reduces the time complexity of repeated string discovery and greatly improves the accuracy and recall of the extraction results. The space complexity of repeated string discovery is O(N), where N is the size of the corpus, so plain text data comparable in size to main memory can be analyzed, a processing scale about 10 times larger than that of the traditional suffix tree method. Different characteristic quantities can be adopted in adjacency analysis according to application requirements; adjacency entropy tends to find strings that are distributed more uniformly over varied usage contexts, and such strings are widely distributed and general. Finally, the tightness with which two characters combine is measured by the two-character coupling degree, and combining this with the stop-character judgment is more flexible and intelligent.

Drawings

FIG. 1 is a process diagram of a mining method of Internet-oriented meaningful strings according to the present invention;

FIG. 2 is a flow diagram of the process of FIG. 1 for extracting meaningful strings from repetitive strings;

FIG. 3 is a flow chart of a string head and string tail judgment process of the Internet-oriented meaningful string according to the present invention;

FIG. 4 is a schematic diagram of a mining system for Internet-oriented meaningful strings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly understood, a method and a system for mining meaningful strings oriented to the internet according to the present invention are described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention defines the character string with useful information in the internet and applied in various environments as the meaningful string. The invention provides a universal method and a universal system for mining meaningful strings, which are analyzed from the aspects of statistics, structure, pragmatics and semantics.

The invention divides the meaningful string mining method process into three stages of repeated string discovery, context adjacency analysis and language model analysis, and the whole process is shown as figure 1 and comprises the following steps:

step S100, in the repeated string discovery stage, processing the webpage corpora to obtain formatted plain text files, classifying the text files, recording repeated character strings in the text and the occurrence frequency thereof, and filtering out character strings with the occurrence frequency less than a certain threshold value.

And step S200, in the context adjacency analysis stage, calculating context adjacency characteristic quantities of each repeated string, judging whether the characteristic quantities reach a set threshold value, and filtering out text strings which do not reach the threshold value according to the judgment result.

Step S300, in the stage of analyzing the statistical language model, the text string is scanned with adjacent character pairs word by word, the coupling degree of the adjacent character pairs is searched, the text string is filtered according to the coupling degree, and then the text string is further filtered according to the position word forming probability of the text string to obtain a meaningful string.
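The three stages S100 to S300 can be pictured as a chain of filters. The following is only a minimal sketch under stated assumptions: `repeated_strings`, `mine`, `adjacency_ok` and `language_model_ok` are hypothetical names, the two later stages are passed in as plain predicates, and stage 1 is a naive n-gram count rather than the improved incremental algorithm of the embodiment.

```python
from collections import Counter

def repeated_strings(tokens, max_len=4, min_freq=2):
    """Stage 1 (naive stand-in): count every n-gram up to max_len and
    keep those whose frequency reaches min_freq."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_freq}

def mine(tokens, adjacency_ok, language_model_ok, max_len=4, min_freq=2):
    """Chain the three stages; the two predicates are assumed to wrap the
    thresholds of stage 2 (adjacency) and stage 3 (language model)."""
    candidates = repeated_strings(tokens, max_len, min_freq)
    after_adjacency = [s for s in candidates if adjacency_ok(s)]
    return [s for s in after_adjacency if language_model_ok(s)]

# Toy usage with placeholder predicates (not the patent's criteria).
tokens = list("abcabcabd")
result = mine(tokens, lambda s: len(s) >= 2, lambda s: s[0] == "a")
```

Passing the later stages as predicates keeps the sketch independent of any particular threshold or feature choice.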

The language model analysis mainly uses two criteria. First, the invention calculates how tightly each pair of adjacent characters in a character string is bound, and if the tightness is less than a certain threshold, the character string is deleted.

Secondly, the invention also tests the probability of a character appearing at its current position (position refers to the beginning or end of a word) and deletes the string if the probability is below a certain threshold.

In step S100, a process of processing the corpus of the web page to obtain formatted plain text files, classifying the text files, recording repeated character strings in the text and occurrence frequency thereof, and filtering out character strings whose occurrence frequency is less than a certain threshold value will be described in detail below.

The web page corpus is processed to obtain formatted plain text files, which are then preprocessed: the text is segmented into words and the Chinese characters are converted into corresponding IDs. Word segmentation uses the maximum matching method because of its speed. In the experiments, the segmentation dictionary contains more than 60,000 core words and no unknown-word recognition is performed during segmentation; even so, results with maximum matching segmentation are clearly better than results without segmentation.
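As an illustration of this preprocessing, here is a minimal sketch of forward maximum matching followed by ID conversion. The direction (forward), the `max_word_len` limit, the single-character fallback for unknown text, and the function names `max_match` and `to_ids` are assumptions made for the example; the text only states that maximum matching is used without unknown-word recognition.

```python
def max_match(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in dictionary:
                words.append(piece)
                i += n
                break
    return words

def to_ids(words, vocab=None):
    """Map each distinct token to an integer ID in order of first appearance."""
    vocab = {} if vocab is None else vocab
    ids = [vocab.setdefault(w, len(vocab)) for w in words]
    return ids, vocab

words = max_match("禽流感病毒", {"禽流感", "病毒"})
ids, vocab = to_ids(words)
```

Working on integer IDs rather than raw text keeps later comparisons and index lookups cheap.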

And establishing indexes for the processed ID sequences, starting to expand from the information of each single character index to obtain all repeated strings, continuously expanding to obtain long strings after the newly generated repeated strings are written into a file, and repeating iteration until interval symbols appear or the length reaches a specified threshold value, and stopping expansion. Meanwhile, adjacent word information and document information of each string are recorded, and each type of information is independently stored in one file.

At present, the mature repeat string discovery algorithm applied to the Chinese text comprises a suffix tree algorithm, a sequitur algorithm, an n-element incremental distribution algorithm and the like. The purpose of counting the repeated strings can be achieved by applying any algorithm. The embodiment of the invention adopts an improved n-element incremental distribution algorithm. The specific method is as follows.

The time complexity of the method is reduced compared with the time complexity of an n-element incremental algorithm, because the index records the address information of each string, the next extended character is directly positioned according to the address information and the string length during extension, the range of the statistical frequency information is only the current extended string, and the whole corpus does not need to be traversed for global comparison and statistics.
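The index-based expansion can be sketched as follows. This is a simplified illustration rather than the embodiment's implementation: it keeps everything in memory instead of writing each level to a file, and `find_repeats`, `min_freq`, `max_len` and `separators` are hypothetical names. It does show the key point, namely that each extension only looks at the recorded start positions of the current string, not at the whole corpus.

```python
from collections import defaultdict

def find_repeats(ids, min_freq=2, max_len=5, separators=frozenset()):
    """Grow repeated strings one token at a time from a position index."""
    # Level 1: index every single (non-separator) token by its positions.
    level = defaultdict(list)
    for pos, tok in enumerate(ids):
        if tok not in separators:
            level[(tok,)].append(pos)
    level = {s: p for s, p in level.items() if len(p) >= min_freq}

    repeats = {}
    length = 1
    while level:
        repeats.update(level)
        if length >= max_len:          # length threshold reached
            break
        nxt = defaultdict(list)
        for s, positions in level.items():
            for p in positions:
                q = p + length         # address of the next token
                if q < len(ids) and ids[q] not in separators:
                    nxt[s + (ids[q],)].append(p)
        # Keep only extensions that are themselves repeated.
        level = {s: p for s, p in nxt.items() if len(p) >= min_freq}
        length += 1
    return repeats                     # string -> list of start positions

repeats = find_repeats(list("abcabcabd"))
```

Because the start positions are kept, adjacent-token and document information for each string can be read off directly from `ids[p - 1]` and `ids[p + len(s)]` without another pass over the corpus.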

The adjacent word information and document information of each string are recorded during discovery because the subsequent meaningful string analysis needs the document information and adjacency pair information of the strings. If these statistics were gathered only after the repeated strings had been found, the whole corpus would have to be traversed many times, increasing the time cost. Since the address information of each string is already known when the repeated string is found, this information can be obtained with almost no increase in time complexity.

Experiments verify that segmenting the text into words before searching for repeated strings improves the meaningful string mining results.

The following describes in detail the process of calculating the context adjacent feature quantity of each repeated string in step S200, determining whether the feature quantities reach the set threshold, and filtering out text strings that do not reach the threshold according to the determination result.

In order to describe the flexibility degree of the context environment of the character string S, the invention provides a series of concepts of context adjacent characteristic quantities, namely adjacent set, adjacent type, adjacent entropy, and adjacent pair set, adjacent pair type and adjacent pair entropy.

Adjacency set: divided into the left adjacency set L_NB and the right adjacency set R_NB, which are respectively the sets of characters or words adjacent to the left or right of the character string S in the real text.

Adjacency type: divided into the left adjacency type V_L and the right adjacency type V_R, which are the numbers of distinct characters or words in the left adjacency set and the right adjacency set respectively; they reflect the variety of contexts in which the character string S occurs.

Adjacency entropy: the information entropy of an adjacency set of the string S; the string S has a left adjacency entropy and a right adjacency entropy.

Accordingly, the concepts of adjacency pair set, adjacency pair type and adjacency pair entropy are also proposed as context adjacency characteristic quantities.

Adjacency pair set: the left and right adjacent elements of each occurrence of the string S form an adjacency pair <L_i, R_i>, and all adjacency pairs of the string S form the adjacency pair set PNB.

Adjacency pair type: the number of elements in the adjacency pair set PNB is referred to as the adjacency pair type VP.

Adjacency pair entropy: the information entropy of the adjacency pair set.

These contextual adjacency characteristics can be used to measure a string context.

As shown in fig. 2, the context adjacency analysis mainly calculates the context adjacency characteristic quantities of each repeated string, including the adjacency set, the adjacency type, the adjacency entropy, the adjacency pair set, the adjacency pair type, the adjacency pair entropy, and the like, and determines whether these characteristic quantities reach a set threshold, if so, the string is more flexible in language use, and enters the statistical language model analysis stage.

The context adjacency characteristic quantity of the repeated string is calculated, and comprises an adjacency set and an adjacency type, and an adjacency pair set and an adjacency pair type, and is obtained by counting repeated string corpora.

The entropy (including adjacency entropy and adjacency-pair entropy) is obtained by calculation.

The formula for calculating entropy is as follows:

Suppose the i-th element of an adjacency set (e.g., the left adjacency set L_NB) occurs n_i times in the real text, and the sum of these frequencies is N. The entropy is then calculated as:

$$H = -\sum_{i} \frac{n_i}{N} \log \frac{n_i}{N}$$

For example, the new word "avian influenza" has been used frequently since 2000 and appears in the following sentences:

Zhong Nanshan discloses that avian influenza virus has not significantly mutated.

The situation of preventing and controlling avian influenza in Guangdong is gradually reduced.

There were 7 cases of avian influenza infection.

A suspected case of avian influenza is found.

5 prohibitions are issued to prevent and control avian influenza.

If the word is used as the granularity of adjacency analysis, the context adjacency characteristic quantities of the character string "avian influenza" are calculated as follows:

Left adjacency set: L_NB = {reveal, prevention and control, infection, one block}

Right adjacency set: R_NB = {virus, situation, event, suspected, EOS}

Left adjacency type: V_L = 4

Right adjacency type: V_R = 5

Left adjacency entropy: $H_L = -\left(\frac{2}{5}\log\frac{2}{5} + 3\cdot\frac{1}{5}\log\frac{1}{5}\right)$, since "prevention and control" occurs twice and each of the other three elements occurs once

Right adjacency entropy: $H_R = -5\cdot\frac{1}{5}\log\frac{1}{5} = \log 5$, since each of the five elements occurs once

Adjacency pair set: PNB = {<reveal, virus>, <prevention and control, situation>, <infection, event>, <one block, suspected>, <prevention and control, EOS>}

Adjacency pair type: VP = 5

Adjacency pair entropy: $H_P = \log 5$, since each of the five pairs occurs once
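The characteristic quantities of the "avian influenza" example can be computed mechanically from the list of (left neighbour, right neighbour) occurrences. The sketch below is illustrative only: the function names are hypothetical, the natural logarithm is an assumption (the text does not state a base), and the English glosses stand in for the original Chinese words.

```python
import math
from collections import Counter

def entropy(counter):
    """Information entropy -sum(n_i/N * log(n_i/N)) of a frequency table."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def adjacency_features(occurrences):
    """occurrences: one (left, right) neighbour pair per occurrence of S."""
    left = Counter(l for l, _ in occurrences)
    right = Counter(r for _, r in occurrences)
    pairs = Counter(occurrences)
    return {
        "V_L": len(left), "V_R": len(right), "V_P": len(pairs),
        "H_L": entropy(left), "H_R": entropy(right), "H_P": entropy(pairs),
    }

# Neighbours of "avian influenza" in the five example sentences.
occurrences = [
    ("reveal", "virus"),
    ("prevention and control", "situation"),
    ("infection", "event"),
    ("one block", "suspected"),
    ("prevention and control", "EOS"),
]
features = adjacency_features(occurrences)
# features["V_L"] == 4, features["V_R"] == 5, features["V_P"] == 5
```

A string would then be kept or discarded by comparing whichever of these quantities the application selects against its trained threshold.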

If the characteristic quantity does not reach the threshold value, the string is a garbage string and is filtered out. The threshold is obtained by training on the corpus.

A corpus consists of linguistic material that has actually appeared in real use of a language; it is a basic resource that carries linguistic knowledge, with the computer as its carrier. A raw corpus must be processed (analyzed and annotated) before it becomes a useful resource.

Corpus training methods are prior art, for example training on a corpus with a Hidden Markov Model (HMM). They are not the inventive point of the present invention and are therefore not described in detail here.

Experiments verify that using the word as the unit of adjacent elements gives higher accuracy than using the single character.

The following describes in detail the process of step S300, scanning the text string word by word for adjacent word pairs, finding the degree of coupling between adjacent word pairs, filtering the text string according to the degree of coupling, and then further filtering the text string according to the word formation probability of the text string to obtain a meaningful string.

To describe how tightly two consecutive characters within a word are bound, the invention defines the coupling degree of an adjacent character pair. All consecutive character pairs in the segmented training corpus are scanned, and two totals are counted for each pair: the total number of occurrences of the pair, and the number of occurrences in which the pair is a substring of a word. The ratio of the latter to the former is called the coupling degree of the adjacent character pair and is denoted Coup. For example, the two-character pair "cross eyes" occurs 16 times in the statistics: 12 times inside words such as "cross eyes forgetting" and "one-to-one cross eyes", and 4 times in contexts such as "over-current", so Coup(<cross eyes>) = 12/(12 + 4) = 0.75.

A higher Coup value indicates that the character pair is more tightly bound; a lower value indicates that the pair is less likely to occur inside a word. The coupling degree is obtained from the training corpus.
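A coupling degree dictionary of this kind could be built from a segmented corpus roughly as follows. This is a sketch under assumptions: `coupling_dictionary` is a hypothetical name, each sentence is given as a list of words, and a pair that spans a word boundary is counted as an occurrence outside a word.

```python
from collections import Counter

def coupling_dictionary(segmented_sentences):
    """Coup(pair) = occurrences inside a word / all occurrences."""
    total = Counter()
    inside = Counter()
    for words in segmented_sentences:
        for w in words:
            # Pairs that lie entirely within one word.
            for a, b in zip(w, w[1:]):
                inside[a + b] += 1
                total[a + b] += 1
        # Pairs that straddle the boundary between two adjacent words.
        for w1, w2 in zip(words, words[1:]):
            total[w1[-1] + w2[0]] += 1
    return {pair: inside[pair] / total[pair] for pair in total}

# "xy" occurs three times inside a word and once across a boundary.
corpus = [["xy", "z"], ["axy"], ["xyb"], ["x", "y"]]
coup = coupling_dictionary(corpus)
# coup["xy"] == 0.75, matching the 12/(12 + 4) pattern of the example
```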

In addition, the invention introduces the position word forming probability, which expresses the probability that a given Chinese character appears at a given position (the beginning or end of a word, etc.). For example, if the character "A" appears at the end of a string, the string can basically be regarded as a garbage string. The position word forming probability is also obtained from the training corpus.
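The text does not give an exact formula for the position word forming probability. One plausible reading, used in the sketch below purely as an assumption, is the fraction of a character's occurrences in the segmented corpus that fall at the head (or tail) of a word, with a single-character word counted as both head and tail. The name `position_probabilities` is hypothetical.

```python
from collections import Counter

def position_probabilities(segmented_sentences):
    """Estimate P(head | char) and P(tail | char) from segmented text.
    Words are assumed to be non-empty strings."""
    total, head, tail = Counter(), Counter(), Counter()
    for words in segmented_sentences:
        for w in words:
            total.update(w)      # every character occurrence
            head[w[0]] += 1      # first character of the word
            tail[w[-1]] += 1     # last character of the word
    return {
        ch: {"head": head[ch] / total[ch], "tail": tail[ch] / total[ch]}
        for ch in total
    }

probs = position_probabilities([["ab", "ba", "a"]])
# "a" occurs 3 times: twice as a word head and twice as a word tail.
```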

Before the language model analysis, a part of the training corpus should be labeled manually to generate a coupling degree dictionary of adjacent characters (such as a two-character coupling degree dictionary) and a dictionary of single-character position word forming probabilities.

As shown in fig. 3, the adjacent character pairs of the string are first scanned character by character, and the coupling degree of each pair, i.e. the two-character coupling degree, is looked up. When the coupling degree of a pair is smaller than the set threshold, the pair does not form part of any word, and the string should be deleted as a garbage string.

Strings that are not deleted during the pair scan go on to the next filtering step, in which the position word forming probability of single characters is looked up. First, the position word forming probability of the first character is looked up; if it is lower than a certain threshold, the character should not appear at the head of a word, and the string is filtered out.

Next, the position word forming probability of the last character of each remaining string is looked up. Together, these two checks determine whether the head or tail of the string contains a common functional character, and the string is filtered out if it does. That is, if the position word forming probability of the last character is lower than the set threshold, the character should not appear at the end of a word, and the string is filtered out.

Preferably, the first character pair of the string is also extracted and its two-character coupling degree is checked. If the coupling degree is greater than a certain threshold, the pair is considered tightly bound and to form the head of a word, and the position word forming probability of the first character is no longer checked. This avoids the overly absolute judgments of a garbage-head dictionary. For example, a string might be filtered out if only the position word forming probability of its first character were checked; but if the coupling degree of its first pair is checked first and found to be high, the string is retained.
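The decision flow of fig. 3, including the preferred first-pair check, can be sketched as a single predicate. Everything concrete here is an assumption for illustration: the function name `is_meaningful`, the default threshold values (in the invention all thresholds are trained from the corpus), and the choice to treat pairs or characters missing from a dictionary as having value 0.

```python
def is_meaningful(s, coup, head_prob, tail_prob,
                  coup_min=0.3, head_min=0.1, tail_min=0.1,
                  head_pair_min=0.8):
    """Return True if the non-empty string s survives the language model stage."""
    pairs = [s[i:i + 2] for i in range(len(s) - 1)]

    # 1. Every adjacent pair must be coupled tightly enough.
    if any(coup.get(p, 0.0) < coup_min for p in pairs):
        return False

    # 2. Head check, skipped when the first pair is very tightly bound.
    first_pair_tight = bool(pairs) and coup.get(pairs[0], 0.0) > head_pair_min
    if not first_pair_tight and head_prob.get(s[0], 0.0) < head_min:
        return False

    # 3. Tail check on the last character.
    if tail_prob.get(s[-1], 0.0) < tail_min:
        return False
    return True

# Toy dictionaries (hypothetical values).
coup = {"ab": 0.9, "bc": 0.5, "cd": 0.1}
head_prob = {"a": 0.05, "b": 0.5}
tail_prob = {"c": 0.6, "b": 0.02}
kept = [s for s in ["abc", "bcd", "bc", "ab"]
        if is_meaningful(s, coup, head_prob, tail_prob)]
```

In this toy data "abc" is kept even though "a" is a poor word head, because its first pair "ab" is tightly coupled; this is the case the preferred first-pair check is meant to rescue.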

Through this step, character strings that have not been filtered out are determined to be meaningful strings. These meaningful strings are output and the process ends.

Wherein, all the threshold values in the process are obtained by training the training corpus.

In the experiments, original web pages from 9 domestic news websites, including Sina and NetEase, were used as test data. The pages were collected between April 19, 2006 and June 14, 2006; a total of 310,000 web pages, 12 GB in size, served as the test data, and the text extracted from them was 470 MB. On these news web pages, the meaningful string mining method achieves an accuracy of 70.55% in extracting meaningful strings.

Corresponding to the mining method of internet-oriented meaningful strings, the present invention further provides an internet-oriented meaningful string mining system 400, as shown in fig. 4, including:

the repeated string finding module 410 is configured to process the web page corpus to obtain a formatted plain text file, classify the text file, record repeated character strings in the text and occurrence frequency thereof, and filter out character strings with occurrence frequency less than a certain threshold.

The context adjacency analyzing module 420 is configured to calculate context adjacency feature quantities of each repeated string, determine whether the feature quantities reach a set threshold, and filter out text strings that do not reach the threshold according to the determination result.

The statistical language model analysis module 430 is configured to scan word-by-word adjacent word pairs of the text string, find a degree of coupling between the adjacent word pairs, filter the text string according to the degree of coupling, and then further filter the text string according to the position word forming probability of the text string to obtain a meaningful string.

The mining system 400 for internet-oriented meaningful strings of the present invention operates in the same process as the mining method for internet-oriented meaningful strings, and therefore, in the embodiment of the present invention, the system will not be described repeatedly.

While particular embodiments of the present invention have been described and illustrated, such embodiments should be considered as illustrative only and not as limiting the invention, which is to be construed in accordance with the accompanying claims.

Claims

1. An Internet-oriented meaningful string mining method is characterized by comprising the following steps:

step A, repeated string discovery;

step B, filtering the character string through context adjacency analysis;

step C, analyzing and filtering the character strings through a language model.

2. The method for mining internet-oriented meaningful strings according to claim 1, wherein the step A comprises the following steps:

3. The method for mining internet-oriented meaningful strings according to claim 2, wherein the step B comprises the following steps:

4. The Internet-oriented meaningful string mining method according to claim 3, wherein the step C comprises the steps of:

5. The method for mining internet-oriented meaningful strings according to claim 2, wherein the step A1 comprises the following steps:

step A11, processing the webpage corpus to obtain a formatted plain text file, and then converting the Chinese characters into corresponding IDs;

6. The method for mining Internet-oriented meaningful strings according to claim 3, wherein the step B1 comprises the following steps:

step B12, if the threshold value is reached, the step C is carried out;

7. The Internet-oriented meaningful string mining method according to claim 4, wherein the step C1 comprises the following steps:

step C16, determining the strings that have not been filtered out as meaningful strings.

8. The Internet-oriented meaningful string mining method according to claim 4, wherein the step C1 comprises the following steps:

step C11', labeling a part of the training corpus to generate a coupling degree dictionary of adjacent words and a word forming probability dictionary of single word positions;

and step C12', taking out the first character pair in the character string, judging the coupling degree of the adjacent characters, if the coupling degree is more than a threshold value, considering that the character pair is tightly combined to form the head of the character, and not judging the word forming probability of the single character position of the first character.

9. An internet-oriented meaningful string mining system, comprising:

and the statistical language model analysis module is used for scanning adjacent character pairs word by word of the text string, searching the coupling degree of the adjacent character pairs, and filtering the text string according to the coupling degree to obtain the meaningful string.

10. The system of claim 9, wherein the statistical language model analysis module is further configured to filter the strings to obtain the meaningful strings according to the position word-forming probability of the text strings after scanning the adjacent word pairs.

11. The system for mining internet-oriented meaningful strings according to claim 9 or 10, wherein the contextual adjacency feature quantity is one or more of an adjacency set, an adjacency category, an adjacency entropy, an adjacency pair set, an adjacency pair category, and an adjacency pair entropy.

12. The system for mining meaningful internet-oriented strings as claimed in claim 9 or 10, wherein the repeated strings and the occurrence frequency thereof in the recorded text are obtained by repeated string discovery through a suffix tree algorithm, a sequitur algorithm, an n-ary incremental distribution algorithm or a modified n-ary incremental distribution algorithm.