CN101788989A - Vocabulary information processing method and system - Google Patents

Publication number
CN101788989A
Authority
CN
China
Prior art keywords
lexical
data
information
semantic
measured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910077558A
Other languages
Chinese (zh)
Inventor
蔡亮华
庞然
胡新宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN200910077558A
Publication of CN101788989A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a vocabulary information processing method and system. The method comprises the following steps: obtaining vocabulary information to be evaluated from the Internet and generating standardized data from it; extracting part of the standardized data according to set conditions to form extracted data; performing word-segmentation matching on the extracted data to form word-segmentation data; clustering the word-segmentation data and storing the clustered word-segmentation data by category; performing lexical semantic analysis on each category of stored word-segmentation data, calculating the proportion of each kind of lexical semantic information, and calculating a lexical semantic parameter of the word-segmentation data from those proportions; and performing a comprehensive measurement on the lexical semantic parameters to obtain an evaluation result. By clustering the vocabulary information to be evaluated and classifying and evaluating it objectively, the method and system achieve a comprehensive, all-round evaluation of a specific topic and avoid subjective, one-sided evaluation of that topic by Internet users.

Description

Vocabulary information processing method and system
Technical field
The present invention relates to network technology, and in particular to an Internet-based vocabulary information processing method and system.
Background technology
With the rapid development of network technology, the Internet has become a carrier of vast amounts of information, and extracting and using this information effectively has become a great challenge. As a tool that helps people retrieve information, the search engine has become the entry point and guide for users accessing the Internet. The web crawler, a program that fetches web pages automatically, is an important component of a search engine and downloads web pages from the Internet for it.
A traditional web crawler starts from the Uniform Resource Locators (hereinafter: URLs) of one or several initial pages and obtains the URLs on those pages. While fetching web pages, it continually extracts new URLs from the current page and puts them into a URL queue until a preset stop condition is satisfied. All pages fetched by the crawler are stored, then analyzed, filtered, and indexed so that users can query and retrieve the relevant information.
In the prior art, a search engine can only provide users with the web page information fetched by the web crawler; it cannot fetch only the information the user wants about a specific topic. The user still has to screen the fetched information, and this screening process is highly subjective. In addition, when a user retrieves information about a specific topic (such as a particular event or a particular person) through a search engine, the user obtains only simple evaluation results such as the click frequency of a web page or the media exposure of the topic. Such results show only how much attention the topic attracts on the Internet; the user cannot obtain an objective, all-round evaluation of the information related to the topic, and therefore forms a subjective, one-sided evaluation of it.
Summary of the invention
The object of the present invention is to provide a vocabulary information processing method and system that evaluate information published on the Internet objectively and comprehensively, so that Internet users avoid subjective, one-sided evaluations of a specific topic.
To achieve the above object, the invention provides a vocabulary information processing method comprising the following steps:
Obtaining vocabulary information to be evaluated from the Internet and generating standardized data from it, the standardized data being stored in a two-dimensional data table format;
Extracting part of the standardized data according to set conditions to form extracted data;
Performing word-segmentation matching on the extracted data to form word-segmentation data, clustering the word-segmentation data, and storing the clustered word-segmentation data by category;
Performing lexical semantic analysis on each category of stored word-segmentation data, calculating the proportion of each kind of lexical semantic information, and calculating the lexical semantic parameter of the word-segmentation data from those proportions;
Performing a comprehensive measurement on the lexical semantic parameters to obtain an evaluation result.
Obtaining the vocabulary information to be evaluated from the Internet and generating standardized data from it specifically comprises:
Retrieving automatically according to the lexical semantics of the vocabulary to be evaluated, and obtaining the vocabulary information to be evaluated from the Internet;
Downloading the vocabulary information to be evaluated to a local database;
Generating the standardized data from the vocabulary information downloaded to the local database.
Performing word-segmentation matching on the extracted data to form word-segmentation data specifically comprises:
Looking up the character strings corresponding to the extracted data in a local dictionary, comparing them with the character strings in the local dictionary, and generating the word-segmentation data from the extracted data.
The word-segmentation data are clustered using the K-Means clustering method or the Kohonen neural network clustering method.
Performing lexical semantic analysis on the stored word-segmentation data specifically comprises:
Interpreting, according to the semantics of the character strings stored in a preset semantic database, the lexical semantics of the character strings corresponding to the categorized word-segmentation data, obtaining the lexical semantic parameter, and calculating the proportion of each kind of lexical semantic information.
The vocabulary information processing method of the invention clusters the vocabulary information obtained for a specific topic, classifies it objectively, calculates the proportion of each kind of lexical semantic information within each category, and calculates the lexical semantic parameter of each category from those proportions. By evaluating the lexical semantic parameters of all categories, it obtains an objective, comprehensive evaluation of the topic and avoids subjective, one-sided evaluation of the topic by Internet users.
To achieve the above object, the invention also provides a vocabulary information processing system comprising:
An acquisition module, configured to obtain vocabulary information to be evaluated from the Internet and generate standardized data from it, the standardized data being stored in a two-dimensional data table format;
An extraction module, configured to extract part of the standardized data of the acquisition module according to set conditions to form extracted data;
A word-frequency clustering module, configured to perform word-segmentation matching on the extracted data of the extraction module to form word-segmentation data, cluster the word-segmentation data, and store the clustered word-segmentation data by category;
A lexical semantic parsing module, configured to perform lexical semantic analysis on the word-segmentation data stored by category in the word-frequency clustering module, calculate the proportion of each kind of lexical semantic information, and calculate the lexical semantic parameter of the word-segmentation data from those proportions;
A semantic measurement module, configured to measure the lexical semantic parameters of the lexical semantic parsing module and obtain an evaluation result.
The acquisition module comprises:
An automatic retrieval unit, configured to retrieve automatically and obtain the vocabulary information to be evaluated from the Internet;
A local database, configured to store the vocabulary information obtained by the automatic retrieval unit and generate standardized data from it.
The word-frequency clustering module comprises:
A word-segmentation unit, configured to look up the character strings corresponding to the extracted data in a local dictionary, compare them with the character strings in the local dictionary, and generate word-segmentation data from the extracted data;
A clustering unit, configured to cluster the word-segmentation data;
A storage unit, configured to store the word-segmentation data clustered by the clustering unit.
The storage unit comprises at least two storage sub-units.
The lexical semantic parsing module comprises:
A semantic parsing unit, provided with a semantic database and configured to perform lexical semantic analysis on the word-segmentation data stored by category in the word-frequency clustering module;
A semantic measurement unit, configured to calculate the proportion of each kind of lexical semantic information and calculate the lexical semantics of the word-segmentation data from those proportions;
A recording unit, configured to record the word-segmentation data not recorded in the word-frequency clustering module and feed the recorded word-segmentation data back to the semantic parsing unit.
In the vocabulary information processing system provided by the invention, the word-frequency clustering module clusters the vocabulary information obtained, and the lexical semantic parsing module performs lexical semantic analysis on the word-segmentation data stored by category in the word-frequency clustering module, calculates the proportion of each kind of lexical semantic information, and calculates the lexical semantic parameter of the word-segmentation data from those proportions. Classifying the word-segmentation data allows the category of the vocabulary information to be judged objectively, so that the vocabulary information is evaluated objectively and comprehensively.
Description of drawings
Fig. 1 is a schematic flowchart of embodiment one of the vocabulary information processing method of the invention;
Fig. 2 is a schematic flowchart of embodiment two of the vocabulary information processing method of the invention;
Fig. 3 is a schematic structural diagram of embodiment one of the vocabulary information processing system of the invention;
Fig. 4 is a schematic structural diagram of embodiment two of the vocabulary information processing system of the invention.
Detailed description of the embodiments
The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a schematic flowchart of embodiment one of the vocabulary information processing method of the invention. As shown in Fig. 1, this embodiment comprises the following steps:
Step 101, obtain vocabulary information to be evaluated from the Internet and generate standardized data from it; the standardized data are stored in a two-dimensional data table format.
In step 101, the vocabulary information to be evaluated, which is information about a specific topic, can be obtained from the Internet by a web crawler. Specifically, the crawler starts from a URL, obtains the initial page, and continually extracts new URLs from the pages it fetches, thereby obtaining a large amount of rich vocabulary information from the Internet. The URL may point to an ordinary web page or to a portal site. For an ordinary web page, the crawler can obtain the vocabulary information directly from the page content; for a portal site, the crawler can obtain keywords of the vocabulary information from the news headlines on the portal's home page. The standardized data are stored in a two-dimensional data table format; the structure of this table is shown in Table 1.
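The crawling loop described for step 101 can be sketched roughly as follows. This is only an illustrative sketch: the in-memory `FAKE_WEB` dictionary, the `fetch_fake` helper, the example URLs, and the `max_pages` stop condition are all assumptions made for the example and do not come from the patent text, which does not specify how pages are fetched or parsed.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would download and parse pages instead.
FAKE_WEB = {
    "http://site1.example/": ("Mrs Zhou attends charity sale",
                              ["http://site2.example/"]),
    "http://site2.example/": ("Mrs Zhou releases new song",
                              ["http://site1.example/", "http://site3.example/"]),
    "http://site3.example/": ("Weather report", []),
}


def fetch_fake(url):
    """Stand-in for a page download: returns (text, links) for a URL."""
    return FAKE_WEB.get(url, ("", []))


def crawl(seed_urls, fetch, max_pages=100):
    """Start from the seed URLs, keep queueing newly found URLs,
    and stop when the queue is empty or max_pages pages are stored."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        pages[url] = text
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return pages


pages = crawl(["http://site1.example/"], fetch_fake)
```

The `seen` set prevents the crawler from looping between pages that link to each other, and `max_pages` plays the role of the preset stop condition.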
Table 1
Specific topic    Location of occurrence    Frequency    Time
Mrs Zhou          Website 1                 120          December 18, 2008
Mrs Zhou          Website 2                 34           December 18, 2008
...               ...                       ...          ...
Mrs Zhou          Website N                 3482         December 10, 2008
In Table 1, the first dimension (horizontal) of the two-dimensional data table holds the information related to the specific topic: the location of occurrence records where the specific topic "Mrs Zhou" appears among the URLs (Website 1, Website 2, ..., Website N; N websites in total), the frequency records how often the specific topic appears in the web page, and the time records when the related web page was published on the Internet. The second dimension (vertical) shows that the specific topic appears on N websites, giving N items of vocabulary information to be evaluated that relate to the topic. In practice, the format of the two-dimensional data table may also be set according to actual needs.
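One possible in-memory representation of a row of Table 1 is sketched below. The class name `Record` and its field names are hypothetical; the patent only fixes the four columns, not any concrete data structure. The two rows shown are the first two rows of Table 1.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """One row of the two-dimensional data table (field names are assumed)."""
    topic: str        # specific topic
    location: str     # where the topic appears
    frequency: int    # how often the topic appears on that page
    published: date   # when the page was published


table = [
    Record("Mrs Zhou", "Website 1", 120, date(2008, 12, 18)),
    Record("Mrs Zhou", "Website 2", 34, date(2008, 12, 18)),
]

# Total number of occurrences across the listed websites.
total_frequency = sum(row.frequency for row in table)
```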
Step 102, extract part of the standardized data according to set conditions to form extracted data.
In step 102, the standardized data are extracted according to actual needs. For example, to extract the vocabulary information about "Mrs Zhou" that relates to "public welfare", character strings semantically close to "public welfare" are identified, such as "charity", "donation", "charity sale", "fundraising", and "disaster relief". The character strings corresponding to the vocabulary information to be evaluated are then fuzzy-matched against these known, semantically close character strings, which extracts the relevant standardized data and yields the extracted data. The extracted data have the same format as the standardized data obtained in step 101, as shown in Table 1. The standardized data may also be extracted according to other set conditions.
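The fuzzy matching in step 102 is not specified further in the text. As one possible illustration, the sketch below uses the similarity ratio of Python's standard `difflib.SequenceMatcher`. The seed words are English renderings of the example words in the text; the candidate words, the `extract` function name, and the 0.6 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Known strings that are semantically close to the chosen aspect.
SEEDS = ["charity", "donation", "charity sale", "fundraising", "disaster relief"]


def similarity(a, b):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def extract(candidates, seeds=SEEDS, threshold=0.6):
    """Keep the candidates that fuzzily match at least one seed string.
    The threshold is an arbitrary, assumed value."""
    return [
        word for word in candidates
        if any(similarity(word, seed) >= threshold for seed in seeds)
    ]


extracted = extract(["charitable", "donations", "movie", "concert"])
```

Character-level similarity is only a stand-in here; it captures spelling variants such as "donations" but not true semantic closeness, which would need a thesaurus or similar resource.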
Step 103, perform word-segmentation matching on the extracted data to form word-segmentation data, cluster the word-segmentation data, and store the clustered word-segmentation data by category.
In step 103, the word-segmentation data have the same format as the standardized data obtained in step 101, as shown in Table 1. The word-segmentation data are clustered according to the frequency with which the standardized data occur, and the clustered data are stored by category in a plurality of storage units, each of which stores word-segmentation data with similar lexical semantics. For example, in standardized data containing "film and television" and "songs", both words belong to the entertainment category, so they can be given the same lexical semantics and, after clustering, are both placed in the storage unit for the entertainment category. In standardized data containing "film and television" and "economy", the two words have entirely different lexical semantics: "film and television" belongs to the entertainment category and "economy" to the business category. They are therefore stored in separate storage units representing different lexical semantics.
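Clustering by frequency can be illustrated with a one-dimensional K-Means, which is one of the two clustering methods the document names. The sketch below is a minimal pure-Python version under stated assumptions: the function name `kmeans_1d`, the evenly spaced initial centroids, and the frequencies 3300 and 95 are hypothetical (120, 34, and 3482 come from Table 1), and the patent does not say which features are actually clustered.

```python
def kmeans_1d(values, k, iterations=20):
    """Minimal 1-D K-Means (assumes k >= 2 and a non-empty value list)."""
    lo, hi = min(values), max(values)
    # Deterministic start: k centroids spread evenly over the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # assignments have stabilised
            break
        centroids = updated
    return clusters, centroids


frequencies = [120, 34, 3482, 3300, 95]
clusters, centroids = kmeans_1d(frequencies, k=2)
```

Each resulting cluster would correspond to one storage unit in the description above.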
Step 104, perform lexical semantic analysis on each category of stored word-segmentation data, calculate the proportion of each kind of lexical semantic information, and calculate the lexical semantic parameter of the word-segmentation data from those proportions.
In step 104, the lexical semantic information stored by category is parsed according to a preset semantic database, the proportion of each kind of lexical semantic information is calculated, and the lexical semantic parameter of the word-segmentation data is calculated from those proportions. The categorized word-segmentation data yield different lexical semantic information. For example, suppose one category of stored lexical semantic information includes "film and television", "entertainment", and "public welfare". The extracted data are clustered according to the frequencies recorded for them in the two-dimensional data table, and the proportions of the lexical semantic information represented by the word-segmentation data are calculated from the clustering result. If "film and television" accounts for 20%, "entertainment" for 10%, and "public welfare" for 70%, the lexical semantic parameter of this lexical semantic information is determined to represent public welfare information. Suppose another category also includes "film and television", "entertainment", and "public welfare". If the proportions calculated in the same way are 60% for "film and television", 30% for "entertainment", and 10% for "public welfare", the lexical semantic parameter of this lexical semantic information is determined to represent film and television information.
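The proportion calculation in the two examples of step 104 can be sketched as follows. The function names are hypothetical, the labels are English renderings of the three categories in the example, and taking the label with the largest share as the lexical semantic parameter is an assumption that matches both examples (70% and 60%) but is not stated as a general rule in the text.

```python
def semantic_proportions(counts):
    """Turn per-label frequency counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def dominant_semantics(counts):
    """Assumed rule: the label with the largest proportion is taken
    as the lexical semantic parameter of the category."""
    shares = semantic_proportions(counts)
    return max(shares, key=shares.get)


# Counts chosen to reproduce the 20% / 10% / 70% example.
first = {"film and television": 20, "entertainment": 10, "public welfare": 70}
# Counts chosen to reproduce the 60% / 30% / 10% example.
second = {"film and television": 60, "entertainment": 30, "public welfare": 10}

first_label = dominant_semantics(first)
second_label = dominant_semantics(second)
```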
Step 105, perform a comprehensive measurement on the lexical semantic parameters to obtain an evaluation result.
In step 105, the evaluation result for the vocabulary information of a specific topic is derived from the lexical semantic parameters calculated for each category in step 104 and the proportions of the corresponding lexical semantic information in each category. Because the time and frequency of the vocabulary information obtained in step 101 change continually as it appears on the Internet, the evaluation result also changes over time.
As a specific example, to evaluate the specific topic "Mrs Zhou", information about "Mrs Zhou" in aspects such as "public welfare", "film and television", and "songs" needs to be known. Step 101 obtains the vocabulary information to be evaluated about "Mrs Zhou" in these aspects; steps 102 and 103 obtain the specific related words for "Mrs Zhou" in these aspects; step 104 obtains the measurement of "Mrs Zhou" in each of these aspects; and step 105 obtains the evaluation result for "Mrs Zhou".
The vocabulary information processing method of the invention clusters the vocabulary information obtained to determine its categories, calculates the proportion of each kind of lexical semantic information, calculates the lexical semantic parameter of each category from those proportions, and finally obtains an evaluation result by computing over the lexical semantic parameters. The vocabulary information is thereby measured objectively and comprehensively, and subjective, one-sided evaluation of the specific topic by Internet users is avoided.
Fig. 2 is a schematic flowchart of embodiment two of the vocabulary information processing method of the invention. As shown in Fig. 2, this embodiment comprises the following steps:
Step 201, retrieve automatically according to the lexical semantics of the vocabulary to be evaluated, and obtain the vocabulary information to be evaluated from the Internet.
In step 201, automatic retrieval is performed according to the semantics of the vocabulary to be evaluated, and the vocabulary information to be evaluated is obtained from the Internet. This automatic retrieval can be implemented by a web crawler with semantic recognition. The two-dimensional data table comprises: the vocabulary information to be evaluated, the location, the frequency of occurrence, and the time of occurrence. The structure of the table may also be set as needed.
Because the automatic retrieval is driven by the semantics of the vocabulary to be evaluated, fewer words are obtained from the Internet and storage space is saved.
Step 202, download the vocabulary information to be evaluated to a local database.
Step 203, generate a two-dimensional data table from the vocabulary information downloaded to the local database, thereby generating standardized data; the standardized data are stored in a two-dimensional data table format.
The two-dimensional data table format is the same as that shown in Table 1 and may also be set according to actual needs.
Step 204, extract part of the standardized data according to set conditions to form extracted data.
In step 204, the standardized data are extracted according to actual needs. For example, to extract the vocabulary information about "Mrs Zhou" that relates to "public welfare", character strings semantically close to "public welfare" are identified, such as "charity", "donation", "charity sale", "fundraising", and "disaster relief". The character strings corresponding to the vocabulary information to be evaluated are then fuzzy-matched against these known, semantically close character strings, which extracts the relevant standardized data and yields the extracted data. The extracted data have the same format as the standardized data obtained in step 201, as shown in Table 1. The standardized data may also be extracted according to other set conditions.
Step 205, look up the character strings corresponding to the extracted data in a local dictionary, compare them with the character strings in the local dictionary, generate word-segmentation data from the extracted data, cluster the word-segmentation data, and store the clustered word-segmentation data by category.
In step 205, the word-segmentation data have the same format as the standardized data obtained in step 201, as shown in Table 1. The local dictionary stores the character strings of a large number of known words. When word-segmentation matching is performed on the extracted data, a sentence in an ordinary web page or a headline on a site's home page can be cut into the character strings of individual words. For example, when "Beijing Olympic" is segmented, it is matched against the character strings of the two words "Beijing" and "Olympic Games" stored in the local dictionary, so the result is the two words "Beijing" and "Olympic Games" rather than uncommon strings produced by cutting at a different position. The word-segmentation data are counted according to the frequency with which the standardized data occur in web pages. The clustering can use the K-Means clustering method or the Kohonen neural network clustering method; a person of ordinary skill in the art can implement the cluster statistics with either method, so they are not described further here.
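Dictionary-based word segmentation of the kind described in step 205 is often implemented with forward maximum matching; the text itself only says that strings are looked up and compared in a local dictionary, so the choice of this technique is an assumption. The sketch below also assumes that the "Beijing Olympic" example corresponds to the Chinese string "北京奥运" with dictionary words "北京" and "奥运"; the function name `segment` and the tiny dictionary are hypothetical.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Assumed local dictionary and input string for the example.
LOCAL_DICTIONARY = {"北京", "奥运"}
result = segment("北京奥运", LOCAL_DICTIONARY)
```

Because the longest dictionary word is tried first at every position, the string is cut at word boundaries known to the dictionary rather than at arbitrary positions.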
Step 206, according to the semantics of the character strings stored in a preset semantic database, interpret the lexical semantics of the character strings corresponding to the categorized word-segmentation data, obtain the lexical semantic parameter, and calculate the proportion of each kind of lexical semantic information.
In step 206, the semantic database stores a large number of known words, so parsing the word-segmentation data against these words yields their lexical semantic information. Combined with the clustering result of step 205, the proportion of each kind of lexical semantic information in the word-segmentation data is calculated, and the lexical semantic parameter of the word-segmentation data can be calculated from those proportions. For example, if the clustering result gives 20% for "film and television", 10% for "entertainment", and 70% for "public welfare", the lexical semantic parameter of this lexical semantic information is determined to represent public welfare information.
Step 207, perform a comprehensive measurement on the lexical semantic parameters to obtain an evaluation result.
In step 207, the evaluation result for the vocabulary information of a specific topic is derived from the lexical semantic parameters calculated for each category in step 206 and the proportions of the corresponding lexical semantic information. Because the time and frequency of the vocabulary information obtained in step 201 change continually as it appears on the Internet, the evaluation result also changes over time.
Embodiment two of the vocabulary information processing method of the invention retrieves automatically according to the semantics of the vocabulary information to be evaluated, so the vocabulary information can be downloaded to the local database selectively, which saves storage space. The method clusters the vocabulary information obtained to determine its categories, calculates the proportion of each kind of lexical semantic information, calculates the lexical semantic parameter of each category from those proportions, and finally obtains an evaluation result by computing over the lexical semantic parameters. The vocabulary information is thereby measured objectively and comprehensively, subjective, one-sided evaluation of the specific topic by Internet users is avoided, and the evaluation result changes over time.
Fig. 3 is a schematic structural diagram of embodiment one of the vocabulary information processing system of the invention. As shown in Fig. 3, the vocabulary information processing system comprises: an acquisition module 31, an extraction module 32, a word-frequency clustering module 33, a lexical semantic parsing module 34, and a semantic measurement module 35.
The acquisition module 31 obtains vocabulary information to be evaluated from the Internet and generates standardized data from it. The extraction module 32 extracts part of the standardized data of the acquisition module 31 according to set conditions to form extracted data. The word-frequency clustering module 33 performs word-segmentation matching on the extracted data of the extraction module 32 to form word-segmentation data, clusters the word-segmentation data, and stores the clustered word-segmentation data by category. The lexical semantic parsing module 34 performs lexical semantic analysis on the word-segmentation data stored by category in the word-frequency clustering module 33, calculates the proportion of each kind of lexical semantic information, and calculates the lexical semantic parameter of the word-segmentation data from those proportions. The semantic measurement module 35 measures the lexical semantic parameters of the lexical semantic parsing module 34 and obtains an evaluation result.
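The way the five modules hand data to one another can be illustrated with a toy pipeline. Everything in this sketch is a simplifying assumption: the function names, the whitespace-based word splitting (standing in for dictionary-based segmentation and clustering), the small semantic database, and the rule that the largest share decides the result are not taken from the patent; only the order of the stages is.

```python
from collections import Counter


def acquire(pages):
    """Acquisition module 31 (toy): normalise the raw page texts."""
    return [page.lower() for page in pages]


def extract(texts, topic="mrs zhou"):
    """Extraction module 32 (toy): keep only texts mentioning the topic."""
    return [text for text in texts if topic in text]


def count_words(texts):
    """Word-frequency clustering module 33 (toy): whitespace split and count."""
    return Counter(word for text in texts for word in text.split())


def parse(word_counts, semantic_db):
    """Lexical semantic parsing module 34 (toy): map words to labels
    and compute the proportion of each label."""
    labels = Counter()
    for word, count in word_counts.items():
        if word in semantic_db:
            labels[semantic_db[word]] += count
    total = sum(labels.values())
    return {label: n / total for label, n in labels.items()} if total else {}


def measure(shares):
    """Semantic measurement module 35 (toy): report the largest share."""
    return max(shares, key=shares.get) if shares else None


def evaluate(pages, semantic_db):
    """Chain the five stages in the order given in the description."""
    return measure(parse(count_words(extract(acquire(pages))), semantic_db))


SEMANTIC_DB = {"charity": "public welfare", "donation": "public welfare",
               "film": "film and television"}
PAGES = ["Mrs Zhou charity donation", "Mrs Zhou film", "Weather today"]
verdict = evaluate(PAGES, SEMANTIC_DB)
```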
Lexical information disposal system embodiment one of the present invention carries out clustering processing by word frequency cluster module 33 with the lexical information to be measured that is obtained, the speech data of cutting in 34 pairs of word frequency clusters of lexical semantic parsing module module 33 after the classification and storage are carried out the lexical semantic information analysis, calculate the rate of specific gravity of lexical semantic information, calculate the lexical semantic parameter of cutting the speech data according to rate of specific gravity, classify by cutting the speech data, the classification of objective judgement lexical information to be measured has realized lexical information to be measured is carried out objective comprehensive comprehensive evaluation and test.
Fig. 4 is a schematic structural diagram of embodiment two of the vocabulary information processing system of the present invention. As shown in Fig. 4, the vocabulary information processing system comprises: an acquisition module 41, an extraction module 42, a word-frequency clustering module 43, a lexical semantic parsing module 44 and a semantic measurement module 45.
The acquisition module 41 comprises an automatic retrieval unit 411 and a local database 412. The word-frequency clustering module 43 comprises a word segmentation unit 431, a clustering unit 432 and a storage unit 433. The lexical semantic parsing module 44 comprises a semantic parsing unit 441 and a semantic measuring unit 442. Optionally, the lexical semantic parsing module 44 may also comprise a recording unit 443, which records word-segmentation data not yet recorded by the semantic parsing unit 441 and feeds the recorded word-segmentation data back to the semantic parsing unit 441.
The automatic retrieval unit 411 of the acquisition module 41 retrieves automatically according to the lexical semantics of the vocabulary to be measured, obtains the vocabulary information to be measured from the Internet, and downloads and saves it to the local database 412 of the acquisition module 41. The local database 412 generates a two-dimensional data table from the vocabulary information to be measured, thereby generating the standardized data; that is, the standardized data uses a two-dimensional data table format. The two-dimensional data table may specifically include: the vocabulary information to be measured, and the position, frequency and time of its occurrence.
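The two-dimensional data table described above can be pictured as a list of uniform rows. The following is a minimal sketch only: the class name `StandardizedRow`, the field names, the helper functions and the example URLs are all illustrative assumptions, since the text names the four columns but not a concrete schema. The `extract` helper shows one possible reading of "extracting part of the data according to set conditions", with the condition passed in as a predicate.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StandardizedRow:
    """One row of the two-dimensional data table (field names are assumptions)."""
    text: str        # the vocabulary information to be measured
    position: str    # where it occurred, e.g. a page address
    frequency: int   # how often it occurred
    time: date       # when it occurred


def standardize(raw_records):
    """Turn raw (text, position, frequency, time) tuples into uniform rows."""
    return [StandardizedRow(t.strip(), p, int(f), d) for t, p, f, d in raw_records]


def extract(rows, condition):
    """Keep only the rows that satisfy the set condition (a predicate)."""
    return [row for row in rows if condition(row)]


# Hypothetical downloaded records; the values are made up for illustration.
rows = standardize([
    (" 手机质量很好 ", "http://example.com/a", "3", date(2009, 1, 5)),
    ("手机电池不好", "http://example.com/b", "1", date(2008, 12, 20)),
])
recent = extract(rows, lambda r: r.time >= date(2009, 1, 1))
```

Keeping every row in the same shape is what lets a later step filter by any column (time, position, frequency) without knowing where the text originally came from.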
The extraction module 42 extracts part of the data from the standardized data of the acquisition module 41 according to set conditions, forming extracted data. The word segmentation unit 431 of the word-frequency clustering module 43 looks up, in a local dictionary, the character strings corresponding to the extracted data generated by the extraction module 42, and generates word-segmentation data. The clustering unit 432 clusters the word-segmentation data generated by the word segmentation unit 431 according to frequency of occurrence, and after clustering the data is stored in the storage unit 433. The storage unit 433 comprises at least two storage sub-units.
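The dictionary lookup and the frequency-based clustering can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the text does not say which matching strategy the word segmentation unit uses, so forward maximum matching is assumed here, and the clustering is shown as a one-dimensional two-cluster K-Means over term frequencies (claim 4 names K-Means as one option). The function names, the tiny dictionary and the mapping of the two groups to two storage sub-units are all hypothetical.

```python
from collections import Counter


def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary entry.

    A single character is emitted when no dictionary entry matches.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def cluster_by_frequency(counts, max_iter=20):
    """Split terms into a high- and a low-frequency group (1-D K-Means, k = 2)."""
    if not counts:
        return {"high": set(), "low": set()}
    lo, hi = float(min(counts.values())), float(max(counts.values()))
    high, low = set(counts), set()
    for _ in range(max_iter):
        # Assign each term to the nearer of the two centres (ties go to "high").
        high = {t for t, c in counts.items() if abs(c - hi) <= abs(c - lo)}
        low = set(counts) - high
        if not low:
            break
        new_hi = sum(counts[t] for t in high) / len(high)
        new_lo = sum(counts[t] for t in low) / len(low)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return {"high": high, "low": low}


dictionary = {"手机", "质量", "很好", "电池", "不好"}   # hypothetical local dictionary
tokens = segment("手机质量很好手机电池不好", dictionary)
groups = cluster_by_frequency(Counter(tokens))       # one group per storage sub-unit
```

With this input, the term that occurs twice ends up in the high-frequency group and the remaining terms in the low-frequency group, which is one way to read "at least two storage sub-units".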
The semantic parsing unit 441 is provided with a semantic database and performs lexical semantic information parsing on the word-segmentation data of the word-frequency clustering module 43 that has been classified and stored in the storage unit 433. The semantic measuring unit 442 calculates the lexical semantic parameter of the word-segmentation data according to the lexical semantic information parsed by the semantic parsing unit 441.
Specifically, the semantic measuring unit 442 calculates the weight value of the lexical semantic information and calculates the lexical semantic parameter of the word-segmentation data according to the weight value. The semantic measurement module 45 performs comprehensive measurement on the lexical semantic parameter of the lexical semantic parsing module 44 to obtain an evaluation result.
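This passage does not give a formula for the weight value or for the lexical semantic parameter, so the sketch below rests entirely on assumptions: the semantic database is modelled as a mapping from a term to a semantic category, the weight value of a category is taken to be its share of all recognized occurrences, and the lexical semantic parameter is taken to be the weight-averaged category score. The names `weight_values`, `semantic_parameter`, the categories and the scores are hypothetical.

```python
def weight_values(counts, semantic_db):
    """Share of recognized occurrences falling in each semantic category (assumed formula)."""
    totals = {}
    for term, n in counts.items():
        category = semantic_db.get(term)
        if category is not None:          # terms missing from the database are skipped
            totals[category] = totals.get(category, 0) + n
    known = sum(totals.values())
    return {c: n / known for c, n in totals.items()} if known else {}


def semantic_parameter(weights, category_scores):
    """Weighted sum of per-category scores (assumed definition of the parameter)."""
    return sum(w * category_scores[c] for c, w in weights.items())


counts = {"很好": 3, "不好": 1, "手机": 4}                 # term frequencies
semantic_db = {"很好": "positive", "不好": "negative"}     # hypothetical semantic database
weights = weight_values(counts, semantic_db)
parameter = semantic_parameter(weights, {"positive": 1.0, "negative": -1.0})
```

Under these assumptions a parameter near 1 would indicate mostly positive vocabulary and a parameter near -1 mostly negative vocabulary; a different scoring scheme would change only the `category_scores` argument.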
The lexical semantic parsing module 44 may also comprise the recording unit 443, which records word-segmentation data not yet recorded in the lexical semantic parsing module and feeds the recorded word-segmentation data back to the semantic parsing unit 441. When the lexical semantic parsing module 44 performs semantic parsing, it can record the vocabulary of unknown semantics encountered by the semantic parsing unit 441 into the recording unit 443. After semantic vocabulary self-learning, the recording unit 443 turns the vocabulary of unknown semantics into vocabulary of known semantics, thereby enlarging the range of semantics that the semantic parsing unit 441 can parse.
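The feedback loop between the semantic parsing unit and the recording unit can be sketched as below. The text does not describe how the "self-learning" is carried out, so in this sketch the new labels are simply supplied from outside; the class `RecordingUnit`, the function `parse` and the example terms are illustrative assumptions.

```python
class RecordingUnit:
    """Collects terms the semantic database does not know yet."""

    def __init__(self):
        self.unknown = set()

    def record(self, term):
        self.unknown.add(term)

    def learn(self, labels, semantic_db):
        """Feed newly labelled terms back into the semantic database.

        `labels` stands in for the self-learning step, which the text leaves open.
        """
        learned = {t: labels[t] for t in self.unknown if t in labels}
        semantic_db.update(learned)
        self.unknown -= set(learned)
        return learned


def parse(tokens, semantic_db, recorder):
    """Look up each token; unknown tokens go to the recording unit."""
    parsed = {}
    for token in tokens:
        if token in semantic_db:
            parsed[token] = semantic_db[token]
        else:
            recorder.record(token)
    return parsed


semantic_db = {"很好": "positive"}          # hypothetical semantic database
recorder = RecordingUnit()
first = parse(["很好", "给力"], semantic_db, recorder)   # "给力" is unknown here
recorder.learn({"给力": "positive"}, semantic_db)
second = parse(["很好", "给力"], semantic_db, recorder)  # now both terms are known
```

The second pass recognizes a term the first pass could not, which is the sense in which the parsing range is enlarged.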
In embodiment two of the vocabulary information processing system of the present invention, the recording unit 443, after semantic vocabulary self-learning, turns vocabulary of unknown semantics into vocabulary of known semantics, thereby enlarging the range of semantics that the semantic parsing unit 441 can parse. The word-frequency clustering module 43 clusters the obtained vocabulary information to be measured, and the lexical semantic parsing module 44 performs lexical semantic information parsing on the word-segmentation data classified and stored in the word-frequency clustering module 43, calculates the weight value of the lexical semantic information, and calculates the lexical semantic parameter of the word-segmentation data according to the weight value. By classifying the word-segmentation data, the category of the vocabulary information to be measured is judged objectively, so that the vocabulary information to be measured is evaluated objectively and comprehensively.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A vocabulary information processing method, characterized by comprising the following steps:
obtaining vocabulary information to be measured from the Internet, and generating standardized data from the vocabulary information to be measured, the standardized data being stored in a two-dimensional data table format;
extracting part of the data from the standardized data according to set conditions to form extracted data;
performing word-segmentation matching on the extracted data to form word-segmentation data, clustering the word-segmentation data, and storing the clustered word-segmentation data by class;
performing lexical semantic information parsing separately on the classified and stored word-segmentation data, calculating a weight value of the lexical semantic information, and calculating a lexical semantic parameter of the word-segmentation data according to the weight value;
performing comprehensive measurement on the lexical semantic parameter to obtain an evaluation result.
2. The vocabulary information processing method according to claim 1, characterized in that obtaining the vocabulary information to be measured from the Internet and generating standardized data from the vocabulary information to be measured specifically comprises:
retrieving automatically according to the lexical semantics of the vocabulary to be measured, and obtaining the vocabulary information to be measured from the Internet;
downloading the vocabulary information to be measured to a local database;
generating the standardized data from the vocabulary information to be measured that has been downloaded to the local database.
3. The vocabulary information processing method according to claim 1, characterized in that performing word-segmentation matching on the extracted data to form word-segmentation data specifically comprises:
looking up, in a local dictionary, the character strings corresponding to the extracted data, comparing the character strings corresponding to the extracted data with the character strings in the local dictionary, and generating the word-segmentation data from the extracted data.
4. The vocabulary information processing method according to claim 1, characterized in that the clustering of the word-segmentation data uses a K-Means clustering method or a Kohonen neural network clustering method.
5. The vocabulary information processing method according to claim 1, characterized in that performing lexical semantic information parsing on the classified and stored word-segmentation data specifically comprises:
interpreting, according to the semantics of the character strings stored in a preset semantic database, the lexical semantics of the character strings corresponding to the classified and stored word-segmentation data, obtaining the lexical semantic parameter, and calculating the weight value of the lexical semantic information.
6. A vocabulary information processing system, characterized by comprising:
an acquisition module, configured to obtain vocabulary information to be measured from the Internet and to generate standardized data from the vocabulary information to be measured, wherein the standardized data is stored in a two-dimensional data table format;
an extraction module, configured to extract part of the data from the standardized data of the acquisition module according to set conditions to form extracted data;
a word-frequency clustering module, configured to perform word-segmentation matching on the extracted data of the extraction module to form word-segmentation data, to cluster the word-segmentation data, and to store the clustered word-segmentation data by class;
a lexical semantic parsing module, configured to perform lexical semantic information parsing on the word-segmentation data classified and stored by the word-frequency clustering module, to calculate a weight value of the lexical semantic information, and to calculate a lexical semantic parameter of the word-segmentation data according to the weight value;
a semantic measurement module, configured to measure the lexical semantic parameter of the lexical semantic parsing module to obtain an evaluation result.
7. The vocabulary information processing system according to claim 6, characterized in that the acquisition module comprises:
an automatic retrieval unit, configured to retrieve automatically according to the lexical semantics of the vocabulary to be measured and to obtain the vocabulary information to be measured from the Internet;
a local database, configured to save the vocabulary information to be measured obtained from the automatic retrieval unit and to generate standardized data from the vocabulary information to be measured.
8. The vocabulary information processing system according to claim 6, characterized in that the word-frequency clustering module comprises:
a word segmentation unit, configured to look up, in a local dictionary, the character strings corresponding to the extracted data, to compare the character strings corresponding to the extracted data with the character strings in the local dictionary, and to generate the word-segmentation data from the extracted data;
a clustering unit, configured to cluster the word-segmentation data;
a storage unit, configured to store the word-segmentation data clustered by the clustering unit.
9. The vocabulary information processing system according to claim 8, characterized in that the storage unit comprises at least two storage sub-units.
10. The vocabulary information processing system according to claim 6, characterized in that the lexical semantic parsing module comprises:
a semantic parsing unit, provided with a semantic database and configured to perform lexical semantic information parsing on the word-segmentation data classified and stored by the word-frequency clustering module;
a semantic measuring unit, configured to calculate the weight value of the lexical semantic information and to calculate the lexical semantic parameter of the word-segmentation data according to the weight value;
a recording unit, configured to record word-segmentation data of the word-frequency clustering module that has not yet been recorded, and to feed the recorded word-segmentation data back to the semantic parsing unit.
CN200910077558A 2009-01-22 2009-01-22 Vocabulary information processing method and system Pending CN101788989A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910077558A CN101788989A (en) 2009-01-22 2009-01-22 Vocabulary information processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910077558A CN101788989A (en) 2009-01-22 2009-01-22 Vocabulary information processing method and system

Publications (1)

Publication Number Publication Date
CN101788989A true CN101788989A (en) 2010-07-28

Family

ID=42532205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910077558A Pending CN101788989A (en) 2009-01-22 2009-01-22 Vocabulary information processing method and system

Country Status (1)

Country Link
CN (1) CN101788989A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063460A (en) * 2010-10-19 2011-05-18 蔡亮华 Information processing method and device
WO2017107518A1 (en) * 2015-12-25 2017-06-29 乐视控股(北京)有限公司 Method and apparatus for parsing voice content

Similar Documents

Publication Publication Date Title
CN112699246B (en) Domain knowledge pushing method based on knowledge graph
CN102144229B (en) System for extracting term from document containing text segment
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
CN102207948B (en) Method for generating incident statement sentence material base
CN111026671B (en) Test case set construction method and test method based on test case set
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN106156204A (en) The extracting method of text label and device
CN102890702A (en) Internet forum-oriented opinion leader mining method
CN101609450A (en) Web page classification method based on training set
KR100974064B1 (en) System for providing information adapted to users and method thereof
Nagar et al. Using text and data mining techniques to extract stock market sentiment from live news streams
CN109960756A (en) Media event information inductive method
KR102361597B1 (en) A program recording medium on which a program for labeling sentiment information in news articles using big data is recoded
US20160110471A1 (en) Method and system of intelligent generation of structured data and object discovery from the web using text, images, video and other data
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN109857956A (en) The automatic abstracting method of news web page key message based on label and blocking characteristic
CN110163688A (en) Commodity network public sentiment detection system
CN103049581A (en) Web text classification method based on consistency clustering
CN106886512A (en) Article sorting technique and device
KR102371505B1 (en) A program for labeling news articles using big data
KR102382681B1 (en) A program for labeling sentiment information in news articles using big data
CN101788989A (en) Vocabulary information processing method and system
CN103823847A (en) Keyword extension method and device
Liao et al. Improving farm management optimization: Application of text data analysis and semantic networks
Li-Juan et al. A classification method of Vietnamese news events based on maximum entropy model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100728