CN106649334A

CN106649334A - Conjunction word set processing method and device

Info

Publication number: CN106649334A
Application number: CN201510726038.1A
Authority: CN
Inventors: 梁梦溪; 何鑫
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2015-10-29
Filing date: 2015-10-29
Publication date: 2017-05-10
Anticipated expiration: 2035-10-29
Also published as: CN106649334B

Abstract

The invention discloses a conjunction word set processing method and device, wherein the processing method comprises the steps of crawling a web text from a target data source on the basis of conjunction words in a conjunction word set of an object to be analyzed; performing word segmentation on the web text to obtain a plurality of text vocabularies, and obtaining the vocabulary information of each text vocabulary, wherein the vocabulary information includes conjunction index data of each text vocabulary and/or information of part of speech of each text vocabulary, and the conjunction index data is used for indicating the conjunction degree of each text vocabulary and the conjunction words; screening the conjunction index data of a plurality of text vocabularies and/or information of part of speech of a plurality of text vocabularies, and obtaining the screened conjunction vocabularies; and updating the conjunction word set by using the screened conjunction vocabularies. The method and the device provided by the invention solve the technical problem of small vocabulary quantity of the existing word bag accumulating method.

Description

Processing method and processing device for a related word set

Technical field

The present application relates to the Internet field, and in particular to a processing method and a processing device for a related word set.

Background technology

When an enterprise releases a product or a service, when a government department promulgates a policy, or when a breaking event draws public attention, related news reported by online media inevitably appears on the Internet, and this news attracts the attention and discussion of Internet users. When network public-opinion content (that is, network text related to the object) is collected for an object to be analyzed (for example, a current event, product, person, or policy), the information is gathered by having a web crawler crawl network text related to the object. Because the crawl does not distinguish whether content is actually relevant to the object, the network text must be screened after it is crawled in order to select the content related to the object to be analyzed.

Typically, when text is screened, a number of judgment conditions are set to decide whether a piece of network text is related to the object to be analyzed. The set of content related to the object to be analyzed is treated as a "word bag", and the content of the word bag stands in for the analysis object when network text is screened and filtered. This process may also be called word bag accumulation.

The basic method of existing word bag accumulation relies on manual entry, mostly using the following word combinations: using the name of the object to be analyzed as the word bag; using the combination of the name of the object to be analyzed and its synonyms as the word bag; and using the combination of the name of the object to be analyzed and the names of competing products as the word bag. The shortcomings of the existing word bag accumulation method are therefore: the vocabulary is small; how closely a word is related to the analysis object cannot be quantified; building the vocabulary manually is time-consuming and inefficient; and scalability is poor.

No effective solution has yet been proposed for the problem that the existing word bag accumulation method yields a small vocabulary.

Summary of the invention

The embodiments of the present application provide a processing method and a processing device for a related word set, to at least solve the technical problem that the existing word bag accumulation method yields a small vocabulary.

According to one aspect of the embodiments of the present application, a processing method for a related word set is provided. The processing method includes: crawling network text from a target data source based on the related words in the related word set of an object to be analyzed; performing word segmentation on the network text to obtain multiple text vocabulary items, and obtaining the lexical information of each text vocabulary item, wherein the lexical information includes the coupling index data of each text vocabulary item and/or the part-of-speech information of each text vocabulary item, and the coupling index data indicates the degree of association between each text vocabulary item and the related word; screening the coupling index data and/or the part-of-speech information of the multiple text vocabulary items according to preset screening conditions to obtain the screened association vocabulary; and updating the related word set with the screened association vocabulary.

Further, performing word segmentation on the network text to obtain multiple text vocabulary items and obtaining the lexical information of each text vocabulary item includes: after word segmentation is performed on the network text to obtain the multiple text vocabulary items, creating a text dictionary of the multiple text vocabulary items; and determining the coupling index data of each text vocabulary item in the text dictionary according to preset correlation criteria, and/or extracting the part-of-speech information of each text vocabulary item in the text dictionary.

Further, determining the coupling index data of each text vocabulary item in the text dictionary according to the preset correlation criteria includes: if there is one preset correlation criterion, obtaining the relevance value of each text vocabulary item under that criterion as the coupling index data of the text vocabulary item; if there are multiple preset correlation criteria, obtaining the relevance value of each text vocabulary item under each preset correlation criterion, performing a mixing operation on all relevance values of each text vocabulary item, and using the result of the mixing operation as the coupling index data of the text vocabulary item, wherein the mixing operation includes at least one of weighted calculation, addition, and multiplication or division.

Further, determining the coupling index data of each text vocabulary item in the text dictionary according to the preset correlation criteria includes: using the number of times each text vocabulary item meets the preset correlation criteria as the coupling index data of that text vocabulary item, wherein the preset correlation criteria include: the text vocabulary item and the related word appear together in the same sentence of the network text; and/or the text vocabulary item and the related word appear, with the same part of speech, at the same position in sentences of the network text.

Further, screening the coupling index data and/or the part-of-speech information of the multiple text vocabulary items according to the preset screening conditions to obtain the screened association vocabulary includes: using the text vocabulary items whose coupling index data falls within a preset range as the screened association vocabulary; or using the text vocabulary items whose coupling index data ranks in the top N among the coupling index data of the multiple text vocabulary items as the screened association vocabulary; or using the text vocabulary items whose lexical information indicates a preset part of speech as the screened association vocabulary.

Further, updating the related word set with the screened association vocabulary includes: replacing the related words with the screened association vocabulary to update the related word set; or adding the screened association vocabulary to the related word set to update the related word set.

According to another aspect of the embodiments of the present application, a processing device for a related word set is further provided. The processing device includes: a crawling unit, configured to crawl network text from a target data source based on the related words in the related word set of an object to be analyzed; a processing unit, configured to perform word segmentation on the network text to obtain multiple text vocabulary items and to obtain the lexical information of each text vocabulary item, wherein the lexical information includes the coupling index data of each text vocabulary item and/or the part-of-speech information of each text vocabulary item, and the coupling index data indicates the degree of association between each text vocabulary item and the related word; a screening unit, configured to screen the coupling index data and/or the part-of-speech information of the multiple text vocabulary items according to preset screening conditions to obtain the screened association vocabulary; and an updating unit, configured to update the related word set with the screened association vocabulary.

Further, the processing unit includes: a creation module, configured to create a text dictionary of the multiple text vocabulary items after word segmentation is performed on the network text to obtain the multiple text vocabulary items; and a determining module, configured to determine the coupling index data of each text vocabulary item in the text dictionary according to preset correlation criteria, and/or to extract the part-of-speech information of each text vocabulary item in the text dictionary.

Further, the determining module includes: a first calculating submodule, configured to, if there is one preset correlation criterion, obtain the relevance value of each text vocabulary item under that criterion as the coupling index data of the text vocabulary item; and a second calculating submodule, configured to, if there are multiple preset correlation criteria, obtain the relevance value of each text vocabulary item under each preset correlation criterion, perform a mixing operation on all relevance values of each text vocabulary item, and use the result of the mixing operation as the coupling index data of the text vocabulary item, wherein the mixing operation includes at least one of weighted calculation, addition, and multiplication or division.

Further, the determining module includes: a determining submodule, configured to use the number of times each text vocabulary item meets the preset correlation criteria as the coupling index data of that text vocabulary item, wherein the preset correlation criteria include: the text vocabulary item and the related word appear together in the same sentence of the network text; and/or the text vocabulary item and the related word appear, with the same part of speech, at the same position in sentences of the network text.

In the embodiments of the present application, after the web crawler crawls network text from the target data source based on the related words in the related word set of the object to be analyzed, word segmentation is performed on the network text to obtain multiple text vocabulary items, and the lexical information of each text vocabulary item is obtained. The coupling index data or the part-of-speech information of the multiple text vocabulary items is then screened according to preset screening conditions, and the screened association vocabulary is used to update the related word set. With the above embodiments, network text crawled indiscriminately can be segmented and screened to obtain association vocabulary for updating the related word set. By repeating the segmentation and screening, the related word set is continuously expanded and updated, which solves the problem that the existing word bag accumulation method yields a small vocabulary and improves the related word set of the object to be analyzed.

Description of the drawings

The accompanying drawings described here provide a further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not constitute an improper limitation on it. In the drawings:

Fig. 1 is a flow chart of a processing method for a related word set according to an embodiment of the present application;

Fig. 2 is a flow chart of another optional processing method for a related word set according to an embodiment of the present application; and

Fig. 3 is a schematic diagram of a processing device for a related word set according to an embodiment of the present application.

Detailed description of the embodiments

To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application, without creative work, shall fall within the scope of protection of the present application.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that data used in this way can be interchanged where appropriate, so that the embodiments of the present application described here can be implemented in an order other than the one illustrated or described. In addition, the terms "comprising" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units that are explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.

Definition of terms:

Analysis object: an object whose public-opinion content is to be analyzed on the basis of network text content. It may be a current event, product, person, policy, and so on.

Corpus: the network text crawled by the crawler.

Dictionary vocabulary: the lexicon obtained after the text in the corpus is segmented, stored as individual words together with the relations between words.

Relevance: the degree of closeness between multiple objects (words).

Screening logic: the conditions and algorithm used to screen words.

Word bag: a set of words that stands in for the analysis object and is used to screen the network text in the corpus, so that the information related to the analysis object is picked out of it.

Embodiment 1

According to the embodiments of the present application, an embodiment of a processing method for a related word set is provided. It should be noted that the steps shown in the flow charts of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flow charts, in some cases the steps shown or described may be executed in a different order.

Fig. 1 is a flow chart of a processing method for a related word set according to an embodiment of the present application. As shown in Fig. 1, the processing method includes the following steps:

Step S102: crawl network text from a target data source based on the related words in the related word set of the object to be analyzed.

Step S104: perform word segmentation on the network text to obtain multiple text vocabulary items, and obtain the lexical information of each text vocabulary item, wherein the lexical information includes the coupling index data of each text vocabulary item and/or the part-of-speech information of each text vocabulary item, and the coupling index data indicates the degree of association between each text vocabulary item and the related word.

Step S106: screen the coupling index data and/or the part-of-speech information of the multiple text vocabulary items according to preset screening conditions to obtain the screened association vocabulary.

Step S108: update the related word set with the screened association vocabulary.

With this embodiment of the present application, after the web crawler crawls network text from the target data source based on the current related words in the related word set of the object to be analyzed, word segmentation is performed on the network text to obtain multiple text vocabulary items, and the lexical information of each text vocabulary item is obtained. The coupling index data or the part-of-speech information of the multiple text vocabulary items is then screened according to preset screening conditions, and the screened association vocabulary is used to update the related word set.

With the above embodiment, network text crawled indiscriminately can be segmented and screened to obtain association vocabulary for updating the related word set. By repeating the segmentation and screening, the related word set is continuously expanded and updated, which solves the problem that the existing word bag accumulation method yields a small vocabulary and improves the related word set of the object to be analyzed.

In the above embodiment, an initial corpus can be built from a large amount of indiscriminately crawled network text. After the network text in the initial corpus is segmented, the relevance between each segmented dictionary word (the text vocabulary item above) and the name of the analysis object (the related word above) is calculated by a chosen method, and reasonable word-screening logic is used to select the qualifying dictionary words (the text vocabulary items that meet the preset screening conditions, that is, the association vocabulary above) to form the word bag. By repeating these steps, the word bag can be continuously expanded, improving the word bag content for the analysis object (the related word set above).
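The loop described above can be sketched as follows. This is only an illustrative outline, not the patented implementation: the function name `accumulate_word_bag` and the four callables `crawl`, `segment`, `score`, and `screen` are hypothetical placeholders for steps S102 to S108, and the additive update is one of the two update modes the text allows.

```python
def accumulate_word_bag(word_bag, crawl, segment, score, screen, rounds=1):
    """Repeat crawl -> segment -> score -> screen -> update for a number of rounds.

    All four callables are hypothetical stand-ins supplied by the caller:
      crawl(bag)              -> list of network texts            (step S102)
      segment(texts)          -> set of text vocabulary items     (step S104)
      score(word, bag, texts) -> coupling index value of one item (step S104)
      screen(index)           -> iterable of selected words       (step S106)
    """
    bag = set(word_bag)
    for _ in range(rounds):
        texts = crawl(bag)
        vocabulary = segment(texts)
        # Score only words that are not already in the word bag.
        index = {w: score(w, bag, texts) for w in vocabulary if w not in bag}
        # Step S108, additive variant: merge the selected words into the bag.
        bag |= set(screen(index))
    return bag
```

Passing the steps in as functions keeps the sketch neutral about the crawler, the segmenter, and the scoring rule, none of which the text fixes.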

Specifically, indiscriminate crawling may mean that no particular keywords are set and that all content updated on the network within a period of time is crawled. For example, with one crawl per day, all articles, comments, and other content added to a website on the previous day are crawled, and content that has already been crawled is not crawled again.
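A minimal sketch of the "do not crawl again" rule, assuming each crawled item carries a unique identifier. The `id` field and the function name `select_new_items` are assumptions made for illustration; the text does not say how previously crawled content is recognized.

```python
def select_new_items(items, seen_ids):
    """Return only items not seen before and remember their identifiers.

    `items` is a list of dicts with a hypothetical "id" key;
    `seen_ids` is a set that persists between crawl runs.
    """
    new_items = [item for item in items if item["id"] not in seen_ids]
    seen_ids.update(item["id"] for item in new_items)
    return new_items
```

In a daily crawl, the same `seen_ids` set would be passed on every run so that only newly added articles and comments are kept.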

Optionally, before network text is crawled from the target data source based on the related words in the related word set of the object to be analyzed, the name of the analysis object (the related word in the related word set of the object to be analyzed) may first be determined. Specifically, the object to be analyzed is determined, and its name may be used as the initial word bag content.

In an optional embodiment, an initial corpus may be established after the network text is crawled. For the determined object to be analyzed (the related word in its related word set), a certain amount of text content (the network text above) is crawled indiscriminately from its target data sources (for example, websites, forums, mhkc, etc.) and used as the initial corpus for the analysis object. The more text the initial corpus contains, the more it helps improve the accuracy of the relevance calculation described below.

Optionally, performing word segmentation on the network text to obtain multiple text vocabulary items and obtaining the lexical information of each text vocabulary item includes: after word segmentation is performed on the network text to obtain the multiple text vocabulary items, creating a text dictionary of the multiple text vocabulary items; and determining the coupling index data of each text vocabulary item in the text dictionary according to preset correlation criteria, and/or extracting the part-of-speech information of each text vocabulary item in the text dictionary.

In the above embodiment, after word segmentation is performed on the network text crawled from the target data source to obtain multiple text vocabulary items, a text dictionary of the multiple text vocabulary items is created. Then the coupling index data between each text vocabulary item in the text dictionary and the current related word is determined according to the preset correlation criteria, or the part-of-speech information of each text vocabulary item in the text dictionary is extracted, or both are done together. The coupling index data or the part-of-speech information of the multiple text vocabulary items is then screened according to the preset screening conditions to obtain the screened association vocabulary, which is used to update the related word set.

With the above embodiment, a text dictionary can be created after word segmentation to record the lexical information of the text vocabulary items. This makes the lexical information easy to extract, so that information can be obtained quickly and accurately for word bag accumulation.

Specifically, the crawled network text can be used as the initial corpus. Word segmentation is then performed on the text content (the network text) in the initial corpus, and a dictionary (the text dictionary) containing all words (the text vocabulary items) in the text is built.
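As a rough illustration of building the text dictionary, the sketch below records every occurrence of each word as a (sentence index, word position) pair. The layout of the dictionary and the name `build_text_dictionary` are assumptions, and the default whitespace tokenizer is only a stand-in: Chinese text would need a real word segmenter passed in as `tokenize`.

```python
from collections import defaultdict


def build_text_dictionary(sentences, tokenize=str.split):
    """Map each word to the list of (sentence_index, position) where it occurs."""
    dictionary = defaultdict(list)
    for sentence_index, sentence in enumerate(sentences):
        for position, token in enumerate(tokenize(sentence)):
            dictionary[token].append((sentence_index, position))
    return dict(dictionary)
```

Keeping the positions makes it possible to evaluate sentence-level and position-level conditions later without segmenting the corpus again.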

Optionally, determining the coupling index data of each text vocabulary item in the text dictionary according to the preset correlation criteria includes: if there is one preset correlation criterion, obtaining the relevance value of each text vocabulary item under that criterion as the coupling index data of the text vocabulary item; if there are multiple preset correlation criteria, obtaining the relevance value of each text vocabulary item under each preset correlation criterion, performing a mixing operation on all relevance values of each text vocabulary item, and using the result of the mixing operation as the coupling index data of the text vocabulary item, wherein the mixing operation includes at least one of weighted calculation, addition, and multiplication or division.

In the above embodiment, after word segmentation is performed on the network text crawled from the target data source to obtain multiple text vocabulary items, a text dictionary of the multiple text vocabulary items is created, and the coupling index data between each text vocabulary item in the text dictionary and the current related word can be determined according to the preset correlation criteria. If there is one preset correlation criterion, the relevance value of each text vocabulary item is calculated under that criterion and used as the coupling index data between the text vocabulary item and the current related word. If there are multiple preset correlation criteria, the relevance value of each text vocabulary item under each criterion is obtained, a mixing operation is performed on all relevance values of each text vocabulary item, and the result of the mixing operation is used as the coupling index data of the text vocabulary item. The coupling index data or the part-of-speech information of the multiple text vocabulary items is then screened according to the preset screening conditions to obtain the screened association vocabulary, which is used to update the related word set.

With the above embodiment, preset correlation criteria with different weights can be used to obtain the coupling index data between each text vocabulary item and the current related word, so that the coupling index data can be obtained flexibly.

Specifically, the mixing operation in the above embodiment may include at least one of weighted calculation, addition, and multiplication or division. For example, when the mixing operation includes weighted calculation and there are multiple preset correlation criteria, the condition weight of each preset correlation criterion can be obtained, the relevance value of each text vocabulary item is calculated under each preset correlation criterion, and a weighted calculation over each condition weight and its corresponding relevance value yields the coupling index data of each text vocabulary item.
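The weighted variant of the mixing operation can be illustrated as a weighted sum. The text does not specify the exact formula, so the weighted sum, the function name `fuse_weighted`, and the criterion names used as dictionary keys are all assumptions.

```python
def fuse_weighted(relevance_values, condition_weights):
    """Combine per-criterion relevance values into one coupling index value.

    Both arguments are dicts keyed by a (hypothetical) criterion name.
    The result is the sum of weight * value over all criteria.
    """
    if set(relevance_values) != set(condition_weights):
        raise ValueError("every criterion needs exactly one value and one weight")
    return sum(condition_weights[name] * value
               for name, value in relevance_values.items())
```

For instance, a word with relevance value 5 under one criterion (weight 0.5) and 2 under another (weight 2.0) would receive 0.5 * 5 + 2.0 * 2 = 6.5.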

Optionally, determining the coupling index data of each text vocabulary item in the text dictionary according to the preset correlation criteria may include: using the number of times each text vocabulary item meets the preset correlation criteria as the coupling index data of that text vocabulary item, wherein the preset correlation criteria include: the text vocabulary item and the related word appear together in the same sentence of the network text; and/or the text vocabulary item and the related word appear, with the same part of speech, at the same position in sentences of the network text.

In the above embodiment, the preset correlation criteria used to determine the coupling index data between each text vocabulary item in the text dictionary and the current related word may include: the number of times the text vocabulary item and the current related word appear together in the same sentence of the network text; or the number of times the text vocabulary item and the current related word appear, with the same part of speech, at the same position in sentences of the network text; or a combination of these two preset correlation criteria. With the above embodiment, the coupling index data between each text vocabulary item in the text dictionary and the current related word can be determined efficiently and accurately through these preset correlation criteria.

The same position in the above embodiment may specifically be: positions at the same distance from the same word in each sentence of the network text. For example, if a text vocabulary item (such as "tooth decay") lies within five words of the same current related word (such as "Coca-Cola") in different sentences, its positions in those sentences can be regarded as the same position. Alternatively, the same position may be: positions within the same word range in each sentence of the network text. For example, if the same text vocabulary item appears within the first five words of different sentences, it can be regarded as having the same position.
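The two readings of "same position" given above can be sketched as two small predicates over a tokenized sentence. The function names, the inclusive treatment of "within five words", and the default window size are assumptions for illustration.

```python
def within_distance(tokens, word, related, max_distance=5):
    """True if `word` occurs at most `max_distance` tokens away from `related`."""
    word_positions = [i for i, t in enumerate(tokens) if t == word]
    related_positions = [i for i, t in enumerate(tokens) if t == related]
    return any(abs(i - j) <= max_distance
               for i in word_positions for j in related_positions)


def in_leading_window(tokens, word, window=5):
    """True if `word` occurs among the first `window` tokens of the sentence."""
    return word in tokens[:window]
```

A word that satisfies the chosen predicate in several sentences would then be treated as occupying the same position in each of them.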

Specifically, when the relevance (the coupling index data above) between a dictionary word (each text vocabulary item above) and the name of the analysis object (the related word above) is calculated, the relevance between the text vocabulary items contained in the text dictionary and the name of the analysis object can be calculated under preset correlation criteria. The preset correlation criteria may include, but are not limited to, the following:

Preset correlation criterion 1: the dictionary word (each text vocabulary item above) and the name of the analysis object (the related word above) appear together in one sentence (or one paragraph, one article, etc.) of the network text.

For example, if the related word is Coca-Cola and the text vocabulary items in the dictionary include Sprite, the preset correlation criterion is that Sprite and Coca-Cola appear together in one sentence. The number of times Sprite and Coca-Cola appear together in the same sentence is counted and used as the coupling index data. If, among the sentences of the network text, Sprite and Coca-Cola appear together in the same sentence 5 times, the coupling index data between Sprite and Coca-Cola is 5.

Preset correlation criterion 2: the dictionary word (each text vocabulary item above) and the name of the analysis object (the related word above) appear, with the same part of speech, at the same position in sentences of the network text.

For example, suppose the related word is Coca-Cola and the text vocabulary items in the dictionary include Sprite. If "Coca-Cola is good" appears in the first sentence of the network text and "Sprite is bad" appears in the second sentence, then Sprite and Coca-Cola appear with the same part of speech (a noun) at the same position (the start of the sentence) in the network text. In this case, the number of times each such word (for example, Sprite) meets the above condition is counted.
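The two preset correlation criteria above can be sketched as counting functions. This is an illustrative reading, not the patent's exact algorithm: sentences are assumed to be already segmented and part-of-speech tagged as lists of `(word, tag)` pairs, "same position" is taken as the same token index, and the function names are hypothetical.

```python
def cooccurrence_count(sentences, candidate, related):
    """Criterion 1: number of sentences that contain both words."""
    return sum(
        1 for sentence in sentences
        if any(word == candidate for word, _ in sentence)
        and any(word == related for word, _ in sentence)
    )


def same_slot_count(sentences, candidate, related):
    """Criterion 2: occurrences of `candidate` at an (index, tag) slot
    where `related` also occurs somewhere in the corpus."""
    related_slots = {
        (index, tag)
        for sentence in sentences
        for index, (word, tag) in enumerate(sentence)
        if word == related
    }
    return sum(
        1
        for sentence in sentences
        for index, (word, tag) in enumerate(sentence)
        if word == candidate and (index, tag) in related_slots
    )
```

Either count can be used directly as the coupling index data, or both can be combined with weights when several criteria are used together.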

The coupling index data may be calculated using one of the above preset correlation criteria, or using several preset correlation criteria in combination, with different weights set to calculate the final relevance value (the coupling index data above). The relationship between the relevance value and the correlation is: the higher the relevance value, the stronger the association between the text vocabulary item and the related word.

Optionally, screening the coupling index data and/or the part-of-speech information of the multiple text vocabulary items according to the preset screening conditions to obtain the screened association vocabulary includes: using the text vocabulary items whose coupling index data falls within a preset range as the screened association vocabulary; or using the text vocabulary items whose coupling index data ranks in the top N among the coupling index data of the multiple text vocabulary items as the screened association vocabulary; or using the text vocabulary items whose lexical information indicates a preset part of speech as the screened association vocabulary.

In the above embodiment, after the web crawler crawls network text from the target data source based on the related words in the related word set of the object to be analyzed, word segmentation is performed on the network text to obtain multiple text vocabulary items, and the lexical information of each text vocabulary item is obtained. According to the preset screening conditions, the coupling index data of the multiple text vocabulary items is screened, or their part-of-speech information is screened, or both are screened. The screening may use the text vocabulary items whose coupling index data falls within a preset range as the screened association vocabulary, or the items whose coupling index data ranks in the top N, or the items whose lexical information indicates a preset part of speech. The related word set is then updated with the screened association vocabulary. With the above embodiment, different preset screening conditions can be set for screening the association vocabulary, which allows flexible and effective screening and can meet the different screening requirements of clients.

Specifically, the default screening conditions for determining the word bag vocabulary (i.e., the above related word set) may include, but are not limited to, the following conditions:

An optional first default screening condition is: all text vocabulary whose relevance value (i.e., the above coupling index data) lies within a certain interval (for example, the value of the coupling index data exceeds a certain threshold, or lies between two preset values).

An optional second default screening condition is: all text vocabulary whose relevance (i.e., the above coupling index data) ranks in the top N.

An optional third default screening condition is: text vocabulary of a specified part of speech.

The coupling index data of the multiple text vocabularies or the part-of-speech information of the multiple text vocabularies is screened according to the above default screening conditions. The default screening condition selected may be any one of the above conditions, or several default screening conditions may be used in combination, in which case the intersection of the association vocabulary screened out by each condition is taken as the related word set.
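The three screening conditions and their combination by intersection can be illustrated with a minimal Python sketch. The `WordInfo` record, the function names, the part-of-speech tags, and the sample scores are all hypothetical and chosen only for illustration; the application does not prescribe any particular data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordInfo:
    word: str     # a text vocabulary item
    score: float  # its coupling index data (hypothetical values below)
    pos: str      # its part-of-speech tag (hypothetical tag set)


def by_range(words, low, high):
    """First condition: coupling index data within [low, high]."""
    return {w.word for w in words if low <= w.score <= high}


def by_top_n(words, n):
    """Second condition: coupling index data ranked in the top N."""
    ranked = sorted(words, key=lambda w: w.score, reverse=True)
    return {w.word for w in ranked[:n]}


def by_pos(words, allowed):
    """Third condition: vocabulary of a specified part of speech."""
    return {w.word for w in words if w.pos in allowed}


def screen(words, *conditions):
    """Apply several conditions and keep the intersection of their results."""
    selected = {w.word for w in words}
    for condition in conditions:
        selected &= condition(words)
    return selected


words = [
    WordInfo("battery", 9.0, "n"),
    WordInfo("launch", 7.5, "v"),
    WordInfo("screen", 6.0, "n"),
    WordInfo("very", 8.0, "d"),
    WordInfo("price", 2.0, "n"),
]

result = screen(
    words,
    lambda ws: by_range(ws, 5.0, float("inf")),  # score above a threshold
    lambda ws: by_top_n(ws, 4),                  # top 4 by score
    lambda ws: by_pos(ws, {"n"}),                # nouns only
)
print(sorted(result))
```

Passing a single condition to `screen` corresponds to using one default screening condition alone; passing several corresponds to the combined case.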

In an optional embodiment, before the coupling index data of the multiple text vocabularies or the part-of-speech information of the multiple text vocabularies is screened according to the default screening conditions, the relevance values (i.e., the above coupling index data) between the dictionary vocabulary (i.e., each of the above text vocabularies) and the name of the analysis object (i.e., the above related word) may be sorted. Specifically, the relevance indices (i.e., the above coupling index data) obtained for the text vocabularies in the text dictionary under the default correlation criteria are sorted from high to low and serve as the content for subsequent screening.

Optionally, updating the related word set using the screened association vocabulary includes: replacing the related words with the screened association vocabulary to update the related word set; or adding the screened association vocabulary to the related word set to update the related word set.

Specifically, the screened association vocabulary is used as the word bag vocabulary to establish the word bag (i.e., the above related word set) for the object to be analyzed. In the next cycle of the above process, this word bag (i.e., the above related word set) can also replace the name of the analysis object (i.e., the above related word) when the relevance to the dictionary vocabulary (i.e., the above text vocabulary) is calculated, which greatly expands the word bag of the analysis object (i.e., the above related word set) and continuously improves the accuracy of the relevance (coupling index data) calculation.
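The two update modes and the cyclic reuse of the word bag can be sketched as follows. This is only an illustrative sketch: `update_related_words`, `accumulate`, and `toy_cycle` are hypothetical names, and `toy_cycle` is a stand-in for one full crawl, segmentation, and screening pass, which is not implemented here.

```python
def update_related_words(related, screened, mode="add"):
    """Update the related word set with the screened association vocabulary.

    mode="replace": the screened vocabulary replaces the related words.
    mode="add":     the screened vocabulary is added to the related word set.
    """
    if mode == "replace":
        return set(screened)
    if mode == "add":
        return set(related) | set(screened)
    raise ValueError(f"unknown update mode: {mode!r}")


def accumulate(seed_words, run_cycle, cycles, mode="add"):
    """Feed the word bag of each cycle into the next cycle."""
    related = set(seed_words)
    for _ in range(cycles):
        screened = run_cycle(related)  # crawl + segment + screen (stand-in)
        related = update_related_words(related, screened, mode)
    return related


# Hypothetical stand-in for one cycle: each related word "yields" fixed
# association vocabulary, as if it had been crawled and screened.
toy_associations = {
    "phoneX": {"battery", "screen"},
    "battery": {"charger"},
    "screen": {"glass"},
}


def toy_cycle(related):
    found = set()
    for word in related:
        found |= toy_associations.get(word, set())
    return found


bag_added = accumulate({"phoneX"}, toy_cycle, cycles=2, mode="add")
bag_replaced = accumulate({"phoneX"}, toy_cycle, cycles=2, mode="replace")
print(sorted(bag_added), sorted(bag_replaced))
```

In "add" mode the word bag only grows from cycle to cycle, whereas in "replace" mode each cycle starts from the vocabulary screened in the previous one.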

In an optional embodiment, as shown in Fig. 2, the processing method of the related word set may specifically include the following steps:

Step S202, determine the related words in the related word set of the object to be analyzed.

Specifically, the object to be analyzed is determined, and the name of the object to be analyzed may be used as the initial word bag content (i.e., the current related word in the related word set).

Step S203, crawl network text and build an initial corpus.

Specifically, network text may be crawled from the target data source based on the current related words in the related word set of the object to be analyzed, where the target data source may include websites, forums, post bars, and the like.

Step S204, segment the network text and build a text dictionary.

Specifically, the network text may be segmented to obtain multiple text vocabularies, and the lexical information of each text vocabulary is obtained, where the lexical information includes the coupling index data between each text vocabulary and the current related word and/or the part-of-speech information of each text vocabulary. A text dictionary containing all the text vocabularies in the network text is then built.

Step S205, calculate the coupling index data between each text vocabulary in the text dictionary and the related word.

Specifically, the coupling index data of the multiple text vocabularies or the part-of-speech information of the multiple text vocabularies may be screened according to the default screening conditions to obtain the screened association vocabulary.

Step S206, sort the coupling index data of each text vocabulary in the text dictionary.

Specifically, the calculated values of the coupling index data of the text vocabularies in the text dictionary may be sorted from high to low to facilitate the subsequent screening process.

Optionally, when the relevance (i.e., the above coupling index data) between the dictionary vocabulary (i.e., each of the above text vocabularies) and the name of the analysis object (i.e., the above related word) is calculated, the relevance between the text vocabularies contained in the text dictionary and the name of the analysis object may be calculated under default correlation criteria, which may include, but are not limited to:

The number of times the text vocabulary appears together with the name of the analysis object (i.e., the above related word) within one sentence (or one passage, one article, etc.) of the network text.

The number of times the text vocabulary appears in the network text with the same part of speech and at the same sentence position as the name of the analysis object (i.e., the above related word).

The coupling index data may be calculated under any one of the above default correlation criteria, or under a combination of several default correlation criteria with different weights assigned, to yield the final relevance value (i.e., the above coupling index data). The higher the relevance value, the stronger the association between the text vocabulary and the current related word.
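The two correlation criteria above and their weighted combination can be sketched in Python. This is one possible reading, not the definitive calculation: sentences are assumed to be already segmented into (word, part-of-speech) pairs, the "same position" criterion is interpreted as sharing a (sentence index, part-of-speech) slot with some occurrence of the related word, and the weights `w_cooc` and `w_slot` are arbitrary illustrative values.

```python
from collections import Counter


def cooccurrence_counts(sentences, related):
    """Criterion 1: number of sentences in which a word appears
    together with the related word."""
    counts = Counter()
    for sentence in sentences:
        tokens = [tok for tok, _ in sentence]
        if related in tokens:
            counts.update(set(tokens) - {related})
    return counts


def same_slot_counts(sentences, related):
    """Criterion 2 (assumed reading): number of times a word occupies a
    (position, part-of-speech) slot that the related word also occupies
    somewhere in the corpus."""
    slots = {
        (i, pos)
        for sentence in sentences
        for i, (tok, pos) in enumerate(sentence)
        if tok == related
    }
    counts = Counter()
    for sentence in sentences:
        for i, (tok, pos) in enumerate(sentence):
            if tok != related and (i, pos) in slots:
                counts[tok] += 1
    return counts


def coupling_index(sentences, related, w_cooc=1.0, w_slot=0.5):
    """Weighted combination of the two criteria (weights are assumptions)."""
    cooc = cooccurrence_counts(sentences, related)
    slot = same_slot_counts(sentences, related)
    return {
        word: w_cooc * cooc[word] + w_slot * slot[word]
        for word in set(cooc) | set(slot)
    }


sentences = [
    [("phoneX", "n"), ("battery", "n"), ("lasts", "v")],
    [("phoneY", "n"), ("screen", "n"), ("cracks", "v")],
    [("battery", "n"), ("phoneX", "n"), ("drains", "v")],
]
scores = coupling_index(sentences, "phoneX")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, score)
```

Setting one of the weights to zero reduces the calculation to a single criterion, which matches the case where only one default correlation criterion is chosen.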

Step S207, set the default screening conditions and screen the text vocabularies in the text dictionary.

Step S208, establish the related word set.

Specifically, the related word set may be updated using the screened association vocabulary.

Compared with existing word bag accumulation methods, the word bag accumulation method of the above embodiments of the present application has the following advantages: the vocabulary in the related word set grows quickly, and word bag accumulation efficiency is markedly improved; whether a real association exists between the word bag vocabulary (i.e., the association vocabulary) and the analysis object (i.e., the related word) can be measured quantitatively; the default correlation criteria used to calculate the relevance between the word bag vocabulary (i.e., the association vocabulary) and the analysis object (i.e., the related word) can be set flexibly and can be calculated as a combination of conditions; screening can be performed after sorting by the value of the coupling index data, so the default screening conditions can be set flexibly and applied in combination; and the word bag accumulation process can be run cyclically, with the word bag output by the previous cycle (i.e., the related word set) replacing the name of the analysis object (the related word) in the current cycle, so that the accumulation flow is iterated, the word bag content (i.e., the content of the related word set) is continuously expanded, its accuracy is improved, and its coverage is widened.

Embodiment 2

According to an embodiment of the present application, an embodiment of a processing device for a related word set is also provided. As shown in Fig. 3, the processing device includes: a crawling unit 10, a processing unit 30, a screening unit 50, and an updating unit 70.

The crawling unit 10 is configured to crawl network text from the target data source based on the related words in the related word set of the object to be analyzed.

The processing unit 30 is configured to segment the network text to obtain multiple text vocabularies and to obtain the lexical information of each text vocabulary, where the lexical information includes the coupling index data of each text vocabulary and/or the part-of-speech information of each text vocabulary, and the coupling index data indicates the degree of association between each text vocabulary and the related word.

The screening unit 50 is configured to screen the coupling index data of the multiple text vocabularies and/or the part-of-speech information of the multiple text vocabularies according to the default screening conditions to obtain the screened association vocabulary.

The updating unit 70 is configured to update the related word set using the screened association vocabulary.
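How the four units cooperate can be sketched as a small Python class in which each unit is a pluggable callable. The class name `RelatedWordSetDevice` and all `toy_*` functions are hypothetical; in particular, the toy crawler reads from an in-memory list instead of a real data source, and whitespace splitting stands in for a real word segmenter.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class RelatedWordSetDevice:
    crawl: Callable[[Set[str]], List[str]]                      # unit 10
    process: Callable[[List[str], Set[str]], Dict[str, float]]  # unit 30
    screen: Callable[[Dict[str, float]], Set[str]]              # unit 50
    update: Callable[[Set[str], Set[str]], Set[str]]            # unit 70

    def run_once(self, related: Set[str]) -> Set[str]:
        texts = self.crawl(related)          # crawl network text
        info = self.process(texts, related)  # segment and score vocabulary
        screened = self.screen(info)         # apply screening conditions
        return self.update(related, screened)


# Toy stand-ins, assumptions for illustration only.
CORPUS = [
    "phoneX battery great",
    "phoneX battery weak",
    "phoneX screen",
    "weather today",
]


def toy_crawl(related):
    # Keep only texts that mention at least one related word.
    return [t for t in CORPUS if any(tok in related for tok in t.split())]


def toy_process(texts, related):
    # Whitespace split as segmentation; occurrence count as coupling index.
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in text.split() if tok not in related)
    return dict(counts)


def toy_screen(info):
    # Keep vocabulary whose coupling index reaches a threshold of 2.
    return {word for word, score in info.items() if score >= 2}


def toy_update(related, screened):
    # "Add" update mode: union of old and screened vocabulary.
    return related | screened


device = RelatedWordSetDevice(toy_crawl, toy_process, toy_screen, toy_update)
updated = device.run_once({"phoneX"})
print(sorted(updated))
```

Keeping the units as separate callables mirrors the logical division of the device, so that any one of them can be swapped without touching the others.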

Optionally, the processing unit includes: a creation module and a determining module.

The creation module is configured to create a text dictionary of the multiple text vocabularies after the network text is segmented to obtain them; the determining module is configured to determine the coupling index data of each text vocabulary in the text dictionary according to the default correlation criteria, and/or to extract the part-of-speech information of each text vocabulary in the text dictionary.

With the embodiment of the present application, after the web crawler crawls network text from the target data source based on the current related words in the related word set of the object to be analyzed, the network text is segmented to obtain multiple text vocabularies, the lexical information of each text vocabulary is obtained, and the coupling index data or the part-of-speech information of the multiple text vocabularies is screened according to the default screening conditions; once the screened association vocabulary is obtained, it is used to update the related word set. With this embodiment, indiscriminately crawled network text can be segmented and screened, and the resulting association vocabulary used to update the related word set; by repeating the segmentation and screening, the related word set is continuously expanded and updated, which solves the problem that existing word bag accumulation methods yield too little vocabulary and achieves the effect of improving the related word set of the object to be analyzed.

Optionally, the determining module includes: a first calculating submodule and a second calculating submodule.

The first calculating submodule is configured to, if there is one default correlation criterion, obtain the relevance value of each text vocabulary under that criterion as the coupling index data of each text vocabulary; the second calculating submodule is configured to, if there are multiple default correlation criteria, obtain the relevance value of each text vocabulary under each default correlation criterion, perform a fusion operation on all the relevance values of each text vocabulary, and take the fusion result as the coupling index data of each text vocabulary, where the fusion operation includes at least one of weighted calculation, summation, and multiplication/division calculation.
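The single-criterion and multi-criterion cases, with the mixing (fusion) operation applied in the latter, can be sketched as below. The function names and the `method` keyword are hypothetical; only weighted calculation, summation, and multiplication are shown, and division is omitted for brevity.

```python
import math


def fuse(values, method="weighted", weights=None):
    """Fuse several relevance values of one text vocabulary item."""
    if method == "weighted":
        if weights is None or len(weights) != len(values):
            raise ValueError("one weight per relevance value is required")
        return sum(w * v for w, v in zip(weights, values))
    if method == "sum":
        return sum(values)
    if method == "product":
        return math.prod(values)
    raise ValueError(f"unknown fusion method: {method!r}")


def coupling_index_data(values, method="weighted", weights=None):
    """One criterion: use its relevance value directly.
    Several criteria: fuse all relevance values."""
    if len(values) == 1:
        return values[0]
    return fuse(values, method, weights)


single = coupling_index_data([4])
weighted = coupling_index_data([2, 3], "weighted", [0.5, 2.0])
summed = coupling_index_data([2, 3], "sum")
multiplied = coupling_index_data([2, 3], "product")
print(single, weighted, summed, multiplied)
```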

In the above embodiment, after the network text crawled from the target data source is segmented to obtain multiple text vocabularies, a text dictionary of the multiple text vocabularies is created, and the coupling index data between each text vocabulary in the text dictionary and the current related word can be determined according to the default correlation criteria. If there is one default correlation criterion, the relevance value of each text vocabulary is calculated under that criterion to obtain the coupling index data between each text vocabulary and the current related word; if there are multiple default correlation criteria, the relevance value of each text vocabulary under each criterion is obtained, a fusion operation is performed on all the relevance values of each text vocabulary, and the fusion result is taken as the coupling index data of each text vocabulary. The coupling index data or the part-of-speech information of the multiple text vocabularies is then screened according to the default screening conditions to obtain the screened association vocabulary, which is used to update the related word set. With this embodiment, default correlation criteria with different weights can be used to obtain the coupling index data between each text vocabulary and the current related word, so that the coupling index data can be obtained flexibly.

Optionally, the determining module may include: a determining submodule, configured to take the number of times each text vocabulary satisfies the default correlation criteria as the coupling index data of each text vocabulary, where the default correlation criteria include: each text vocabulary and the related word appear together in the same sentence of the network text; and/or each text vocabulary and the related word appear in the network text with the same part of speech and at the same position in a sentence of the network text.

Optionally, the screening unit may include: a first screening module, a second screening module, and a third screening module. The first screening module is configured to take the text vocabulary whose coupling index data falls within a preset range as the screened association vocabulary; or the second screening module is configured to take the text vocabulary whose coupling index data ranks in the top N among the coupling index data of the multiple text vocabularies as the screened association vocabulary; or the third screening module is configured to take the text vocabulary whose lexical information indicates a preset part of speech as the screened association vocabulary.

In the above embodiment, after the web crawler crawls network text from the target data source based on the current related words in the related word set of the object to be analyzed, the network text is segmented to obtain multiple text vocabularies, and the lexical information of each text vocabulary is obtained. According to the default screening conditions, the coupling index data of the multiple text vocabularies is screened, or the part-of-speech information of the multiple text vocabularies is screened, or both are screened. The screening may take the text vocabulary whose coupling index data falls within a preset range as the screened association vocabulary, or the text vocabulary whose coupling index data ranks in the top N among the coupling index data of the multiple text vocabularies, or the text vocabulary whose lexical information indicates a preset part of speech. The related word set is then updated using the screened association vocabulary. With this embodiment, different default screening conditions can be set for screening the association vocabulary, so that screening is flexible and effective and can meet the differing screening requirements of clients.

Optionally, the updating unit includes: a first update module and a second update module.

The first update module is configured to replace the related words with the screened association vocabulary to update the related word set; or the second update module is configured to add the screened association vocabulary to the related word set to update the related word set.

The processing device for the related word set includes a processor and a memory. The crawling unit 10, the processing unit 30, the screening unit 50, the updating unit 70, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, which retrieves the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, the indiscriminately crawled network text is segmented and screened, the screened association vocabulary is used to update the related word set, and segmentation and screening are repeated so that the related word set is continuously expanded and updated, which solves the problem that existing word bag accumulation methods yield too little vocabulary and achieves the effect of improving the related word set of the object to be analyzed.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: crawling network text from the target data source based on the related words in the related word set of the object to be analyzed; segmenting the network text to obtain multiple text vocabularies, and obtaining the lexical information of each text vocabulary, where the lexical information includes the coupling index data between each text vocabulary and the related word and/or the part-of-speech information of each text vocabulary; screening the coupling index data of the multiple text vocabularies or the part-of-speech information of the multiple text vocabularies according to the default screening conditions to obtain the screened association vocabulary; and updating the related word set using the screened association vocabulary.

The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the present application, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements shall also be regarded as falling within the protection scope of the present application.

Claims

1. A processing method for a related word set, characterized by comprising:

crawling network text from a target data source based on related words in the related word set of an object to be analyzed;

segmenting the network text to obtain multiple text vocabularies, and obtaining lexical information of each text vocabulary, wherein the lexical information includes coupling index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the coupling index data indicates the degree of association between each text vocabulary and the related word;

screening the coupling index data of the multiple text vocabularies and/or the part-of-speech information of the multiple text vocabularies according to default screening conditions to obtain screened association vocabulary;

updating the related word set using the screened association vocabulary.

2. The processing method according to claim 1, characterized in that segmenting the network text to obtain multiple text vocabularies and obtaining the lexical information of each text vocabulary comprises:

after segmenting the network text to obtain the multiple text vocabularies, creating a text dictionary of the multiple text vocabularies;

determining the coupling index data of each text vocabulary in the text dictionary according to default correlation criteria, and/or extracting the part-of-speech information of each text vocabulary in the text dictionary.

3. The processing method according to claim 2, characterized in that determining the coupling index data of each text vocabulary in the text dictionary according to the default correlation criteria comprises:

if there is one default correlation criterion, obtaining the relevance value of each text vocabulary under the default correlation criterion as the coupling index data of each text vocabulary;

if there are multiple default correlation criteria, obtaining the relevance value of each text vocabulary under each default correlation criterion, performing a fusion operation on all the relevance values of each text vocabulary, and taking the fusion result as the coupling index data of each text vocabulary, wherein the fusion operation includes at least one of weighted calculation, summation, and multiplication/division calculation.

4. The processing method according to claim 2, characterized in that determining the coupling index data of each text vocabulary in the text dictionary according to the default correlation criteria comprises:

taking the number of times each text vocabulary satisfies the default correlation criteria as the coupling index data of each text vocabulary,

wherein the default correlation criteria include: each text vocabulary and the related word appear together in the same sentence of the network text; and/or each text vocabulary and the related word appear in the network text with the same part of speech and at the same position in a sentence of the network text.

5. The processing method according to claim 1, characterized in that screening the coupling index data of the multiple text vocabularies and/or the part-of-speech information of the multiple text vocabularies according to the default screening conditions to obtain the screened association vocabulary comprises:

taking the text vocabulary whose coupling index data falls within a preset range as the screened association vocabulary; or

taking the text vocabulary whose coupling index data ranks in the top N among the coupling index data of the multiple text vocabularies as the screened association vocabulary; or

taking the text vocabulary whose lexical information indicates a preset part of speech as the screened association vocabulary.

6. The processing method according to any one of claims 1 to 5, characterized in that updating the related word set using the screened association vocabulary comprises:

replacing the related words with the screened association vocabulary to update the related word set; or

adding the screened association vocabulary to the related word set to update the related word set.

7. A processing device for a related word set, characterized by comprising:

a crawling unit, configured to crawl network text from a target data source based on related words in the related word set of an object to be analyzed;

a processing unit, configured to segment the network text to obtain multiple text vocabularies and to obtain lexical information of each text vocabulary, wherein the lexical information includes coupling index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the coupling index data indicates the degree of association between each text vocabulary and the related word;

a screening unit, configured to screen the coupling index data of the multiple text vocabularies and/or the part-of-speech information of the multiple text vocabularies according to default screening conditions to obtain screened association vocabulary;

an updating unit, configured to update the related word set using the screened association vocabulary.

8. The processing device according to claim 7, characterized in that the processing unit comprises:

a creation module, configured to create a text dictionary of the multiple text vocabularies after the network text is segmented to obtain the multiple text vocabularies;

a determining module, configured to determine the coupling index data of each text vocabulary in the text dictionary according to default correlation criteria, and/or to extract the part-of-speech information of each text vocabulary in the text dictionary.

9. The processing device according to claim 8, characterized in that the determining module comprises:

a first calculating submodule, configured to, if there is one default correlation criterion, obtain the relevance value of each text vocabulary under the default correlation criterion as the coupling index data of each text vocabulary;

a second calculating submodule, configured to, if there are multiple default correlation criteria, obtain the relevance value of each text vocabulary under each default correlation criterion, perform a fusion operation on all the relevance values of each text vocabulary, and take the fusion result as the coupling index data of each text vocabulary, wherein the fusion operation includes at least one of weighted calculation, summation, and multiplication/division calculation.

10. The processing device according to claim 8, characterized in that the determining module comprises:

a determining submodule, configured to take the number of times each text vocabulary satisfies the default correlation criteria as the coupling index data of each text vocabulary,