CN106649334B

CN106649334B - Processing method and device of associated word set

Info

Publication number: CN106649334B
Application number: CN201510726038.1A
Authority: CN
Inventors: 梁梦溪; 何鑫
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2015-10-29
Filing date: 2015-10-29
Publication date: 2020-09-15
Anticipated expiration: 2035-10-29
Also published as: CN106649334A

Abstract

The application discloses a method and a device for processing an associated word set. The processing method comprises the following steps: crawling web text from a target data source based on associated words in an associated word set of an object to be analyzed; segmenting the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises association index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the association index data indicates the degree of association between each text vocabulary and the associated words; screening the association index data and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies; and updating the associated word set by using the screened associated vocabularies. The method and the device solve the technical problem that existing word package accumulation methods yield a small vocabulary.

Description

Processing method and device of associated word set

Technical Field

The application relates to the field of the internet, and in particular to a method and a device for processing an associated word set.

Background

When an enterprise releases a product or launches a service, when a government issues a policy, or when a sudden event attracts public attention, related content such as online media news reports appears on the internet, and this news in turn draws attention and discussion from netizens. When internet public opinion content (namely, internet text related to an object) is collected for a certain analysis object (such as a current event, a product, a person, or a policy), a web crawler that crawls internet text does not distinguish during crawling whether the content is related to the analysis object. The content therefore needs to be screened after crawling so that the content related to the object to be analyzed is retained.

Generally, in the process of screening and filtering web text, whether a piece of web text is related to the object to be analyzed is judged by setting certain judgment conditions. A set of content related to the object to be analyzed is used as a "word package", and the content in the "word package" is used in place of the object to be analyzed to screen and filter the web text. The process of building up this set may be referred to as word package accumulation.

The existing basic method for word package accumulation is manual input based on human association, and mostly adopts the following vocabulary combinations: the name of the object to be analyzed is used as the word package; the combination of the name of the object to be analyzed and its synonyms is used as the word package; or the combination of the name of the object to be analyzed and competing-product words is used as the word package. The existing word package accumulation method therefore has the following disadvantages: the vocabulary is small; how closely a word is related to the analysis object cannot be measured quantitatively; manual accumulation of vocabulary takes a long time and is inefficient; and scalability is poor.

No effective solution has yet been proposed for the problem that existing word package accumulation methods yield a small vocabulary.

Disclosure of Invention

The embodiments of the application provide a method and a device for processing an associated word set, which at least solve the technical problem that existing word package accumulation methods yield a small vocabulary.

According to an aspect of an embodiment of the present application, there is provided a method for processing an associated word set, the processing method including: crawling web text from a target data source based on associated words in an associated word set of an object to be analyzed; segmenting the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises association index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the association index data indicates the degree of association between each text vocabulary and the associated words; screening the association index data and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies; and updating the associated word set by using the screened associated vocabularies.

Further, performing word segmentation on the web text to obtain a plurality of text words, and acquiring word information of each text word includes: after the network text is subjected to word segmentation to obtain a plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies; and determining the association index data of each text vocabulary in the text dictionary according to a preset association condition, and/or extracting the part of speech information of each text vocabulary in the text dictionary.

Further, determining the association index data of each text vocabulary in the text dictionary according to the preset association condition comprises: if the preset association condition is one, acquiring an association numerical value of each text vocabulary corresponding to the preset association condition to obtain association index data of each text vocabulary; if the preset association condition is multiple, acquiring an association numerical value of each text vocabulary corresponding to each preset association condition, performing fusion operation on all the association numerical values of each text vocabulary, and taking a fusion result as association index data of each text vocabulary, wherein the fusion operation comprises at least one of weighted calculation, addition calculation and multiplication-division calculation.

Further, determining the association index data of each text vocabulary in the text dictionary according to the preset association condition comprises: taking the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary, wherein the preset association condition comprises: the text vocabulary and the associated word appear together in the same sentence of the web text; and/or the text vocabulary and the associated word appear with the same part of speech at the same position in sentences of the web text.

Further, screening the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain the screened associated vocabularies comprises: taking the text vocabularies whose association index data are within a preset range as the screened associated vocabularies; or taking the text vocabularies whose association index data rank in the top N among the plurality of text vocabularies as the screened associated vocabularies; or taking the text vocabularies whose vocabulary information indicates a preset part of speech as the screened associated vocabularies.

Further, updating the set of associated terms using the filtered out associated vocabulary includes: replacing the associated words by using the screened associated words to update the associated word set; or adding the screened associated words into the associated word set to update the associated word set.

According to another aspect of the embodiments of the present application, there is also provided a processing apparatus for associating word sets, the processing apparatus including: the crawling unit is used for crawling the network text from the target data source based on the associated words in the associated word set of the object to be analyzed; the processing unit is used for segmenting the network text to obtain a plurality of text vocabularies and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises associated index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the associated index data is used for indicating the association degree of each text vocabulary and the associated words; the screening unit is used for screening the associated index data of the text vocabularies and/or the part of speech information of the text vocabularies according to preset screening conditions to obtain screened associated vocabularies; and the updating unit is used for updating the associated word set by using the screened associated vocabulary.

Further, the processing unit includes: the creating module is used for creating a text dictionary of a plurality of text vocabularies after the network text is subjected to word segmentation to obtain a plurality of text vocabularies; and the determining module is used for determining the association index data of each text vocabulary in the text dictionary according to the preset association condition and/or extracting the part of speech information of each text vocabulary in the text dictionary.

Further, the determining module includes: the first calculation submodule is used for acquiring the relevance value of each text vocabulary corresponding to the preset relevance condition if the preset relevance condition is one, and obtaining the relevance index data of each text vocabulary; and the second calculation submodule is used for acquiring the relevance numerical value of each text vocabulary corresponding to each preset relevance condition if the preset relevance condition is multiple, performing fusion operation on all the relevance numerical values of each text vocabulary, and taking the fusion result as the relevance index data of each text vocabulary, wherein the fusion operation comprises at least one of weighting calculation, addition calculation and multiplication-division calculation.

Further, the determining module includes: a determining submodule, used for taking the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary, wherein the preset association condition comprises: the text vocabulary and the associated word appear together in the same sentence of the web text; and/or the text vocabulary and the associated word appear with the same part of speech at the same position in sentences of the web text.

In the embodiments of the application, after a web crawler crawls web text from a target data source based on the associated words in the associated word set of an object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies, and vocabulary information of each text vocabulary is obtained. The association index data or the part-of-speech information of the text vocabularies is screened according to preset screening conditions, and the associated word set is updated by using the screened associated vocabularies. Through this embodiment, web text obtained by indiscriminate crawling can be segmented and screened to obtain associated vocabularies with which the associated word set is updated. Segmentation and screening are then repeated so that the associated word set is continuously expanded and updated, which solves the problem that existing word package accumulation methods yield a small vocabulary and achieves the effect of perfecting the associated word set of the object to be analyzed.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flow diagram of a method of processing a set of related words according to an embodiment of the present application;

FIG. 2 is a flow diagram of another alternative method of processing associated word sets in accordance with an embodiment of the present application; and

FIG. 3 is a schematic diagram of a processing device for associating word sets according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Explanation of terms:

Analysis object: the object whose public opinion content is to be analyzed based on web text content. It may be a current event, a product, a person, a policy, etc.

Corpus: web text crawled by crawlers.

Dictionary vocabulary: the word library in which, after word segmentation, the text in the corpus is stored as individual words together with the relationships among the words.

Relevance: the degree of closeness between multiple objects (words).

Screening logic: the conditional algorithm used to filter vocabulary.

Word package: a set of words used in place of the analysis object to screen the web text in the corpus and filter out the content related to the analysis object.

Example 1

In accordance with an embodiment of the present application, an embodiment of a method for processing an associated word set is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.

Fig. 1 is a flowchart of a method for processing an associated word set according to an embodiment of the present application. As shown in fig. 1, the processing method includes the following steps:

Step S102, crawling web text from the target data source based on the associated words in the associated word set of the object to be analyzed.

Step S104, performing word segmentation on the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises association index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the association index data indicates the degree of association between each text vocabulary and the associated word.

Step S106, screening the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to preset screening conditions to obtain screened associated vocabularies.

Step S108, updating the associated word set by using the screened associated vocabularies.

With this method, after the web crawler crawls web text from the target data source based on the current associated words in the associated word set of the object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies, and vocabulary information of each text vocabulary is obtained. The association index data or the part-of-speech information of the text vocabularies is screened according to preset screening conditions, and the associated word set is updated by using the screened associated vocabularies.

Through this embodiment, web text obtained by indiscriminate crawling can be segmented and screened to obtain associated vocabularies with which the associated word set is updated. Segmentation and screening are then repeated so that the associated word set is continuously expanded and updated, which solves the problem that existing word package accumulation methods yield a small vocabulary and achieves the effect of perfecting the associated word set of the object to be analyzed.

In the above embodiment, the initial corpus may be established from a large amount of web text crawled indiscriminately. After the web text in the initial corpus is segmented, the association between each dictionary vocabulary obtained by segmentation (namely, each text vocabulary) and the name of the analysis object (namely, the associated word) is measured by a chosen method, and the dictionary vocabularies that meet the conditions (namely, the text vocabularies that meet the preset screening conditions, that is, the screened associated vocabularies) are selected through reasonable vocabulary screening logic to form a word package. By repeating these steps, the word package can be continuously expanded, and the word package content for the analysis object (namely, the associated word set) is perfected.
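The repeated crawl, segment, screen, and update cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `expand_word_package`, the stubbed `fetch_texts` callback, the English regex tokenizer, the small stop list, and the `top_n` and `min_count` thresholds are all hypothetical and do not come from the application, which does not prescribe a particular implementation.

```python
import re
from collections import Counter

STOPWORDS = {"and", "are", "with", "is"}  # hypothetical stop list for the demo

def expand_word_package(seed_words, fetch_texts, rounds=2, top_n=1, min_count=2):
    """Repeat crawl -> segment -> score -> screen -> update for a few rounds."""
    package = set(seed_words)
    for _ in range(rounds):
        scores = Counter()
        for text in fetch_texts(package):                       # S102: crawl (stubbed)
            for sentence in re.split(r"[.!?]+", text.lower()):
                # S104: naive segmentation; real Chinese text needs a segmenter
                words = set(re.findall(r"[a-z]+", sentence)) - STOPWORDS
                if words & package:                             # sentence mentions the package
                    scores.update(words - package)              # count co-occurring words
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        selected = [w for w, c in ranked if c >= min_count][:top_n]  # S106: screen
        package.update(selected)                                # S108: update the set
    return package

CORPUS = [
    "Cola and sprite are sold together. Sprite competes with cola.",
    "Sprite and fanta share shelves. Fanta outsells sprite.",
]
result = expand_word_package({"cola"}, lambda package: CORPUS)
print(sorted(result))
```

In this toy corpus the first round adds "sprite" because it co-occurs with "cola" twice, and the second round adds "fanta" because it co-occurs with the newly added "sprite", which mirrors how repeating the steps lets the word package grow.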

Specifically, indiscriminate crawling may refer to crawling all updated content of a website within a period of time without setting a specific keyword. For example, if crawling is performed once a day, all articles, comments, and other content newly added to the website during the previous day are crawled, and content that has already been crawled is not crawled again.
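One way to picture the "crawl once a day, never re-crawl" behavior is a filter that keeps only items published on the target day whose identifiers have not been seen before. The sketch below assumes each item carries a hypothetical `id` and `published` field; the application does not specify how crawled content is tracked.

```python
from datetime import date

def select_new_items(items, seen_ids, day):
    """Return items published on `day` that have not been crawled yet."""
    fresh = [it for it in items
             if it["published"] == day and it["id"] not in seen_ids]
    seen_ids.update(it["id"] for it in fresh)  # remember them so they are skipped next time
    return fresh

site_items = [
    {"id": "post-1", "published": date(2015, 10, 27), "text": "older article"},
    {"id": "post-2", "published": date(2015, 10, 28), "text": "new article"},
    {"id": "post-3", "published": date(2015, 10, 28), "text": "new comment"},
]
seen = set()
first_run = select_new_items(site_items, seen, date(2015, 10, 28))
second_run = select_new_items(site_items, seen, date(2015, 10, 28))
```

No keyword is involved in the selection, which is what makes the crawl indiscriminate; the second call returns nothing because both items were already recorded.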

Optionally, before crawling the web text from the target data source based on the associated word in the associated word set of the object to be analyzed, the name of the analysis object (i.e. the associated word in the associated word set of the object to be analyzed) may be determined, and specifically, the name of the object to be analyzed may be determined as the initial content of the word package.

In an alternative embodiment, after crawling web text, an initial corpus may be established. For a determined object to be analyzed (i.e., an associated word in the associated word set of the object to be analyzed), a certain amount of text content (i.e., the web text described above) is crawled indiscriminately from its target data source (e.g., a website, a forum, a post, etc.) as an initial corpus for the object to be analyzed. The larger the amount of text contained in the initial corpus is, the more advantageous the accuracy of the following relevance calculation is.

Optionally, the segmenting the web text to obtain a plurality of text vocabularies, and the obtaining vocabulary information of each text vocabulary includes: after the network text is subjected to word segmentation to obtain a plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies; and determining the association index data of each text vocabulary in the text dictionary according to a preset association condition, and/or extracting the part of speech information of each text vocabulary in the text dictionary.

In the above embodiment, after the web text crawled from the target data source is segmented to obtain a plurality of text vocabularies, a text dictionary of the plurality of text vocabularies is created. The association index data between each text vocabulary in the text dictionary and the current associated word is then determined according to a preset association condition, optionally while the part-of-speech information of each text vocabulary in the text dictionary is also extracted. The association index data or the part-of-speech information of the plurality of text vocabularies is then screened according to a preset screening condition to obtain the screened associated vocabularies, and the associated word set is updated by using the screened associated vocabularies.

Through this embodiment, the vocabulary information of the text vocabularies can be recorded by creating the text dictionary after word segmentation, which facilitates extracting the vocabulary information and makes it possible to acquire information and accumulate the word package quickly and accurately.

Specifically, the crawled web texts may be used as an initial corpus, and the text content (i.e., the web texts) in the initial corpus is then segmented to construct a dictionary (i.e., the text dictionary) containing all words (i.e., the text vocabularies) in the text (i.e., the web texts).
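As a minimal sketch of building such a text dictionary, the snippet below segments each text and records every distinct word with its frequency. The regex tokenizer is a stand-in that only suits space-delimited text, and storing a plain frequency count is an assumption made for illustration; a dedicated word segmenter would be needed for Chinese text, and the application does not fix the dictionary's storage format.

```python
import re
from collections import Counter

def build_text_dictionary(corpus):
    """Segment each text and record every distinct word with its frequency."""
    dictionary = Counter()
    for text in corpus:
        # keep hyphenated names such as "coca-cola" as a single token
        dictionary.update(re.findall(r"[\w-]+", text.lower()))
    return dictionary

text_dictionary = build_text_dictionary(["Coca-Cola and Sprite.", "Sprite is sweet."])
print(text_dictionary.most_common(2))
```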

Optionally, determining association index data of each text vocabulary in the text dictionary according to a preset association condition includes: if the preset association condition is one, acquiring an association numerical value of each text vocabulary corresponding to the preset association condition to obtain association index data of each text vocabulary; if the preset association condition is multiple, acquiring an association numerical value of each text vocabulary corresponding to each preset association condition, performing fusion operation on all the association numerical values of each text vocabulary, and taking a fusion result as association index data of each text vocabulary, wherein the fusion operation comprises at least one of weighted calculation, addition calculation and multiplication-division calculation.

In the above embodiment, after the web text crawled from the target data source is segmented to obtain a plurality of text vocabularies, a text dictionary of the plurality of text vocabularies is created, and the association index data between each text vocabulary in the text dictionary and the current associated word is determined according to the preset association condition. If there is one preset association condition, the association value of each text vocabulary is calculated according to that condition to obtain the association index data between each text vocabulary and the current associated word. If there are multiple preset association conditions, the association value of each text vocabulary under each preset association condition is acquired, all the association values of each text vocabulary are fused, and the fused result is taken as the association index data of that text vocabulary. The association index data or the part-of-speech information of the text vocabularies is then screened according to the preset screening condition to obtain the screened associated vocabularies, and the associated word set is updated by using the screened associated vocabularies.

Through the embodiment, the association index data of each text vocabulary and the current association terms can be acquired by adopting the preset association conditions with different weights, so that the effect of flexibly acquiring the association index data can be achieved.

Specifically, the fusion operation may include at least one of a weighting calculation, an addition calculation, and a multiplication-division calculation in the above-described embodiment. For example, when the fusion operation includes weighting calculation, that is, if the preset association condition is multiple, the condition weight of the preset association condition may be obtained, the association value of each text vocabulary is calculated through each preset association condition, and the association index data of each text vocabulary is obtained by performing weighting calculation on each condition weight and the corresponding association value.
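The fusion step can be illustrated with a small helper that combines the per-condition association values of one text vocabulary. The function name `fuse`, its `mode` parameter, and the example weights are hypothetical; the sketch covers weighted, additive, and multiplicative fusion and leaves out division, since the application does not say how a division would be arranged.

```python
import math

def fuse(values, mode="weighted", weights=None):
    """Fuse the association values one word obtained under several conditions."""
    if mode == "weighted":
        if weights is None or len(weights) != len(values):
            raise ValueError("weighted fusion needs one weight per condition")
        return sum(v * w for v, w in zip(values, weights))
    if mode == "add":
        return sum(values)
    if mode == "multiply":
        return math.prod(values)
    raise ValueError(f"unknown fusion mode: {mode}")

# A word scored 5 under condition 1 and 2 under condition 2, equal weights.
weighted = fuse([5, 2], weights=[0.5, 0.5])
added = fuse([5, 2], mode="add")
multiplied = fuse([5, 2], mode="multiply")
```

Choosing larger weights for the condition that is trusted more shifts the final association index data toward that condition.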

Optionally, determining the association index data of each text vocabulary in the text dictionary according to the preset association condition may include: taking the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary, wherein the preset association condition comprises: the text vocabulary and the associated word appear together in the same sentence of the web text; and/or the text vocabulary and the associated word appear with the same part of speech at the same position in sentences of the web text.

In the above embodiment, the association index data between each text vocabulary in the text dictionary and the current associated word may be determined as: the number of times the text vocabulary and the current associated word appear together in the same sentence of the web text; or the number of times the text vocabulary and the current associated word appear with the same part of speech at the same position in sentences of the web text; or a combination of the two, that is, both of these counts. Through this embodiment, the association index data between each text vocabulary in the text dictionary and the current associated word can be determined effectively and accurately by means of the preset association conditions.

The same position in the above embodiments may specifically be a position at the same distance from the same word in each sentence of the web text. For example, if a text vocabulary (such as decayed tooth) lies within five words of the same current associated word (such as Coca-Cola) in each of several different sentences, its positions in those sentences can be regarded as the same position. Alternatively, the same position may be a position within the same word range in each sentence of the web text. For example, if the same text vocabulary appears within the first five words of each of several different sentences, it may be considered to have the same position in those sentences.
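The two readings of "same position" given above can each be expressed as a simple predicate over a segmented sentence. The function names and the single-token spelling "decayed-tooth" are assumptions made for this sketch; the five-word limits follow the examples in the text.

```python
def within_distance(words, target, anchor, max_gap=5):
    """True if `target` occurs within `max_gap` words of `anchor` in the sentence."""
    target_idx = [i for i, w in enumerate(words) if w == target]
    anchor_idx = [i for i, w in enumerate(words) if w == anchor]
    return any(abs(i - j) <= max_gap for i in target_idx for j in anchor_idx)

def within_leading_range(words, target, first_n=5):
    """True if `target` occurs among the first `first_n` words of the sentence."""
    return target in words[:first_n]

sentence = ["too", "much", "coca-cola", "can", "cause", "decayed-tooth"]
near_anchor = within_distance(sentence, "decayed-tooth", "coca-cola")
leads_sentence = within_leading_range(sentence, "decayed-tooth")
```

A word that satisfies the same predicate in several sentences would then be treated as occupying the same position in those sentences.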

Specifically, the association (i.e., the association index data) between each dictionary vocabulary (i.e., each text vocabulary contained in the text dictionary) and the analysis object name (i.e., the associated word) may be calculated according to preset association conditions, which may include, but are not limited to, the following:

Preset association condition 1: the dictionary vocabulary (i.e., each text vocabulary described above) and the analysis object name (i.e., the associated word described above) occur together within one sentence (or one passage, one article, etc.) of the web text.

For example, if the associated word is Coca-Cola and the text vocabulary in the dictionary includes Sprite, the preset association condition is that Sprite and Coca-Cola appear in the same sentence. The number of times Sprite and Coca-Cola appear together in the same sentence is counted and used as the association index data. If Sprite and Coca-Cola appear together in the same sentence 5 times in the web text, the association index data between Sprite and Coca-Cola is 5.
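A minimal sketch of counting under preset association condition 1 follows. It assumes sentences end at ".", "!" or "?" and that a sentence counts once no matter how often the two words repeat inside it; the function name and the sample text are illustrative only.

```python
import re

def cooccurrence_count(text, word, associated_word):
    """Count sentences in which `word` and `associated_word` both appear."""
    count = 0
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = set(re.findall(r"[\w-]+", sentence))
        if word in tokens and associated_word in tokens:
            count += 1
    return count

web_text = ("Sprite and Coca-Cola are sodas. I like Sprite. "
            "Coca-Cola outsells Sprite. Coca-Cola is old.")
index_data = cooccurrence_count(web_text, "sprite", "coca-cola")
```

Here only the first and third sentences contain both words, so the association index data for Sprite in this toy text is 2.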

Preset association condition 2: the dictionary vocabulary (i.e., each text vocabulary described above) and the analysis object name (i.e., the associated word described above) appear with the same part of speech at the same position in sentences of the web text.

For example, suppose the associated word is Coca-Cola and the text vocabulary in the dictionary includes the word "kobi", and "kobi-good" appears in the first sentence of the web text while "kobi-poor" appears in the second sentence. The words then appear in the web text at the same position (e.g., the head of the sentence) with the same part of speech (e.g., noun), and the number of occurrences of every word (such as "kobi") that meets this condition is counted.
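Preset association condition 2 could be counted along the following lines. The sketch takes sentences that are already segmented and tagged, treats "same position" as "within the first `first_n` words", and uses hand-written tags and a Sprite example that are purely hypothetical; a real system would obtain the tags from a part-of-speech tagger.

```python
def same_slot_count(tagged_sentences, word, associated_word, first_n=5):
    """Count occurrences of `word` in the leading range of a sentence whose
    part of speech matches one that `associated_word` takes in that range."""
    anchor_tags = {pos for sent in tagged_sentences
                   for w, pos in sent[:first_n] if w == associated_word}
    return sum(1 for sent in tagged_sentences
               for w, pos in sent[:first_n]
               if w == word and pos in anchor_tags)

tagged = [
    [("coca-cola", "NOUN"), ("tastes", "VERB"), ("good", "ADJ")],
    [("sprite", "NOUN"), ("tastes", "VERB"), ("poor", "ADJ")],
    [("i", "PRON"), ("like", "VERB"), ("sprite", "NOUN")],
]
head_only = same_slot_count(tagged, "sprite", "coca-cola", first_n=1)
first_five = same_slot_count(tagged, "sprite", "coca-cola")
```

With `first_n=1` only the sentence-initial noun is counted, while the default range of five words also accepts the occurrence at the end of the third sentence.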

The association index data may be calculated using a single preset association condition, or using a combination of several preset association conditions with different weights to obtain a final association value (i.e., the association index data). The relationship between the association value and the association is: the higher the association value, the more closely the text vocabulary is associated with the associated word.

Optionally, screening the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain the screened associated vocabularies includes: taking the text vocabularies whose association index data are within a preset range as the screened associated vocabularies; or taking the text vocabularies whose association index data rank in the top N among the plurality of text vocabularies as the screened associated vocabularies; or taking the text vocabularies whose vocabulary information indicates a preset part of speech as the screened associated vocabularies.

In the above embodiment, after the web crawler crawls web text from the target data source based on the associated words in the associated word set of the object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies, and the vocabulary information of each text vocabulary is obtained. The association index data of the text vocabularies, their part-of-speech information, or both are then screened according to the preset screening condition. The screening may take the text vocabularies whose association index data are within the preset range as the screened associated vocabularies, or take the text vocabularies whose association index data rank in the top N as the screened associated vocabularies, or take the text vocabularies whose part-of-speech information matches the preset part of speech as the screened associated vocabularies. The associated word set is then updated using the screened associated vocabularies. Through this embodiment, different preset screening conditions can be set to screen the associated vocabularies, so that screening is flexible and effective and different screening requirements of customers can be met.

Specifically, the preset filtering condition for determining the vocabulary of the word package (i.e. the above-mentioned associated word set) may include, but is not limited to, the following conditions:

The first optional preset screening condition is: all text vocabularies whose association value (i.e. the association index data) lies in a certain interval (for example, the value is greater than a certain threshold, or the value lies between two preset values).

The second optional preset screening condition is: all text vocabularies whose association value (i.e. the association index data) ranks in the top N.

The third optional preset screening condition is: text vocabularies of a specified part of speech.

When the association index data or the part-of-speech information of the plurality of text vocabularies is screened according to the preset screening conditions, the selected condition may be any one of the above conditions or a combination of several of them; when conditions are combined, the intersection of the vocabularies screened under each condition is taken as the associated word set.
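The three screening conditions and their combination by intersection can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout (word to association value, word to part-of-speech tag) and the sample data are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set


def screen_by_range(scores: Dict[str, float], low: float,
                    high: float = float("inf")) -> Set[str]:
    """Condition 1: keep words whose association value lies in [low, high]."""
    return {w for w, s in scores.items() if low <= s <= high}


def screen_top_n(scores: Dict[str, float], n: int) -> Set[str]:
    """Condition 2: keep the n words with the highest association values.

    Ties are broken alphabetically so the result is deterministic
    (an assumption; the source does not say how ties are handled).
    """
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:n])


def screen_by_pos(pos_tags: Dict[str, str], allowed: Iterable[str]) -> Set[str]:
    """Condition 3: keep words whose part-of-speech tag is in `allowed`."""
    allowed_set = set(allowed)
    return {w for w, tag in pos_tags.items() if tag in allowed_set}


def combine(*results: Set[str]) -> Set[str]:
    """Combine several screening results by taking their intersection."""
    if not results:
        return set()
    return set.intersection(*results)


# Hypothetical sample data for illustration only.
scores = {"battery": 9.0, "screen": 7.5, "buy": 6.0, "the": 1.0}
pos_tags = {"battery": "n", "screen": "n", "buy": "v", "the": "det"}

selected = combine(
    screen_by_range(scores, low=5.0),
    screen_top_n(scores, n=3),
    screen_by_pos(pos_tags, allowed=["n"]),
)
```

With the sample data, only the nouns that also pass the threshold and the top-3 cut remain in `selected`.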

In an alternative embodiment, before the relevance index data of the plurality of text vocabularies or the part of speech information of the plurality of text vocabularies is filtered according to the preset filtering condition, the relevance measurement values (i.e. the relevance index data) of the dictionary vocabularies (i.e. the text vocabularies) and the analysis object names (i.e. the relevance words) can be sorted. Specifically, the relevance indexes (i.e., the relevance index data) obtained by the text words in the text dictionary under the preset relevance conditions are sorted from high to low to serve as the subsequent screening content.

Optionally, updating the set of associated words using the filtered associated vocabulary includes: replacing the associated words by using the screened associated words to update the associated word set; or adding the screened associated words into the associated word set to update the associated word set.

Specifically, the screened associated vocabulary is used as a vocabulary packet vocabulary, and a vocabulary packet (i.e., the above-mentioned associated vocabulary set) for the object to be analyzed is established. The word package (i.e., the related word set) can also be used to replace the analysis object name (i.e., the related word) in the next loop of the above process, so as to calculate the relevance of the dictionary vocabulary (i.e., the text vocabulary), to expand the analysis object word package (i.e., the related word set) to a greater extent, and to continuously improve the accuracy of the relevance (i.e., the relevance index data) calculation.
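The two update modes (replacing the associated words, or adding to them) can be sketched as a single helper. The function name `update_word_set` and the `mode` parameter are hypothetical names chosen for this illustration.

```python
from typing import Set


def update_word_set(current: Set[str], screened: Set[str],
                    mode: str = "add") -> Set[str]:
    """Return the updated associated word set.

    mode="replace": the screened vocabulary replaces the current words.
    mode="add":     the screened vocabulary is merged into the current words.
    """
    if mode == "replace":
        return set(screened)
    if mode == "add":
        return current | screened
    raise ValueError(f"unknown update mode: {mode!r}")


# Hypothetical example: the word package starts with the object name only.
word_set = {"phone"}
screened = {"battery", "screen"}
merged = update_word_set(word_set, screened, mode="add")
replaced = update_word_set(word_set, screened, mode="replace")
```

Either result can then serve as the associated words for the next cycle of crawling and screening.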

In an optional embodiment, as shown in fig. 2, the processing method of the related term set may specifically include the following steps:

Step S202, determining the associated words in the associated word set of the object to be analyzed.

Specifically, the object to be analyzed is determined, and the name of the object to be analyzed may be used as the initial content of the word package (i.e., the current associated word in the associated word set).

Step S203, crawling the web text and establishing an initial corpus.

Specifically, the web text may be crawled from target data sources based on current associated words in the associated word set of the object to be analyzed, where the target data sources may include websites, forums, posts, and the like.

Step S204, performing word segmentation on the network text to construct a text dictionary.

Specifically, the network text may be segmented to obtain a plurality of text vocabularies, vocabulary information of each text vocabulary is obtained, where the vocabulary information includes associated index data of each text vocabulary and a current associated word and/or part-of-speech information of each text vocabulary, and then a text dictionary including all text vocabularies in the network text is constructed.
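One possible shape for the text dictionary is sketched below. It assumes the web text has already been segmented and part-of-speech tagged by some external tool, so the input is a list of (word, tag) pairs per text; the `VocabularyInfo` class and `build_text_dictionary` function are hypothetical names, not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class VocabularyInfo:
    """Vocabulary information kept for one text vocabulary."""
    pos_tags: Set[str] = field(default_factory=set)  # parts of speech observed
    association: float = 0.0  # association index data, filled in by a later step


def build_text_dictionary(
    tagged_texts: Iterable[List[Tuple[str, str]]]
) -> Dict[str, VocabularyInfo]:
    """Collect every distinct word of the segmented texts into a dictionary."""
    dictionary: Dict[str, VocabularyInfo] = {}
    for tokens in tagged_texts:
        for word, tag in tokens:
            dictionary.setdefault(word, VocabularyInfo()).pos_tags.add(tag)
    return dictionary


# Hypothetical, already segmented and tagged input.
text_dictionary = build_text_dictionary([
    [("phone", "n"), ("charges", "v")],
    [("charges", "n")],
])
```

Keeping the association value as a field of each entry lets the measuring and sorting steps work on the same structure.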

Step S205, measuring and calculating the association index data of each text vocabulary and associated words in the text dictionary.

Specifically, the associated index data of the plurality of text vocabularies or the part-of-speech information of the plurality of text vocabularies may be screened according to a preset screening condition to obtain the screened associated vocabularies.

Step S206, sorting the associated index data of each text vocabulary in the text dictionary.

Specifically, the measured values of the association index data of each text vocabulary in the text dictionary may be sorted in order from high to low, so as to facilitate the subsequent screening process.

Optionally, the association (i.e. the association index data) between each dictionary vocabulary (i.e. each text vocabulary) contained in the text dictionary and the analysis object name (i.e. the associated word) may be calculated under preset association conditions, which may include, but are not limited to:

The number of times the text vocabulary and the analysis object name (i.e. the associated word) co-occur within a sentence (or a passage, an article, etc.) of the web text.

The number of times the text vocabulary appears in a sentence of the web text with the same part of speech as the analysis object name (i.e. the associated word).

The association index data may be calculated under a single preset association condition, or multiple preset association conditions may be combined, each with its own weight, to calculate a final association value (i.e. the association index data). The relationship between the association value and the degree of association is: the higher the association value, the greater the association between the text vocabulary and the current associated word.
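The two counting conditions and their weighted combination might look like the sketch below. It rests on several assumptions: sentences arrive already segmented and tagged as (word, tag) pairs, each condition is counted at most once per sentence, the second condition is read as "the word carries the same part-of-speech tag as the associated word within the same sentence", and the weights 1.0 and 0.5 are arbitrary illustrative values.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, part-of-speech tag)


def cooccurrence_counts(sentences: List[Sentence], associated: str) -> Counter:
    """Count, per word, the sentences it shares with the associated word."""
    counts: Counter = Counter()
    for sent in sentences:
        words = {w for w, _ in sent}
        if associated in words:
            counts.update(words - {associated})
    return counts


def same_pos_counts(sentences: List[Sentence], associated: str) -> Counter:
    """Count, per word, the sentences where it shares a tag with the associated word."""
    counts: Counter = Counter()
    for sent in sentences:
        assoc_tags = {tag for w, tag in sent if w == associated}
        if not assoc_tags:
            continue
        counts.update({w for w, tag in sent if w != associated and tag in assoc_tags})
    return counts


def association_values(sentences: List[Sentence], associated: str,
                       w_cooc: float = 1.0, w_pos: float = 0.5) -> Dict[str, float]:
    """Weighted sum of the two condition counts for every co-occurring word."""
    cooc = cooccurrence_counts(sentences, associated)
    pos = same_pos_counts(sentences, associated)
    return {w: w_cooc * cooc[w] + w_pos * pos[w] for w in cooc}


# Hypothetical corpus for illustration only.
sentences = [
    [("phone", "n"), ("battery", "n"), ("lasts", "v")],
    [("phone", "n"), ("battery", "n"), ("drains", "v")],
    [("weather", "n"), ("rains", "v")],
]
values = association_values(sentences, "phone")
```

Words that never share a sentence with the associated word receive no value at all, which keeps unrelated vocabulary such as "weather" out of the later screening step.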

Step S207, setting preset screening conditions, and screening text vocabularies in the text dictionary.

Step S208, establishing a related word set.

In particular, the set of related words may be updated using the filtered out related vocabulary.
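Steps S202 to S208, run for several cycles, can be sketched end to end as below. The crawler and the word segmenter are passed in as functions because the source does not name specific tools; `fake_crawl`, the whitespace segmenter, the in-memory corpus, the use of sentence co-occurrence as the only association condition and top-N as the only screening condition are all assumptions made to keep the example self-contained.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def accumulate_word_set(
    object_name: str,
    crawl: Callable[[Set[str]], Iterable[str]],
    segment: Callable[[str], List[str]],
    top_n: int = 2,
    rounds: int = 2,
) -> Set[str]:
    word_set = {object_name}                        # S202: initial associated word
    for _ in range(rounds):
        texts = list(crawl(word_set))               # S203: crawl, build corpus
        segmented = [segment(t) for t in texts]     # S204: segment into vocabulary
        scores: Counter = Counter()                 # S205: co-occurrence counts
        for tokens in segmented:
            unique = set(tokens)
            if unique & word_set:
                scores.update(unique - word_set)
        ranked = sorted(scores, key=lambda w: (-scores[w], w))  # S206: sort
        screened = set(ranked[:top_n])              # S207: top-N screening
        word_set |= screened                        # S208: update the word set
    return word_set


# Hypothetical stand-ins for a real crawler and segmenter.
corpus = [
    "phone battery life",
    "phone battery charger",
    "charger cable price",
    "weather today",
]


def fake_crawl(words: Set[str]) -> List[str]:
    """Return every text that mentions at least one associated word."""
    return [t for t in corpus if words & set(t.split())]


result = accumulate_word_set("phone", fake_crawl, str.split)
```

In the second cycle the enlarged word set also retrieves the "charger cable price" text, which shows how each cycle can widen what the next crawl reaches.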

Compared with the existing word-package accumulation method, the word-package accumulation method adopted by the embodiments of the present application has the following advantages: the vocabulary in the associated word set grows quickly, and the efficiency of word-package accumulation is significantly improved; the association between the word-package vocabulary (i.e. the associated vocabularies) and the analysis object (i.e. the associated word) can be measured quantitatively; the preset association conditions used to calculate that association can be set flexibly, and the calculation can combine several conditions; the values of the association index data can be sorted before screening, so that the preset screening conditions can be set flexibly and applied in combination; and the word-package accumulation process can be run cyclically, with the word package (i.e. the associated word set) generated in the previous cycle replacing the analysis object name (i.e. the associated word) in the current cycle, so that repeated iteration continuously expands the content of the word package, improves its accuracy, and enlarges its coverage.

Example 2

According to an embodiment of the present application, there is further provided an embodiment of a processing apparatus for associating word sets, as shown in fig. 3, the processing apparatus includes: a crawling unit 10, a processing unit 30, a screening unit 50 and an updating unit 70.

The crawling unit 10 is configured to crawl a web text from a target data source based on associated terms in an associated term set of an object to be analyzed.

The processing unit 30 is configured to perform word segmentation on the web text to obtain a plurality of text words, and obtain word information of each text word, where the word information includes association index data of each text word and/or part-of-speech information of each text word, and the association index data is used to indicate association degrees of each text word and associated words.

The screening unit 50 is configured to screen the associated index data of the plurality of text vocabularies and/or the part of speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies.

The updating unit 70 is configured to update the associated word set by using the screened associated vocabulary.

Optionally, the processing unit comprises a creating module and a determining module.

The creating module is configured to create a text dictionary of the plurality of text vocabularies after the network text is segmented to obtain the plurality of text vocabularies; and the determining module is configured to determine the association index data of each text vocabulary in the text dictionary according to the preset association condition and/or to extract the part-of-speech information of each text vocabulary in the text dictionary.

By adopting the present application, after the web crawler crawls the web text from the target data source based on the current associated words in the associated word set of the object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies, vocabulary information of each text vocabulary is obtained, and the association index data or the part-of-speech information of the text vocabularies is screened according to preset screening conditions; once the screened associated vocabularies are obtained, the associated word set is updated with them. Through this embodiment, web text obtained by indiscriminate crawling can be segmented and screened to obtain screened associated vocabularies with which the associated word set is updated; segmentation and screening are then repeated so that the associated word set is continuously expanded and updated. This solves the problem that existing word-package accumulation methods accumulate only a small vocabulary, and achieves the effect of perfecting the associated word set of the object to be analyzed.

Optionally, the determining module includes: a first computation submodule and a second computation submodule.

The first calculation submodule is used for acquiring a relevance value of each text vocabulary corresponding to a preset relevance condition if the preset relevance condition is one, so as to obtain relevance index data of each text vocabulary; and the second calculation submodule is used for acquiring the relevance numerical value of each text vocabulary corresponding to each preset relevance condition if the preset relevance condition is multiple, performing fusion operation on all the relevance numerical values of each text vocabulary, and taking the fusion result as the relevance index data of each text vocabulary, wherein the fusion operation comprises at least one of weighting calculation, addition calculation and multiplication-division calculation.
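The split between a single condition (use its value directly) and multiple conditions (fuse the values) can be sketched as follows. The function names and the `method` strings are hypothetical; the source lists weighting, addition and multiplication-division as fusion operations without giving formulas, so the formulas below are assumptions, and division is left out because its operand order is not specified.

```python
from math import prod
from typing import Optional, Sequence


def fuse(values: Sequence[float], method: str = "add",
         weights: Optional[Sequence[float]] = None) -> float:
    """Fuse the association values obtained under several preset conditions."""
    if method == "add":
        return float(sum(values))
    if method == "weighted":
        if weights is None or len(weights) != len(values):
            raise ValueError("weighted fusion needs one weight per value")
        return float(sum(w * v for w, v in zip(weights, values)))
    if method == "multiply":
        return float(prod(values))
    raise ValueError(f"unknown fusion method: {method!r}")


def association_index(values: Sequence[float], method: str = "add",
                      weights: Optional[Sequence[float]] = None) -> float:
    """One condition: use its value directly. Several conditions: fuse them."""
    if len(values) == 1:
        return float(values[0])
    return fuse(values, method=method, weights=weights)


# Hypothetical values of one text vocabulary under two conditions.
single = association_index([4])
added = association_index([2, 3], method="add")
weighted = association_index([2, 3], method="weighted", weights=[0.5, 2.0])
multiplied = association_index([2, 3], method="multiply")
```

A weighted sum lets one condition dominate without discarding the other, whereas multiplication drives the result to zero whenever any single condition yields zero.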

In the above embodiment, after performing word segmentation on a web text crawled from a target data source to obtain a plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies, determining association index data of each text vocabulary in the text dictionary and a current associated word according to a preset association condition, and if the preset association condition is one, calculating an association numerical value of each text vocabulary according to the preset association condition to obtain association index data of each text vocabulary and the current associated word; if the preset association condition is multiple, acquiring the association numerical value of each text vocabulary corresponding to each preset association condition, fusing all the association numerical values of each text vocabulary, taking the fused result as the association index data of each text vocabulary, screening the association index data of the text vocabularies or the part of speech information of the text vocabularies according to the preset screening condition to obtain the screened association vocabularies, and updating the association word set by using the screened association vocabularies. Through the embodiment, the association index data of each text vocabulary and the current association terms can be acquired by adopting the preset association conditions with different weights, so that the effect of flexibly acquiring the association index data can be achieved.

Optionally, the determining module may include a determining submodule configured to take the number of times each text vocabulary satisfies the preset association condition as the association index data of that text vocabulary, wherein the preset association condition includes: each text vocabulary and the associated word appear in the same sentence of the web text at the same time; and/or each text vocabulary and the associated word appear at the same position in a sentence of the web text with the same part of speech.

Optionally, the screening unit may include a first screening module, a second screening module and a third screening module. The first screening module is configured to take the text vocabularies whose association index data fall within a preset range as the screened associated vocabularies; or the second screening module is configured to take the text vocabularies whose association index data rank in the top N among the plurality of text vocabularies as the screened associated vocabularies; or the third screening module is configured to take the text vocabularies whose vocabulary information indicates a preset part of speech as the screened associated vocabularies.

In the above embodiment, after the web crawler crawls the web text from the target data source based on the current associated words in the associated word set of the object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies, and vocabulary information of each text vocabulary is obtained. The association index data of the text vocabularies, the part-of-speech information of the text vocabularies, or both are then screened according to the preset screening condition. The screening may take as the screened associated vocabularies the text vocabularies whose association index data fall within a preset range, the text vocabularies whose association index data rank in the top N, or the text vocabularies whose vocabulary information indicates a preset part of speech. The associated word set is then updated using the screened associated vocabularies. Through this embodiment, different preset screening conditions can be set for screening the associated vocabularies, so that screening is flexible and effective and can meet the different screening requirements of customers.

Optionally, the update unit includes: a first update module and a second update module.

The first updating module is used for replacing the associated words by using the screened associated words so as to update the associated word set; or the second updating module is used for adding the screened associated words into the associated word set so as to update the associated word set.

The processing device for the associated word set comprises a processor and a memory, the above-mentioned crawling unit 10, the processing unit 30, the filtering unit 50, the updating unit 70 and the like are all stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, web text obtained by indiscriminate crawling is segmented and screened to obtain screened associated vocabularies with which the associated word set is updated, and the segmentation and screening are then repeated so that the associated word set is continuously expanded and updated. This solves the problem that existing word-package accumulation methods accumulate only a small vocabulary, and achieves the effect of perfecting the associated word set of the object to be analyzed.

The memory may include forms of computer-readable media such as volatile memory, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: crawling a web text from a target data source based on associated words in an associated word set of an object to be analyzed; performing word segmentation on the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises association index data between each text vocabulary and the associated words and/or part-of-speech information of each text vocabulary; screening the association index data of the plurality of text vocabularies or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies; and updating the associated word set by using the screened associated vocabularies.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program code.

The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims

1. A processing method for an associated word set, characterized by comprising the following steps:

crawling a web text from a target data source based on associated words in an associated word set of an object to be analyzed;

performing word segmentation on the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises associated index data of each text vocabulary and/or part-of-speech information of each text vocabulary, the associated index data is used for indicating the association degree of each text vocabulary and the associated words, the associated index data of each text vocabulary is determined by preset associated conditions, and the preset associated conditions comprise: each text vocabulary and the associated words appear in the same sentence of the web text at the same time; and/or each of the text vocabularies and the associated word appear at the same position in the sentence of the web text with the same part of speech within the web text;

screening the associated index data of the plurality of text vocabularies and/or the part of speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies;

updating the associated word set by using the screened associated vocabulary;

wherein performing word segmentation on the web text to obtain a plurality of text vocabularies and acquiring vocabulary information of each text vocabulary comprises:

after the network text is subjected to word segmentation to obtain a plurality of text vocabularies, creating text dictionaries of the text vocabularies;

and determining the association index data of each text vocabulary in the text dictionary according to a preset association condition, and/or extracting the part-of-speech information of each text vocabulary in the text dictionary.

2. The processing method of claim 1, wherein determining the relevance index data of each text vocabulary in the text dictionary according to a preset relevance condition comprises:

if the preset association condition is one, acquiring an association numerical value of each text vocabulary corresponding to the preset association condition to obtain association index data of each text vocabulary;

if the preset association condition is multiple, acquiring an association numerical value of each text vocabulary corresponding to each preset association condition, performing fusion operation on all the association numerical values of each text vocabulary, and taking a fusion result as association index data of each text vocabulary, wherein the fusion operation comprises at least one of weighted calculation, addition calculation and multiplication-division calculation.

3. The processing method of claim 1, wherein determining the relevance index data of each text vocabulary in the text dictionary according to a preset relevance condition comprises:

and taking the times of the text vocabularies meeting the preset association conditions as association index data of the text vocabularies.

4. The processing method according to claim 1, wherein the step of filtering the relevance index data of the plurality of text vocabularies and/or the part of speech information of the plurality of text vocabularies according to a preset filtering condition to obtain filtered relevance vocabularies comprises:

taking the text vocabulary of which the associated index data are in a preset range as the screened associated vocabulary; or

taking the text vocabularies whose associated index data rank in the top N among the plurality of text vocabularies as the screened associated vocabulary; or

taking the text vocabulary with the vocabulary information of a preset part of speech as the screened associated vocabulary.

5. The processing method according to any one of claims 1 to 4, wherein updating the set of related words using the filtered out related vocabulary comprises:

replacing the associated words with the screened associated words to update the associated word set; or

adding the screened associated words into the associated word set to update the associated word set.

6. A processing apparatus for an associated word set, comprising:

the crawling unit is used for crawling the network text from the target data source based on the associated words in the associated word set of the object to be analyzed;

the processing unit is configured to perform word segmentation on the web text to obtain a plurality of text vocabularies, and acquire vocabulary information of each text vocabulary, where the vocabulary information includes associated index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the associated index data is used to indicate a degree of association between each text vocabulary and the associated word, where the associated index data of each text vocabulary is determined by a preset association condition, and the preset association condition includes: each text vocabulary and the associated words appear in the same sentence of the web text at the same time; and/or each of the text vocabularies and the associated word appear at the same position in the sentence of the web text with the same part of speech within the web text;

the screening unit is used for screening the associated index data of the text vocabularies and/or the part of speech information of the text vocabularies according to preset screening conditions to obtain screened associated vocabularies;

the updating unit is used for updating the associated word set by using the screened associated words;

wherein the processing unit comprises:

the creating module is used for creating a text dictionary of a plurality of text vocabularies after the network text is subjected to word segmentation to obtain the plurality of text vocabularies;

and the determining module is used for determining the association index data of each text vocabulary in the text dictionary according to a preset association condition and/or extracting the part of speech information of each text vocabulary in the text dictionary.

7. The processing apparatus of claim 6, wherein the determining module comprises:

the first calculation submodule is used for acquiring a relevance value of each text vocabulary corresponding to the preset relevance condition if the preset relevance condition is one, so as to obtain relevance index data of each text vocabulary;

and the second calculation submodule is used for acquiring a relevance numerical value of each text vocabulary corresponding to each preset relevance condition if the preset relevance condition is multiple, performing fusion operation on all the relevance numerical values of each text vocabulary, and taking a fusion result as relevance index data of each text vocabulary, wherein the fusion operation comprises at least one of weighting calculation, addition calculation and multiplication-division calculation.

8. The processing apparatus of claim 6, wherein the determining module comprises:

and the determining submodule is used for taking the times of the text vocabularies meeting the preset association conditions as association index data of the text vocabularies.