CN106649334B - Processing method and device of associated word set - Google Patents

Info

Publication number
CN106649334B
Authority
CN
China
Prior art keywords
text
vocabulary
vocabularies
index data
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510726038.1A
Other languages
Chinese (zh)
Other versions
CN106649334A (en
Inventor
梁梦溪
何鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510726038.1A priority Critical patent/CN106649334B/en
Publication of CN106649334A publication Critical patent/CN106649334A/en
Application granted granted Critical
Publication of CN106649334B publication Critical patent/CN106649334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for processing an associated word set. The processing method comprises the following steps: crawling web text from a target data source based on associated words in an associated word set of an object to be analyzed; segmenting the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises association index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the association index data indicates the degree of association between each text vocabulary and the associated words; screening the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies; and updating the associated word set using the screened associated vocabularies. The method and the device solve the technical problem that existing word package accumulation methods yield only a small vocabulary.

Description

Processing method and device of associated word set
Technical Field
The application relates to the field of the Internet, and in particular to a method and a device for processing an associated word set.
Background
When an enterprise releases a product or launches a service, a government issues a policy, or a sudden event draws public attention, related news and other content reported by online media appears on the Internet, and this online news attracts the attention and discussion of netizens. When collecting Internet public opinion content (namely, Internet text related to an object) for a certain analysis object (such as a current event, product, person, or policy), if a web crawler is used to crawl Internet text to collect information, the crawling process does not distinguish whether content is related to the analysis object, so the crawled Internet text must be screened afterwards to retain only the content related to the object to be analyzed.
Generally, in the process of screening and filtering web text, whether a piece of web text is related to the object to be analyzed is judged by setting certain judgment conditions. A set of content related to the object to be analyzed is used as a "word package", and the content in the word package is used in place of the object to be analyzed to screen and filter the web text. This process may also be referred to as word package accumulation.
The basic existing method of word package accumulation is manual input based on human association, and it mostly adopts the following combinations of vocabulary: the name of the object to be analyzed alone as a word package; the combination of the name of the object to be analyzed and its synonyms as a word package; and the combination of the name of the object to be analyzed and competing-product words as a word package. It can be seen that the existing word package accumulation method has the following disadvantages: the vocabulary is small; how closely a vocabulary item is related to the analysis object cannot be measured quantitatively; accumulation with manual participation takes a long time and is inefficient; and scalability is poor.
No effective solution has yet been proposed for the problem that existing word package accumulation methods yield only a small vocabulary.
Disclosure of Invention
The embodiments of the application provide a method and a device for processing an associated word set, which at least solve the technical problem that existing word package accumulation methods yield only a small vocabulary.
According to an aspect of an embodiment of the present application, there is provided a processing method for an associated word set, the processing method including: crawling web text from a target data source based on associated words in an associated word set of an object to be analyzed; segmenting the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises association index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the association index data indicates the degree of association between each text vocabulary and the associated words; screening the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies; and updating the associated word set using the screened associated vocabularies.
Further, performing word segmentation on the web text to obtain a plurality of text words, and acquiring word information of each text word includes: after the network text is subjected to word segmentation to obtain a plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies; and determining the association index data of each text vocabulary in the text dictionary according to a preset association condition, and/or extracting the part of speech information of each text vocabulary in the text dictionary.
Further, determining the association index data of each text vocabulary in the text dictionary according to the preset association condition comprises: if the preset association condition is one, acquiring an association numerical value of each text vocabulary corresponding to the preset association condition to obtain association index data of each text vocabulary; if the preset association condition is multiple, acquiring an association numerical value of each text vocabulary corresponding to each preset association condition, performing fusion operation on all the association numerical values of each text vocabulary, and taking a fusion result as association index data of each text vocabulary, wherein the fusion operation comprises at least one of weighted calculation, addition calculation and multiplication-division calculation.
Further, determining the association index data of each text vocabulary in the text dictionary according to the preset association condition comprises: taking the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary, wherein the preset association condition comprises: each text vocabulary and the associated word appear together in the same sentence of the web text; and/or each text vocabulary and the associated word appear with the same part of speech at the same position in sentences of the web text.
Further, screening the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain the screened associated vocabularies comprises: taking the text vocabularies whose association index data falls within a preset range as the screened associated vocabularies; or taking the text vocabularies whose association index data ranks in the top N among the plurality of text vocabularies as the screened associated vocabularies; or taking the text vocabularies whose vocabulary information indicates a preset part of speech as the screened associated vocabularies.
Further, updating the set of associated terms using the filtered out associated vocabulary includes: replacing the associated words by using the screened associated words to update the associated word set; or adding the screened associated words into the associated word set to update the associated word set.
According to another aspect of the embodiments of the present application, there is also provided a processing apparatus for associating word sets, the processing apparatus including: the crawling unit is used for crawling the network text from the target data source based on the associated words in the associated word set of the object to be analyzed; the processing unit is used for segmenting the network text to obtain a plurality of text vocabularies and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises associated index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the associated index data is used for indicating the association degree of each text vocabulary and the associated words; the screening unit is used for screening the associated index data of the text vocabularies and/or the part of speech information of the text vocabularies according to preset screening conditions to obtain screened associated vocabularies; and the updating unit is used for updating the associated word set by using the screened associated vocabulary.
Further, the processing unit includes: the creating module is used for creating a text dictionary of a plurality of text vocabularies after the network text is subjected to word segmentation to obtain a plurality of text vocabularies; and the determining module is used for determining the association index data of each text vocabulary in the text dictionary according to the preset association condition and/or extracting the part of speech information of each text vocabulary in the text dictionary.
Further, the determining module includes: the first calculation submodule is used for acquiring the relevance value of each text vocabulary corresponding to the preset relevance condition if the preset relevance condition is one, and obtaining the relevance index data of each text vocabulary; and the second calculation submodule is used for acquiring the relevance numerical value of each text vocabulary corresponding to each preset relevance condition if the preset relevance condition is multiple, performing fusion operation on all the relevance numerical values of each text vocabulary, and taking the fusion result as the relevance index data of each text vocabulary, wherein the fusion operation comprises at least one of weighting calculation, addition calculation and multiplication-division calculation.
Further, the determining module includes: a determining submodule, used for taking the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary, wherein the preset association condition comprises: each text vocabulary and the associated word appear together in the same sentence of the web text; and/or each text vocabulary and the associated word appear with the same part of speech at the same position in sentences of the web text.
In the embodiments of the application, after a web crawler crawls web text from a target data source based on associated words in an associated word set of an object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies, and vocabulary information of each text vocabulary is acquired. The association index data of the text vocabularies or the part-of-speech information of the text vocabularies is screened according to a preset screening condition, and once the screened associated vocabularies are obtained, the associated word set is updated using them. Through this embodiment, indiscriminately crawled web text can be segmented and screened to obtain associated vocabularies with which to update the associated word set; segmentation and screening are then repeated so that the associated word set is continuously expanded and updated. This solves the problem that existing word package accumulation methods yield only a small vocabulary, and achieves the effect of improving the associated word set of the object to be analyzed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a method of processing a set of related words according to an embodiment of the present application;
FIG. 2 is a flow diagram of another alternative method of processing associated word sets in accordance with an embodiment of the present application; and
FIG. 3 is a schematic diagram of a processing device for associating word sets according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Explanation of terms:
Analysis object: the object whose public opinion content is to be analyzed on the basis of web text content. It may be a current event, a product, a person, a policy, etc.
Corpus: web text crawled by crawlers.
Dictionary vocabulary: a lexicon in which, after the text in the corpus has been segmented, individual words and the relationships among words are stored.
Relevance: the degree of closeness between multiple objects (words).
Screening logic: conditional algorithms used to filter vocabulary.
Word package: a set of vocabularies used in place of the analysis object to screen the web text in the corpus and filter out the content related to the analysis object.
Example 1
In accordance with an embodiment of the present application, an embodiment of a method for processing an associated word set is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one given here.
Fig. 1 is a flowchart of a processing method for an associated word set according to an embodiment of the present application. As shown in fig. 1, the processing method includes the following steps:
Step S102, crawling web text from the target data source based on the associated words in the associated word set of the object to be analyzed.
Step S104, performing word segmentation on the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises association index data of each text vocabulary and/or part-of-speech information of each text vocabulary, and the association index data indicates the degree of association between each text vocabulary and the associated word.
Step S106, screening the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies.
Step S108, updating the associated word set using the screened associated vocabularies.
With the method and the device of the application, after the web crawler crawls web text from the target data source based on the current associated words in the associated word set of the object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies, and vocabulary information of each text vocabulary is acquired. The association index data of the text vocabularies or the part-of-speech information of the text vocabularies is screened according to a preset screening condition, and once the screened associated vocabularies are obtained, the associated word set is updated using them.
Through this embodiment, indiscriminately crawled web text can be segmented and screened to obtain associated vocabularies with which to update the associated word set; segmentation and screening are then repeated so that the associated word set is continuously expanded and updated. This solves the problem that existing word package accumulation methods yield only a small vocabulary, and achieves the effect of improving the associated word set of the object to be analyzed.
In the above embodiment, the initial corpus may be established from a large amount of indiscriminately crawled web text. After the web text in the initial corpus is segmented, the association between each dictionary vocabulary obtained by segmentation (namely, a text vocabulary) and the name of the analysis object (namely, an associated word) is calculated by a given method, and the dictionary vocabularies meeting the conditions (namely, the text vocabularies meeting the preset screening condition, i.e., the screened associated vocabularies) are screened out through reasonable vocabulary screening logic to form a word package. By repeating these steps, the word package can be continuously expanded, and the word package content (namely, the associated word set) for the analysis object is improved.
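The repeated segment, score, screen, and update cycle described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names (`split_sentences`, `segment`, `expand_word_package`) are hypothetical, regular-expression tokenization stands in for a real word segmenter, same-sentence co-occurrence is used as the only association condition, and top-N ranking is used as the only screening condition.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on ., ! and ? (a stand-in for a real sentence splitter).
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def segment(sentence):
    # Regex tokenization as a stand-in for a real word segmenter.
    return re.findall(r"[\w-]+", sentence.lower())


def expand_word_package(seed_words, corpus, top_n=1, rounds=1):
    """Grow a word package by repeating: score, screen, update."""
    package = {w.lower() for w in seed_words}
    for _ in range(rounds):
        scores = Counter()
        for text in corpus:
            for sentence in split_sentences(text):
                words = set(segment(sentence))
                # Count a word once per sentence it shares with any package word.
                if words & package:
                    for word in words - package:
                        scores[word] += 1
        # Screen: keep the top-N words (ties broken alphabetically).
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        selected = [w for w, _ in ranked[:top_n]]
        if not selected:
            break
        # Update: add the screened words to the package.
        package.update(selected)
    return package
```

Each additional round scores words against the enlarged package, which is how the set keeps expanding. A practical version would also filter out function words before ranking.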
Specifically, indiscriminate crawling may refer to crawling all content updated on a website within a period of time without setting specific keywords. For example, crawling is performed once a day: all articles, comments, and other content newly added to the website during the previous day are crawled, and content that has already been crawled is not crawled again.
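The rule that already-crawled content is not crawled again can be illustrated with a small bookkeeping sketch. It assumes each crawled item carries some stable identifier (for example a URL); the function name `select_new_items` and the `(item_id, text)` tuple layout are hypothetical and not taken from the application.

```python
def select_new_items(items, seen_ids):
    """Return the texts of items not crawled before and record their ids.

    items: iterable of (item_id, text) pairs from one crawl run.
    seen_ids: set of ids crawled in earlier runs; updated in place.
    """
    fresh = []
    for item_id, text in items:
        if item_id in seen_ids:
            continue  # already in the corpus, skip it
        seen_ids.add(item_id)
        fresh.append(text)
    return fresh
```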
Optionally, before crawling the web text from the target data source based on the associated word in the associated word set of the object to be analyzed, the name of the analysis object (i.e. the associated word in the associated word set of the object to be analyzed) may be determined, and specifically, the name of the object to be analyzed may be determined as the initial content of the word package.
In an alternative embodiment, after crawling web text, an initial corpus may be established. For a determined object to be analyzed (i.e., an associated word in the associated word set of the object to be analyzed), a certain amount of text content (i.e., the web text described above) is crawled indiscriminately from its target data source (e.g., a website, a forum, a post, etc.) as the initial corpus for the object to be analyzed. The larger the amount of text contained in the initial corpus, the more accurate the subsequent association calculation is likely to be.
Optionally, the segmenting the web text to obtain a plurality of text vocabularies, and the obtaining vocabulary information of each text vocabulary includes: after the network text is subjected to word segmentation to obtain a plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies; and determining the association index data of each text vocabulary in the text dictionary according to a preset association condition, and/or extracting the part of speech information of each text vocabulary in the text dictionary.
In the above embodiment, after the web text crawled from the target data source is segmented to obtain a plurality of text vocabularies, a text dictionary of the plurality of text vocabularies is created, association index data of each text vocabulary in the text dictionary and the current associated word is determined according to a preset association condition, or part-of-speech information of each text vocabulary in the text dictionary is extracted while association index data of each text vocabulary in the text dictionary and the current associated word is determined according to the preset association condition. And then screening the associated index data of the plurality of text vocabularies or the part of speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies, and updating the associated word set by using the screened associated vocabularies.
Through the embodiment, the vocabulary information of the text vocabulary can be recorded by creating the text dictionary after word segmentation, so that the extraction of the vocabulary information of the text vocabulary is facilitated, and the effects of quickly and accurately acquiring information and accumulating word packets are realized.
Specifically, the crawled web text may be used as an initial corpus, and the text content (i.e., web text) in the initial corpus is then segmented to construct a dictionary (i.e., the text dictionary) containing all words (i.e., text vocabularies) in the text (i.e., web text).
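As a rough illustration of constructing a text dictionary from a segmented corpus, the sketch below records, for every word, how often it occurs and in which sentences. The record layout (`count`, `sentences`) and the function name `build_text_dictionary` are assumptions made for this example, and regular-expression tokenization stands in for a real word segmenter and part-of-speech tagger.

```python
import re
from collections import defaultdict


def build_text_dictionary(corpus):
    """Map each word to its occurrence count and the ids of its sentences."""
    dictionary = defaultdict(lambda: {"count": 0, "sentences": set()})
    sentence_id = 0
    for text in corpus:
        for sentence in re.split(r"[.!?]+", text):
            words = re.findall(r"[\w-]+", sentence.lower())
            if not words:
                continue  # skip empty fragments left by the splitter
            for word in words:
                dictionary[word]["count"] += 1
                dictionary[word]["sentences"].add(sentence_id)
            sentence_id += 1
    return dict(dictionary)
```

Storing sentence ids alongside the counts makes later association calculations, such as same-sentence co-occurrence, a matter of intersecting two sets.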
Optionally, determining association index data of each text vocabulary in the text dictionary according to a preset association condition includes: if the preset association condition is one, acquiring an association numerical value of each text vocabulary corresponding to the preset association condition to obtain association index data of each text vocabulary; if the preset association condition is multiple, acquiring an association numerical value of each text vocabulary corresponding to each preset association condition, performing fusion operation on all the association numerical values of each text vocabulary, and taking a fusion result as association index data of each text vocabulary, wherein the fusion operation comprises at least one of weighted calculation, addition calculation and multiplication-division calculation.
In the above embodiment, after performing word segmentation on a web text crawled from a target data source to obtain a plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies, determining association index data of each text vocabulary in the text dictionary and a current associated word according to a preset association condition, and if the preset association condition is one, calculating an association numerical value of each text vocabulary according to the preset association condition to obtain association index data of each text vocabulary and the current associated word; if the preset association condition is multiple, acquiring the association numerical value of each text vocabulary corresponding to each preset association condition, fusing all the association numerical values of each text vocabulary, taking the fused result as the association index data of each text vocabulary, screening the association index data of the text vocabularies or the part of speech information of the text vocabularies according to the preset screening condition to obtain the screened association vocabularies, and updating the association word set by using the screened association vocabularies.
Through the embodiment, the association index data of each text vocabulary and the current association terms can be acquired by adopting the preset association conditions with different weights, so that the effect of flexibly acquiring the association index data can be achieved.
Specifically, the fusion operation may include at least one of a weighting calculation, an addition calculation, and a multiplication-division calculation in the above-described embodiment. For example, when the fusion operation includes weighting calculation, that is, if the preset association condition is multiple, the condition weight of the preset association condition may be obtained, the association value of each text vocabulary is calculated through each preset association condition, and the association index data of each text vocabulary is obtained by performing weighting calculation on each condition weight and the corresponding association value.
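The three kinds of fusion operation named above can be sketched as simple functions over the per-condition association values of one text vocabulary. The function names and the example weights are hypothetical; the application does not prescribe specific weights.

```python
import math


def fuse_weighted(values, weights):
    # Weighted calculation: sum of each association value times its condition weight.
    if len(values) != len(weights):
        raise ValueError("one weight is required per association value")
    return sum(v * w for v, w in zip(values, weights))


def fuse_sum(values):
    # Addition calculation: plain sum of the association values.
    return sum(values)


def fuse_product(values):
    # Multiplication calculation: product of the association values.
    return math.prod(values)
```

For instance, with association values 5 and 2 under two conditions and equal weights of 0.5, `fuse_weighted([5, 2], [0.5, 0.5])` gives 3.5.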
Optionally, determining the association index data of each text vocabulary in the text dictionary according to the preset association condition may include: taking the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary, wherein the preset association condition includes: each text vocabulary and the associated word appear together in the same sentence of the web text; and/or each text vocabulary and the associated word appear with the same part of speech at the same position in sentences of the web text.
In the above embodiment, the association index data of each text vocabulary in the text dictionary with respect to the current associated word may be determined from: the number of times the text vocabulary and the current associated word appear together in the same sentence of the web text; or the number of times the text vocabulary and the current associated word appear with the same part of speech at the same position in sentences of the web text; or a combination of the two, namely both of these counts. Through this embodiment, the association index data of each text vocabulary in the text dictionary with respect to the current associated word can be determined effectively and accurately by means of the preset association conditions.
The same position in the above embodiments may specifically be a position at the same distance from a given word in each sentence of the web text. For example, if a text vocabulary (such as "decayed tooth") lies within five words of the same current associated word (such as Coca-Cola) in each sentence, its positions in the different sentences can be regarded as the same position. Alternatively, the same position may be a position within the same word range in each sentence of the web text: if the same text vocabulary appears within the first five words of the sentence in different sentences, it can be regarded as having the same position.
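The two readings of "same position" given above can each be expressed as a small predicate over an already segmented sentence. This is a sketch under assumptions: the function names are hypothetical, distance is measured as the difference between token indices, and the five-word limit is exposed as a parameter.

```python
def within_distance(tokens, word, anchor, max_distance=5):
    """True if `word` occurs within `max_distance` tokens of `anchor`."""
    word_positions = [i for i, t in enumerate(tokens) if t == word]
    anchor_positions = [i for i, t in enumerate(tokens) if t == anchor]
    return any(
        abs(i - j) <= max_distance
        for i in word_positions
        for j in anchor_positions
    )


def within_leading_window(tokens, word, window=5):
    """True if `word` occurs among the first `window` tokens of the sentence."""
    return word in tokens[:window]
```

Under the first reading, a text vocabulary counts as being at the same position in every sentence where `within_distance` holds; under the second, in every sentence where `within_leading_window` holds.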
Specifically, the association (i.e., the association index data) between each dictionary vocabulary (i.e., each text vocabulary contained in the text dictionary) and the analysis object name (i.e., the associated word) may be calculated by presetting association conditions, which may include, but are not limited to, the following:
Preset association condition 1: the dictionary vocabulary (i.e., each text vocabulary described above) and the analysis object name (i.e., the associated word described above) occur together within a sentence (or a passage, an article, etc.) of the web text.
For example, if the associated word is Coca-Cola and the text vocabulary in the dictionary includes Sprite, the preset association condition is that Sprite and Coca-Cola appear in the same sentence. The number of times Sprite and Coca-Cola appear together in the same sentence is counted and used as the association index data. If Sprite and Coca-Cola appear together in the same sentence 5 times in the web text, the association index data between Sprite and Coca-Cola is 5.
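The counting in this example can be sketched as below. The function name `cooccurrence_count` is hypothetical, and the sentence splitting and tokenization are simple regular-expression stand-ins for a real segmenter.

```python
import re


def cooccurrence_count(text, word, associated_word):
    """Count the sentences in which both words appear."""
    word = word.lower()
    associated_word = associated_word.lower()
    count = 0
    for sentence in re.split(r"[.!?]+", text):
        tokens = set(re.findall(r"[\w-]+", sentence.lower()))
        if word in tokens and associated_word in tokens:
            count += 1
    return count
```

Each sentence contributes at most 1 to the count, however many times the two words occur in it; counting every pair of occurrences instead would be an equally plausible reading.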
Preset association condition 2: the dictionary vocabulary (i.e., each text vocabulary described above) and the analysis object name (i.e., the associated word described above) appear with the same part of speech at the same position in sentences of the web text.
For example, if the associated word is coca-cola, the text vocabulary in the dictionary includes the word "kobi", "kobi-good" appears in the first sentence of the web text and "kobi-poor" appears in the second sentence, then the word appears in the web text at the same position (e.g., the head of the sentence) with the same part of speech (e.g., noun). In this case, the number of occurrences of every word (e.g., the word "kobi") that meets the above condition is counted.
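One possible reading of preset association condition 2 can be sketched as follows. The sketch assumes that sentences are already segmented and part-of-speech tagged as lists of (token, tag) pairs, and that "same position" means "within the first `window` tokens of a sentence"; the function name, the tag labels and the window parameter are all hypothetical.

```python
from collections import Counter


def same_position_counts(tagged_sentences, associated_word, window=5):
    """Count sentences in which a word sits in the leading window with a
    part of speech that the associated word also takes in that window."""
    # Part-of-speech tags the associated word takes inside the leading window.
    anchor_tags = set()
    for sentence in tagged_sentences:
        for token, tag in sentence[:window]:
            if token == associated_word:
                anchor_tags.add(tag)

    counts = Counter()
    for sentence in tagged_sentences:
        seen = set()  # count each word at most once per sentence
        for token, tag in sentence[:window]:
            if token != associated_word and tag in anchor_tags and token not in seen:
                counts[token] += 1
                seen.add(token)
    return counts


tagged = [
    [("coca-cola", "n"), ("tastes", "v"), ("good", "a")],
    [("sprite", "n"), ("tastes", "v"), ("sweet", "a")],
    [("sprite", "n"), ("is", "v"), ("clear", "a")],
]
position_counts = same_position_counts(tagged, "coca-cola", window=1)
```

With a window of one token, only sentence-initial nouns are counted here, so sprite receives a count of 2.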
The preset association condition used for calculating the association index data may be a single one of the above preset association conditions, or multiple preset association conditions may be combined, each with its own weight, to calculate a final association value (i.e., the association index data). The relationship between the association value and the association is: the higher the association value, the greater the relevance of the text vocabulary to the associated word.
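Combining several preset association conditions with weights can be sketched as a weighted sum. This is only an illustrative assumption: the condition names, the weight values and the choice of a weighted sum (rather than another fusion operation) are not fixed by the application.

```python
def fuse_scores(per_condition_scores, weights):
    """Weighted sum of per-condition association values for each text vocabulary."""
    vocabulary = set()
    for scores in per_condition_scores.values():
        vocabulary.update(scores)

    fused = {}
    for word in vocabulary:
        # A word missing under some condition contributes 0 for that condition.
        fused[word] = sum(
            weights[name] * scores.get(word, 0)
            for name, scores in per_condition_scores.items()
        )
    return fused


# Hypothetical per-condition counts and weights.
scores = {
    "same_sentence": {"sprite": 5, "decayed tooth": 2},
    "same_position": {"sprite": 3},
}
weights = {"same_sentence": 1.0, "same_position": 2.0}
fused = fuse_scores(scores, weights)
```

Here sprite obtains 1.0 × 5 + 2.0 × 3 = 11.0, which is higher than the value for decayed tooth (2.0), so sprite would be treated as more strongly associated.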
Optionally, screening the associated index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain the screened associated vocabularies includes: taking the text vocabularies whose associated index data fall within a preset range as the screened associated vocabularies; or taking the text vocabularies whose associated index data rank in the top N among the plurality of text vocabularies as the screened associated vocabularies; or taking the text vocabularies whose vocabulary information indicates a preset part of speech as the screened associated vocabularies.
In the above embodiment, the web crawler crawls the web text from the target data source based on the associated words in the associated word set of the object to be analyzed; the web text is segmented to obtain a plurality of text vocabularies, and the vocabulary information of each text vocabulary is obtained. The associated index data of the text vocabularies, the part-of-speech information of the text vocabularies, or both are then screened according to the preset screening condition. The screening may take the text vocabularies whose associated index data fall within the preset range as the screened associated vocabularies, or take the text vocabularies whose associated index data rank in the top N among the plurality of text vocabularies, or take the text vocabularies whose vocabulary information indicates the preset part of speech. The associated word set is then updated using the screened associated vocabularies. Through this embodiment, different preset screening conditions can be set for screening the associated vocabularies, so that screening is flexible and effective and the different screening requirements of customers can be met.
Specifically, the preset filtering condition for determining the vocabulary of the word package (i.e. the above-mentioned associated word set) may include, but is not limited to, the following conditions:
The first optional preset screening condition is: all text vocabularies whose relevance value (i.e., the associated index data) falls within a certain interval (for example, the value of the associated index data is greater than a certain threshold, or lies between two preset values, etc.).
The second optional preset screening condition is: all text vocabularies whose relevance (i.e., the associated index data) ranks in the top N.
The third optional preset screening condition is: text vocabularies of a specified part of speech.
The associated index data of the plurality of text vocabularies or the part-of-speech information of the plurality of text vocabularies are screened according to the above preset screening conditions. The selected preset screening condition may be a single one of the above conditions or a combination of them; when conditions are combined, the intersection of the vocabularies screened under each condition is taken as the associated word set.
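The three optional screening conditions and their combination by intersection can be sketched as follows. The function name, the parameter names and the alphabetical tie-breaking rule in the top-N ranking are assumptions made for illustration.

```python
def screen(scores, pos_tags, min_score=None, max_score=None,
           top_n=None, allowed_pos=None):
    """Apply whichever screening conditions are given and intersect the results."""
    selected = set(scores)

    # Condition 1: associated index data within a preset interval.
    if min_score is not None or max_score is not None:
        low = float("-inf") if min_score is None else min_score
        high = float("inf") if max_score is None else max_score
        selected &= {w for w, s in scores.items() if low <= s <= high}

    # Condition 2: top N by associated index data (ties broken alphabetically).
    if top_n is not None:
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        selected &= set(ranked[:top_n])

    # Condition 3: specified part of speech.
    if allowed_pos is not None:
        selected &= {w for w in scores if pos_tags.get(w) in allowed_pos}

    return selected


# Hypothetical index data and part-of-speech tags.
scores = {"sprite": 11, "decayed tooth": 2, "drink": 7, "sweet": 6}
pos_tags = {"sprite": "n", "decayed tooth": "n", "drink": "v", "sweet": "a"}
screened = screen(scores, pos_tags, min_score=5, top_n=3, allowed_pos={"n", "a"})
```

In this example all three conditions are combined, and only sprite and sweet satisfy every one of them.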
In an alternative embodiment, before the relevance index data of the plurality of text vocabularies or the part of speech information of the plurality of text vocabularies is filtered according to the preset filtering condition, the relevance measurement values (i.e. the relevance index data) of the dictionary vocabularies (i.e. the text vocabularies) and the analysis object names (i.e. the relevance words) can be sorted. Specifically, the relevance indexes (i.e., the relevance index data) obtained by the text words in the text dictionary under the preset relevance conditions are sorted from high to low to serve as the subsequent screening content.
Optionally, updating the set of associated words using the filtered associated vocabulary includes: replacing the associated words by using the screened associated words to update the associated word set; or adding the screened associated words into the associated word set to update the associated word set.
Specifically, the screened associated vocabularies are used as word package vocabularies to establish a word package (i.e., the above-mentioned associated word set) for the object to be analyzed. In the next cycle of the above process, the word package (i.e., the associated word set) may replace the analysis object name (i.e., the associated word) when calculating the relevance of the dictionary vocabularies (i.e., the text vocabularies), so that the word package of the analysis object is expanded further and the accuracy of the relevance (i.e., the associated index data) calculation is continuously improved.
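The two update options (replacing the associated words, or adding to them) can be sketched as a small helper. The function name and the `mode` parameter are hypothetical.

```python
def update_word_set(word_set, screened, mode="add"):
    """Return the updated associated word set without mutating the inputs."""
    if mode == "replace":
        # The screened vocabularies replace the previous associated words.
        return set(screened)
    if mode == "add":
        # The screened vocabularies are merged into the existing set.
        return set(word_set) | set(screened)
    raise ValueError("mode must be 'add' or 'replace'")


current = {"coca-cola"}
added = update_word_set(current, ["sprite", "sweet"], mode="add")
replaced = update_word_set(current, ["sprite", "sweet"], mode="replace")
```

Under the "add" option the original analysis object name stays in the set, whereas under the "replace" option the next cycle would crawl with the screened vocabularies only.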
In an optional embodiment, as shown in fig. 2, the processing method of the related term set may specifically include the following steps:
Step S202, determining the associated words in the associated word set of the object to be analyzed.
Specifically, the object to be analyzed is determined, and the name of the object to be analyzed may be used as the initial content of the word package (i.e., the current associated word in the associated word set).
Step S203, crawling the web text and establishing an initial corpus.
Specifically, the web text may be crawled from target data sources based on current associated words in the associated word set of the object to be analyzed, where the target data sources may include websites, forums, posts, and the like.
Step S204, performing word segmentation on the web text to construct a text dictionary.
Specifically, the network text may be segmented to obtain a plurality of text vocabularies, vocabulary information of each text vocabulary is obtained, where the vocabulary information includes associated index data of each text vocabulary and a current associated word and/or part-of-speech information of each text vocabulary, and then a text dictionary including all text vocabularies in the network text is constructed.
Step S205, measuring and calculating the association index data of each text vocabulary and associated words in the text dictionary.
Specifically, the associated index data of the plurality of text vocabularies or the part-of-speech information of the plurality of text vocabularies may be screened according to a preset screening condition to obtain the screened associated vocabularies.
Step S206, sorting the associated index data of each text vocabulary in the text dictionary.
Specifically, the measured values of the association index data of each text vocabulary in the text dictionary may be sorted in order from high to low, so as to facilitate the subsequent screening process.
Optionally, the association (i.e., the above-mentioned associated index data) between each dictionary vocabulary (i.e., each text vocabulary contained in the text dictionary) and the analysis object name (i.e., the above-mentioned associated word) may be calculated according to preset association conditions, which may include, but are not limited to:
The number of times that the dictionary vocabulary and the analysis object name (i.e., the above-described associated word) co-occur within a sentence (or a passage, an article, etc.) of the web text.
The number of times that the dictionary vocabulary and the analysis object name (i.e., the above-described associated word) appear with the same part of speech at the same position in sentences of the web text.
The preset association condition used for calculating the association index data may be a single one of the above preset association conditions, or multiple preset association conditions may be combined, each with its own weight, to calculate a final association value (i.e., the association index data). The relationship between the association value and the association is: the higher the association value, the greater the relevance of the text vocabulary to the current associated word.
Step S207, setting preset screening conditions, and screening text vocabularies in the text dictionary.
Specifically, the preset filtering condition for determining the vocabulary of the word package (i.e. the above-mentioned associated word set) may include, but is not limited to, the following conditions:
The first optional preset screening condition is: all text vocabularies whose relevance value (i.e., the associated index data) falls within a certain interval (for example, the value of the associated index data is greater than a certain threshold, or lies between two preset values, etc.).
The second optional preset screening condition is: all text vocabularies whose relevance (i.e., the associated index data) ranks in the top N.
The third optional preset screening condition is: text vocabularies of a specified part of speech.
The associated index data of the plurality of text vocabularies or the part-of-speech information of the plurality of text vocabularies are screened according to the above preset screening conditions. The selected preset screening condition may be a single one of the above conditions or a combination of them; when conditions are combined, the intersection of the vocabularies screened under each condition is taken as the associated word set.
Step S208, establishing the associated word set.
In particular, the set of related words may be updated using the filtered out related vocabulary.
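Steps S202 to S208, run for several cycles, can be sketched end to end as follows. This is a simplified illustration under stated assumptions: `crawl` is a hypothetical callable standing in for the web crawler, whitespace splitting stands in for word segmentation, sentence co-occurrence is the only association condition, and top-N is the only screening condition.

```python
import re
from collections import Counter


def build_word_set(object_name, crawl, rounds=2, top_n=2):
    word_set = {object_name}                              # S202: initial associated word
    for _ in range(rounds):
        corpus = crawl(word_set)                          # S203: crawl web text
        scores = Counter()
        for document in corpus:
            for sentence in re.split(r"[.!?]+", document):
                tokens = set(sentence.lower().split())    # S204: stand-in segmentation
                if tokens & word_set:                     # S205: sentence co-occurrence
                    scores.update(tokens - word_set)
        ranked = sorted(scores, key=lambda w: (-scores[w], w))  # S206: sort high to low
        screened = ranked[:top_n]                         # S207: top-N screening
        word_set |= set(screened)                         # S208: update by adding
    return word_set


def fake_crawl(words):
    # Hypothetical stub that returns a fixed corpus instead of crawling the web.
    return ["cola sprite fizzy. cola sprite sweet. sprite lemon fresh. sprite lemon."]


result = build_word_set("cola", fake_crawl, rounds=2, top_n=1)
```

In the first cycle only sentences containing cola are scored, so sprite is added; in the second cycle sentences containing sprite also count, which brings in lemon. This mirrors how each cycle can widen the coverage of the word set.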
Compared with the existing word package accumulation method, the word package accumulation method adopted by the embodiment of the application has the following advantages: the vocabulary in the associated word set grows quickly, and the efficiency of word package accumulation is significantly improved; whether an association exists between the word package words and the analysis object can be measured quantitatively; the preset association conditions used for calculating the association between the word package words and the analysis object can be set flexibly, and the calculation can combine several conditions; the values of the associated index data can be sorted before screening, so that the preset screening conditions on the associated index data can be set flexibly and applied in combination; and the above word package accumulation process can be run cyclically, with the word package (i.e., the associated word set) generated in the previous cycle replacing the analysis object name (i.e., the associated word) in the current cycle, so that the content of the word package (i.e., the content of the associated word set) is continuously expanded, its accuracy is improved, and its coverage is enlarged.
Example 2
According to an embodiment of the present application, there is further provided an embodiment of a processing apparatus for an associated word set. As shown in fig. 3, the processing apparatus includes: a crawling unit 10, a processing unit 30, a screening unit 50 and an updating unit 70.
The crawling unit 10 is configured to crawl a web text from a target data source based on associated terms in an associated term set of an object to be analyzed.
The processing unit 30 is configured to perform word segmentation on the web text to obtain a plurality of text words, and obtain word information of each text word, where the word information includes association index data of each text word and/or part-of-speech information of each text word, and the association index data is used to indicate association degrees of each text word and associated words.
The screening unit 50 is configured to screen the associated index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain the screened associated vocabularies.
The updating unit 70 is configured to update the associated word set by using the screened associated vocabularies.
Optionally, the processing unit comprises: a creating module and a determining module.
The creating module is configured to create a text dictionary of the plurality of text vocabularies after the web text is segmented to obtain the plurality of text vocabularies; and the determining module is configured to determine the association index data of each text vocabulary in the text dictionary according to the preset association condition and/or to extract the part-of-speech information of each text vocabulary in the text dictionary.
With the present application, after the web crawler crawls the web text from the target data source based on the current associated words in the associated word set of the object to be analyzed, the web text is segmented to obtain a plurality of text vocabularies and the vocabulary information of each text vocabulary is obtained; the associated index data of the text vocabularies or the part-of-speech information of the text vocabularies is screened according to the preset screening condition, and the associated word set is updated using the screened associated vocabularies. Through this embodiment, web text obtained by indiscriminate crawling can be segmented and screened to obtain the screened associated vocabularies for updating the associated word set; the word segmentation and screening are then performed repeatedly, and the associated word set is continuously expanded and updated. This solves the problem that the existing word package accumulation method yields a small vocabulary and achieves the effect of perfecting the associated word set of the object to be analyzed.
Optionally, the determining module includes: a first computation submodule and a second computation submodule.
The first calculation submodule is used for acquiring a relevance value of each text vocabulary corresponding to a preset relevance condition if the preset relevance condition is one, so as to obtain relevance index data of each text vocabulary; and the second calculation submodule is used for acquiring the relevance numerical value of each text vocabulary corresponding to each preset relevance condition if the preset relevance condition is multiple, performing fusion operation on all the relevance numerical values of each text vocabulary, and taking the fusion result as the relevance index data of each text vocabulary, wherein the fusion operation comprises at least one of weighting calculation, addition calculation and multiplication-division calculation.
In the above embodiment, after performing word segmentation on a web text crawled from a target data source to obtain a plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies, determining association index data of each text vocabulary in the text dictionary and a current associated word according to a preset association condition, and if the preset association condition is one, calculating an association numerical value of each text vocabulary according to the preset association condition to obtain association index data of each text vocabulary and the current associated word; if the preset association condition is multiple, acquiring the association numerical value of each text vocabulary corresponding to each preset association condition, fusing all the association numerical values of each text vocabulary, taking the fused result as the association index data of each text vocabulary, screening the association index data of the text vocabularies or the part of speech information of the text vocabularies according to the preset screening condition to obtain the screened association vocabularies, and updating the association word set by using the screened association vocabularies. Through the embodiment, the association index data of each text vocabulary and the current association terms can be acquired by adopting the preset association conditions with different weights, so that the effect of flexibly acquiring the association index data can be achieved.
Optionally, the determining module may include: a determining submodule configured to take the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary, wherein the preset association condition includes: each text vocabulary and the associated word appear in the same sentence of the web text at the same time; and/or each text vocabulary and the associated word appear with the same part of speech at the same position in sentences of the web text.
In the above embodiment, determining the preset association condition referred to by the association index data of each text vocabulary and the current associated word in the text dictionary may include: the number of times that each text vocabulary and the current associated word appear in the same sentence of the web text at the same time; or the times that each text vocabulary and the current associated word appear at the same position in the sentence of the web text in the same part of speech in the web text; or the combination of the two preset association conditions is the number of times that each text vocabulary and the current associated word appear in the same sentence of the web text at the same time, and the number of times that each text vocabulary and the current associated word appear in the same position in the sentence of the web text with the same part of speech in the web text. Through the embodiment, the association index data of each text vocabulary and the current associated word in the text dictionary can be effectively and accurately determined through the preset association condition.
Optionally, the screening unit may include: a first screening module, a second screening module and a third screening module. The first screening module is configured to take the text vocabularies whose associated index data fall within the preset range as the screened associated vocabularies; or the second screening module is configured to take the text vocabularies whose associated index data rank in the top N among the plurality of text vocabularies as the screened associated vocabularies; or the third screening module is configured to take the text vocabularies whose vocabulary information indicates the preset part of speech as the screened associated vocabularies.
In the above embodiment, the web crawler crawls the web text from the target data source based on the current associated word in the associated word set of the object to be analyzed; the web text is segmented to obtain a plurality of text vocabularies, and the vocabulary information of each text vocabulary is obtained. The associated index data of the text vocabularies, the part-of-speech information of the text vocabularies, or both are then screened according to the preset screening condition. The screening may take the text vocabularies whose associated index data fall within the preset range as the screened associated vocabularies, or take the text vocabularies whose associated index data rank in the top N among the plurality of text vocabularies, or take the text vocabularies whose vocabulary information indicates the preset part of speech. The associated word set is then updated using the screened associated vocabularies. Through this embodiment, different preset screening conditions can be set for screening the associated vocabularies, so that screening is flexible and effective and the different screening requirements of customers can be met.
Optionally, the update unit includes: a first update module and a second update module.
The first updating module is used for replacing the associated words by using the screened associated words so as to update the associated word set; or the second updating module is used for adding the screened associated words into the associated word set so as to update the associated word set.
Specifically, the screened associated vocabularies are used as word package vocabularies to establish a word package (i.e., the above-mentioned associated word set) for the object to be analyzed. In the next cycle of the above process, the word package (i.e., the associated word set) may replace the analysis object name (i.e., the associated word) when calculating the relevance of the dictionary vocabularies (i.e., the text vocabularies), so that the word package of the analysis object is expanded further and the accuracy of the relevance (i.e., the associated index data) calculation is continuously improved.
The processing device for the associated word set comprises a processor and a memory, the above-mentioned crawling unit 10, the processing unit 30, the filtering unit 50, the updating unit 70 and the like are all stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the web text obtained by indiscriminate crawling is segmented and screened to obtain the screened associated vocabularies for updating the associated word set, and the word segmentation and screening are then performed repeatedly to continuously expand and update the associated word set. This solves the problem that the existing word package accumulation method yields a small vocabulary and achieves the effect of perfecting the associated word set of the object to be analyzed.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application further provides a computer program product adapted to perform program code for initializing the following method steps when executed on a data processing device: crawling a web text from a target data source based on associated words in an associated word set of an object to be analyzed; the method comprises the steps of segmenting a network text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises associated index data of each text vocabulary and associated words and/or part-of-speech information of each text vocabulary; screening the associated index data of the plurality of text vocabularies or the part of speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies; and updating the associated word set by using the screened associated vocabulary.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered to fall within the protection scope of the present application.

Claims (8)

1. A processing method for an associated word set, characterized by comprising the following steps:
crawling a web text from a target data source based on associated words in an associated word set of an object to be analyzed;
performing word segmentation on the web text to obtain a plurality of text vocabularies, and acquiring vocabulary information of each text vocabulary, wherein the vocabulary information comprises associated index data of each text vocabulary and/or part-of-speech information of each text vocabulary, the associated index data is used for indicating the association degree of each text vocabulary and the associated words, the associated index data of each text vocabulary is determined by preset association conditions, and the preset association conditions comprise: each text vocabulary and the associated words appear in the same sentence of the web text at the same time; and/or each of the text vocabularies and the associated word appear at the same position in the sentence of the web text with the same part of speech within the web text;
screening the associated index data of the plurality of text vocabularies and/or the part of speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies;
updating the associated word set by using the screened associated vocabulary;
wherein performing word segmentation on the web text to obtain a plurality of text vocabularies and acquiring vocabulary information of each text vocabulary comprises:
after the web text is subjected to word segmentation to obtain the plurality of text vocabularies, creating a text dictionary of the plurality of text vocabularies;
and determining the association index data of each text vocabulary in the text dictionary according to a preset association condition, and/or extracting the part-of-speech information of each text vocabulary in the text dictionary.
2. The processing method of claim 1, wherein determining the relevance index data of each text vocabulary in the text dictionary according to a preset relevance condition comprises:
if the preset association condition is one, acquiring an association numerical value of each text vocabulary corresponding to the preset association condition to obtain association index data of each text vocabulary;
if the preset association condition is multiple, acquiring an association numerical value of each text vocabulary corresponding to each preset association condition, performing fusion operation on all the association numerical values of each text vocabulary, and taking a fusion result as association index data of each text vocabulary, wherein the fusion operation comprises at least one of weighted calculation, addition calculation and multiplication-division calculation.
3. The processing method of claim 1, wherein determining the relevance index data of each text vocabulary in the text dictionary according to a preset relevance condition comprises:
and taking the times of the text vocabularies meeting the preset association conditions as association index data of the text vocabularies.
4. The processing method according to claim 1, wherein the step of filtering the relevance index data of the plurality of text vocabularies and/or the part of speech information of the plurality of text vocabularies according to a preset filtering condition to obtain filtered relevance vocabularies comprises:
taking the text vocabulary of which the associated index data are in a preset range as the screened associated vocabulary; or
taking the text vocabularies whose associated index data rank in the top N among the plurality of text vocabularies as the screened associated vocabularies; or
taking the text vocabularies whose vocabulary information indicates a preset part of speech as the screened associated vocabularies.
5. The processing method according to any one of claims 1 to 4, wherein updating the associated word set using the screened associated vocabularies comprises:
replacing the associated words with the screened associated vocabularies to update the associated word set;
or
adding the screened associated vocabularies to the associated word set to update the associated word set.
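The two update modes of claim 5 can be sketched in a few lines. The function name and the `mode` parameter are hypothetical, and "replace" is read here as replacing the whole set, which is one possible interpretation of the claim.

```python
def update_associated_words(associated_words, screened, mode="add"):
    """Return the updated associated word set.

    mode="replace": the screened vocabularies become the new set
                    (assumed reading of the replacement alternative).
    mode="add":     the screened vocabularies are merged into the set.
    """
    if mode == "replace":
        return set(screened)
    if mode == "add":
        return set(associated_words) | set(screened)
    raise ValueError(f"unknown update mode: {mode}")
```

Returning a new set rather than mutating the input keeps the previous associated word set available for comparison between iterations.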
6. A processing apparatus for an associated word set, comprising:
a crawling unit, configured to crawl web text from a target data source based on the associated words in the associated word set of an object to be analyzed;
a processing unit, configured to perform word segmentation on the web text to obtain a plurality of text vocabularies, and to acquire vocabulary information of each text vocabulary, wherein the vocabulary information includes association index data of each text vocabulary and/or part-of-speech information of each text vocabulary, the association index data indicates a degree of association between each text vocabulary and the associated word, the association index data of each text vocabulary is determined by a preset association condition, and the preset association condition includes: each text vocabulary and the associated word appear together in the same sentence of the web text; and/or each text vocabulary and the associated word appear, with the same part of speech, at the same position in sentences of the web text;
a screening unit, configured to screen the association index data of the plurality of text vocabularies and/or the part-of-speech information of the plurality of text vocabularies according to a preset screening condition to obtain screened associated vocabularies;
an updating unit, configured to update the associated word set using the screened associated vocabularies;
wherein the processing unit comprises:
a creating module, configured to create a text dictionary of the plurality of text vocabularies after word segmentation is performed on the web text to obtain the plurality of text vocabularies; and
a determining module, configured to determine the association index data of each text vocabulary in the text dictionary according to the preset association condition and/or to extract the part-of-speech information of each text vocabulary in the text dictionary.
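The units of claim 6 can be read as a pipeline: crawl, segment, index, screen, update. The sketch below is one hedged way to wire them together. The `crawl` and `segment` callables are placeholders supplied by the caller (no real crawler or word segmenter is implied), and the choice of the same-sentence condition plus top-N screening and the "add" update mode is an assumption.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def process_associated_word_set(
    associated_words: Set[str],
    crawl: Callable[[str], Iterable[str]],   # hypothetical: word -> sentences
    segment: Callable[[str], List[str]],     # hypothetical: sentence -> tokens
    top_n: int = 3,
) -> Set[str]:
    """Run one crawl / segment / index / screen / update pass."""
    counts: Counter = Counter()
    for word in associated_words:
        for sentence in crawl(word):              # crawling unit
            tokens = segment(sentence)            # processing unit: segmentation
            if word not in tokens:
                continue
            # Determining module: same-sentence co-occurrence count.
            counts.update(t for t in set(tokens) if t not in associated_words)
    # Screening unit: keep the top_n vocabularies (ties broken alphabetically).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    screened = [token for token, _ in ranked[:top_n]]
    # Updating unit: add the screened vocabularies to the set.
    return set(associated_words) | set(screened)
```

Passing the crawler and segmenter in as callables keeps the sketch independent of any particular data source or segmentation library.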
7. The processing apparatus of claim 6, wherein the determining module comprises:
a first calculation submodule, configured to, if there is one preset association condition, acquire the association value of each text vocabulary corresponding to that preset association condition to obtain the association index data of each text vocabulary; and
a second calculation submodule, configured to, if there are multiple preset association conditions, acquire the association value of each text vocabulary corresponding to each preset association condition, perform a fusion operation on all the association values of each text vocabulary, and take the fusion result as the association index data of each text vocabulary, wherein the fusion operation comprises at least one of weighted calculation, addition calculation, and multiplication-division calculation.
8. The processing apparatus of claim 6, wherein the determining module comprises:
a determining submodule, configured to take the number of times each text vocabulary meets the preset association condition as the association index data of that text vocabulary.
CN201510726038.1A 2015-10-29 2015-10-29 Processing method and device of associated word set Active CN106649334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510726038.1A CN106649334B (en) 2015-10-29 2015-10-29 Processing method and device of associated word set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510726038.1A CN106649334B (en) 2015-10-29 2015-10-29 Processing method and device of associated word set

Publications (2)

Publication Number Publication Date
CN106649334A CN106649334A (en) 2017-05-10
CN106649334B true CN106649334B (en) 2020-09-15

Family

ID=58830513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510726038.1A Active CN106649334B (en) 2015-10-29 2015-10-29 Processing method and device of associated word set

Country Status (1)

Country Link
CN (1) CN106649334B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984514A (en) * 2017-06-05 2018-12-11 中兴通讯股份有限公司 Acquisition methods and device, storage medium, the processor of word
CN108984570A (en) * 2017-06-05 2018-12-11 北京国双科技有限公司 There are the merging method and device of intersection set
CN108984573A (en) * 2017-06-05 2018-12-11 北京国双科技有限公司 There are the merging method and device of intersection set
CN107908654B (en) * 2017-10-12 2021-12-07 广州艾媒数聚信息咨询股份有限公司 Knowledge base-based recommendation method, system and device
CN109087163B (en) * 2018-07-06 2021-07-09 创新先进技术有限公司 Credit assessment method and device
TWI681304B (en) * 2018-12-14 2020-01-01 財團法人工業技術研究院 System and method for adaptively adjusting related search words
CN109885696A (en) * 2019-02-01 2019-06-14 杭州晶一智能科技有限公司 A kind of foreign language word library construction method based on self study
CN109902295A (en) * 2019-02-01 2019-06-18 杭州晶一智能科技有限公司 A kind of foreign language word library self-training method based on the network information

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8782082B1 (en) * 2011-11-07 2014-07-15 Trend Micro Incorporated Methods and apparatus for multiple-keyword matching

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5026192B2 (en) * 2007-08-20 2012-09-12 株式会社リコー Document creation system, user terminal, server device, and program
CN103324641B (en) * 2012-03-23 2016-07-13 日电(中国)有限公司 Information record recommendation method and device
US9378504B2 (en) * 2012-07-18 2016-06-28 Google Inc. Highlighting related points of interest in a geographical region
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device
CN104360993B (en) * 2014-11-19 2018-03-30 广州极盛信息科技开发有限公司 A kind of method from content needed for Text Feature Extraction
CN104462439B (en) * 2014-12-15 2017-12-19 北京国双科技有限公司 The recognition methods of event and device
CN104408191B (en) * 2014-12-15 2017-11-21 北京国双科技有限公司 The acquisition methods and device of the association keyword of keyword
CN104765830B (en) * 2015-04-13 2018-11-20 天脉聚源(北京)传媒科技有限公司 A kind of information search method and device

Also Published As

Publication number Publication date
CN106649334A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN106649334B (en) Processing method and device of associated word set
US11620455B2 (en) Intelligently summarizing and presenting textual responses with machine learning
Srinath et al. Privacy at scale: Introducing the PrivaSeer corpus of web privacy policies
CN106815207B (en) Information processing method and device for legal referee document
CN105183781B (en) Information recommendation method and device
US8566303B2 (en) Determining word information entropies
Guzman et al. On-line relevant anomaly detection in the Twitter stream: an efficient bursty keyword detection model
EP2755146A1 (en) Identification of significant terms in data
CN104809108B (en) Information monitoring analysis system
CN104978314B (en) Media content recommendations method and device
TW201214169A (en) Recognition of target words using designated characteristic values
CN106202126B (en) A kind of data analysing method and device for logistics monitoring
CN107578263A (en) A kind of detection method, device and the electronic equipment of advertisement abnormal access
CN110134845A (en) Project public sentiment monitoring method, device, computer equipment and storage medium
CN104142913A (en) Distinguishing method and distinguishing system for polarities of words and expressions
CN102880647A (en) Method and device for acquiring another name of organization
EP3460704A1 (en) Virus database acquisition method and device, equipment, server and system
CN109635084A (en) A kind of real-time quick De-weight method of multi-source data document and system
CN109697231A (en) A kind of display methods, system, storage medium and the processor of case document
CN108984514A (en) Acquisition methods and device, storage medium, the processor of word
Zarrad et al. The evaluation of the public opinion-a case study: Mers-cov infection virus in ksa
CN109471934B (en) Financial risk clue mining method based on Internet
CN108319606B (en) Construction method and device of professional database
Gupta et al. Identifying radical social media posts using machine learning
CN111859146A (en) Information mining method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant