AU2019101463A4 - Method of searching and mining of social information on Internet based on Elasticsearch

Info

Publication number: AU2019101463A4
Application number: AU2019101463A
Authority: AU
Inventors: Songhao Li; Donghang Sui; Qixin You
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2020-01-23
Anticipated expiration: 2027-11-27

Abstract

This invention lies in the field of information retrieval. It helps search and analyse social network information based on Elasticsearch. The invention consists of the following steps. Firstly, we use a web crawler to acquire a sufficient amount of information from social media. Secondly, we use the IK analyser to separate each meaningful word or phrase. Thirdly, we import the data into Elasticsearch and create an inverted index. By constantly adjusting the process of creating the inverted index, the model is tuned toward optimal performance. Finally, when keywords are input, the system displays the relevant information in ranked order. In brief, this invention can automatically select the information of a particular field that you are interested in.

[Figure 1: text mining model, from text source through text analysis (word segmentation, data processing, date processing, title recognition, POS tagging), feature extraction, searching and classifying, and clustering filtering to UI and presentation]

[Figure 2: all texts, relevant texts, retrieved texts, and retrieved and relevant texts]

Description

2019101463 27 Nov 2019

Title

Method of searching and mining of social information on Internet based on Elasticsearch

Field of The Invention

The invention relates to the field of information retrieval and uses Elasticsearch to search and analyse Internet social data.

Background

With the development of the big data era, the volume of Internet information keeps expanding. How Internet users can search accurately and obtain the information they actually need has become an urgent problem and an important research direction for the Internet industry. Elasticsearch (ES) is a real-time distributed search and analysis engine, an open-source full-text search service based on Lucene, and a popular enterprise search engine. ES uses Lucene as its internal engine, so full-text search only requires calling its unified API. In addition, ES provides distributed storage and indexes every field so that it can be searched.

SUMMARY

The first step is data collection and data import.

Using a crawler is the easiest way to collect data. A crawler, also known as a web crawler, is a back-end script that collects data from the Internet, and it is the basis of data analysis and data mining. A crawler has one key advantage: it can be reused. Once one program is designed, it can be run repeatedly to collect all the required data.

There are three ways to import data.

The first way is to use a Java or Python program to format the data and then put it into the search engine.

The second way is to use a plug-in. Many search engines have their own plug-ins that support importing formatted data into the search engine.

The third way is to use a tool. Such tools let data flow into the search engine automatically, but they impose many requirements. For example, Logstash as used here requires the file to be JSON, so the data has to be formatted before the tool is used.
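The first import route can be sketched in Python. The example below is a minimal sketch, assuming the crawled posts are already available as dictionaries; the field names (user, text, posted), the index name social_posts and the helper to_bulk_ndjson are hypothetical. It renders the records as newline-delimited JSON in the action-line/document-line layout that the Elasticsearch bulk API accepts, without contacting any server.

```python
import json


def to_bulk_ndjson(records, index_name):
    """Render records as newline-delimited JSON: one action line, then one document line."""
    lines = []
    for doc_id, record in enumerate(records, start=1):
        # Action line: tells the search engine which index and id the document belongs to.
        action = {"index": {"_index": index_name, "_id": str(doc_id)}}
        lines.append(json.dumps(action, ensure_ascii=False))
        # Document line: the crawled record itself.
        lines.append(json.dumps(record, ensure_ascii=False))
    # The bulk format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical crawled posts, used only for illustration.
crawled_posts = [
    {"user": "alice", "text": "Elasticsearch is a search engine", "posted": "2019-11-01"},
    {"user": "bob", "text": "社交网络信息检索", "posted": "2019-11-02"},
]

payload = to_bulk_ndjson(crawled_posts, "social_posts")
print(payload)
```

The resulting text could then be written to a file or sent by whichever client or tool performs the import.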

The second step is to create the index.

The creation of the index library is the core technology of a search engine. Given the huge amount of Internet information obtained by the crawler, quickly finding all web pages that contain the user's query words is the job of the indexer. The main function of the indexing part is to transform the massive crawled information into a structured form that is convenient to retrieve, and to store it in the index library. Web pages on the Internet can be treated as documents. To improve the user's experience and return results quickly, one strategy is to build a word-to-document structure in the index library, which is the inverted index. With an inverted index, we only need to match the search term against the index terms to find the list of documents containing that term, so the required documents can be found quickly from the user's search term.

The following describes several relatively common methods of index library generation.

The first is the two-pass document traversal method. As the name implies, this strategy scans the documents twice. The first scan mainly collects some relevant information and does not generate index entries; its main function is resource preparation. In the second pass, the inverted index for each document is generated. Because the document set has to be scanned twice, this method is inefficient and is not commonly used.

The second is the sorting method. Because the two-pass traversal method builds the whole index in memory, it consumes a lot of memory and may run out of it, in which case the index cannot be built. The sorting method improves on this. It always allocates a fixed amount of memory. When the allocated memory is used up, the intermediate result is written to disk and the memory is cleared for the next round of indexing. This strategy can index document sets of any size, because the memory it needs is fixed and its requirements on memory size are less strict.

The last is the merging method. The sorting method writes only the intermediate results to disk, while the dictionary stays in memory. As more and more documents are indexed, the dictionary grows and occupies more and more of the fixed memory, leaving too little for the intermediate results. The merging method improves on this: not only the intermediate results but also the dictionary is written to disk, so that all of the fixed memory can be used in the subsequent indexing process.
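The idea of indexing with a fixed amount of memory and writing intermediate results to disk can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used by any search engine: documents are whitespace-separated text, the buffer limit is counted in postings rather than bytes, and the function name build_index_with_spills is hypothetical. For the sake of a small demonstration the merged index is returned in memory; a real indexer would stream it back to disk.

```python
import heapq
import os
import tempfile


def build_index_with_spills(docs, max_postings_in_memory):
    """Build an inverted index using a bounded buffer of (term, doc_id) postings."""
    run_paths = []
    buffer = []

    def spill():
        # Sort the buffered postings and write them to disk as one sorted run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for term, doc_id in buffer:
                handle.write(f"{term}\t{doc_id}\n")
        run_paths.append(path)
        buffer.clear()  # free the fixed-size buffer for the next batch

    for doc_id, text in docs:
        for term in set(text.lower().split()):
            buffer.append((term, doc_id))
            if len(buffer) >= max_postings_in_memory:
                spill()
    if buffer:
        spill()

    def read_run(path):
        # Stream one sorted run back from disk.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                term, doc_id = line.rstrip("\n").split("\t")
                yield term, int(doc_id)

    # Merge all sorted runs into posting lists ordered by term and document id.
    index = {}
    for term, doc_id in heapq.merge(*(read_run(path) for path in run_paths)):
        index.setdefault(term, []).append(doc_id)
    for path in run_paths:
        os.remove(path)
    return index


# Hypothetical documents, used only for illustration.
docs = [(1, "social search engine"), (2, "search index"), (3, "social index engine")]
index = build_index_with_spills(docs, max_postings_in_memory=4)
print(index["search"])
```

With a limit of four postings the three documents produce two runs on disk, which are then merged into one index.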

As we create the index, we want it to be more efficient, so the method should be optimized. Optimizing the index directly improves system performance. The index is optimized here from two aspects: the process of indexing data and the process of retrieving data.

(1) Optimization of the data indexing process. When importing data, we build indexes according to the characteristics of the data. Improper settings make indexing very slow, so we need to study how the search engine builds its index in order to optimize it. Many search engines are distributed, so the data is spread across nodes while the index is built, and each write is first recorded in the transaction log (translog) before it is committed to the index. The following settings are used:

index.translog.flush_threshold_ops: 100000
index.refresh_interval: -1
number_of_replicas: 0

The first parameter sets the flush trigger: when the number of operations recorded in the translog reaches this value, a flush is performed. The default value of the system is 5000. Flushing takes a lot of time and hardware resources, so we can set this value higher, or even turn off automatic flushing and flush manually according to actual needs.

The second parameter sets the refresh frequency, that is, the interval at which the index is refreshed on schedule. The default value is 1s. While data is being imported and the index has not yet been refreshed, the data cannot be retrieved; it becomes searchable only after it is committed to the system. Therefore, the index needs to be refreshed regularly to ensure real-time retrieval. When massive data is imported for the first time, we can temporarily turn off the scheduled refresh, refresh manually after the index is built, and then set the refresh interval according to actual demand.

The third parameter sets the number of index copies. To ensure reliable data storage, the search engine keeps copies of the data. During index building, the data is synchronized to the copies immediately, which costs extra time. Therefore, when the system is indexing large amounts of data, we can set the number of copies to 0. After indexing is completed, we can back the index up and set the number of copies according to the actual situation. Through the above three settings, we can improve the efficiency of index building and reduce the time the indexing process takes.

(2) Optimization of the data retrieval process. Retrieval speed is closely related to index quality. The index quality of a search engine mainly depends on the number of shards, the number of copies, and the number of index segments. The number of shards directly affects retrieval speed. With too few shards, each shard holds too much index data; with too many shards, more index files have to be opened during retrieval and more communication is needed between shards. Therefore, a reasonable number of shards should be set to improve the retrieval performance of the system. The number of shards should be set to the total data volume divided by the data volume of a single shard. The number of copies directly affects index stability. If a shard is lost, its data can be recovered immediately from a copy, which ensures the integrity of the data. However, too many copies lead to poor retrieval performance, so we need to balance retrieval efficiency against data security.
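The three settings listed above and the shard-count rule can be written down as data. This is a minimal sketch: the dictionary layout, the helper names settings_after_import and estimate_shard_count, and the sizes in the usage lines are assumptions made for illustration. The translog threshold setting is taken from the text and may not be accepted by every Elasticsearch version. The sketch only builds the request bodies; it does not send them.

```python
import json
import math

# Settings applied before a large import (values taken from the text above).
BULK_IMPORT_SETTINGS = {
    "index": {
        "translog.flush_threshold_ops": 100000,  # raise the translog trigger threshold
        "refresh_interval": "-1",                # disable scheduled refresh during import
        "number_of_replicas": 0,                 # no copies while indexing
    }
}


def settings_after_import(refresh_interval="1s", replicas=1):
    """Settings to restore once the import is done (defaults here are assumptions)."""
    return {"index": {"refresh_interval": refresh_interval, "number_of_replicas": replicas}}


def estimate_shard_count(total_data_gb, target_shard_gb):
    """Total data volume divided by the data volume of a single shard, rounded up."""
    return max(1, math.ceil(total_data_gb / target_shard_gb))


print(json.dumps(BULK_IMPORT_SETTINGS, indent=2))
print(json.dumps(settings_after_import(), indent=2))
print(estimate_shard_count(total_data_gb=300, target_shard_gb=30))
```

Restoring the refresh interval and the number of copies after the import corresponds to the manual refresh and backup steps described in the text.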

DESCRIPTION OF DRAWINGS

Figure 1 is the structure of our text mining model.

Figure 2 is an example of model quality evaluation.

Figure 3 shows the differences between forward index and inverted index.

Figure 4 is the structure of forward index and inverted index.

Figure 5 is the inverted index according to the searching result in Chinese.

DESCRIPTION OF PREFERRED EMBODIMENT

Data analysis and retrieval

Text mining model

The search engine itself supports the vector space model, TF-IDF, scrolling, similarity scoring and other text functions. Therefore, if collaborative data can be transformed into a vector space, building a recommendation engine on top of the search engine becomes an interesting possibility. Figure 1 shows the structure of our text mining model.

Our text mining model can be used in content-based recommendation systems, automatic news aggregation, releasing internet public opinion information, automatic question answering, machine translation, spam filtering and so forth.

There are four steps to achieve text analysis and extraction:

Text analysis

Feature extraction

Searching and classifying

Clustering filtering

After these steps, we obtain the UI and presentation of the search results. Users can then search and browse the results that the system feeds back to them.

Text analysis

Word segmentation is the first part of text analysis. Data processing, date processing, title recognition and POS tagging all contribute to word segmentation. However, traditional methods have difficulty segmenting Chinese texts into words. Unlike English, Chinese characters can represent different meanings and express different emotions. Besides, different word orders may express the same meaning, even without standard syntax.

Because Chinese differs from English in having no clear separator between words, the first step is to use a word segmentation system to automatically segment documents into word sequences. In this way, each document is transformed into a data stream composed of a sequence of words. To analyse Chinese semantic phrases, the sentences must be divided into words or N-grams. There are several methods of word segmentation.

Maximum Matching (MM) method: Take a string of 6-8 Chinese characters as the maximum string and match it against the entries in the dictionary. If there is no match, drop one Chinese character and continue matching until a corresponding word is found in the dictionary. The matching direction is from left to right.

Reverse Maximum Matching (RMM) method: The matching direction is opposite to that of the MM method, from right to left. Experimental results show that, for Chinese, the reverse maximum matching method is more effective than the maximum matching method.

Bi-direction Matching method: Compare the word segmentation results of MM method and RMM method to determine the correct word segmentation.
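The three dictionary-based methods can be sketched as follows. This is a minimal sketch, assuming the usual convention that forward maximum matching scans from left to right and reverse maximum matching scans from right to left. The tiny dictionary, the fallback of accepting an unmatched single character as a word, and the tie-breaking rule in bidirectional_match (fewer words, then fewer single-character words) are assumptions chosen for illustration.

```python
def forward_max_match(text, dictionary, max_len=8):
    """Scan left to right, always taking the longest dictionary word that fits."""
    words, start = [], 0
    while start < len(text):
        # Try the longest candidate first, then shorten it one character at a time.
        for size in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                start += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=8):
    """Scan right to left, always taking the longest dictionary word that fits."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def bidirectional_match(text, dictionary, max_len=8):
    """Compare both results; prefer fewer words, then fewer single characters."""
    forward = forward_max_match(text, dictionary, max_len)
    backward = reverse_max_match(text, dictionary, max_len)

    def cost(words):
        return (len(words), sum(1 for word in words if len(word) == 1))

    return forward if cost(forward) < cost(backward) else backward


# Hypothetical dictionary, used only for illustration.
dictionary = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命的起源"
print(forward_max_match(sentence, dictionary))   # ['研究生', '命', '的', '起源']
print(reverse_max_match(sentence, dictionary))   # ['研究', '生命', '的', '起源']
print(bidirectional_match(sentence, dictionary)) # ['研究', '生命', '的', '起源']
```

In this example the reverse scan yields the more natural segmentation, which is the kind of case the comparison step is meant to catch.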

Optimum Matching method: The words in the dictionary are arranged according to their frequency in the text. Words with high frequency are placed first and words with low frequency are placed later, so as to improve the matching speed.

All in all, segmenting Chinese text into words is based on the dictionary: find the matching strings made up of Chinese characters, then assemble them into Chinese words that carry the right meaning. According to our research, the Optimum Matching method is the most efficient, so we choose this method and create the inverted index lists.

Feature extraction

The typical problem of information retrieval research is to locate relevant documents in a document collection based on the user's query (keywords describing the required information).

There are two measures for judging whether a search engine is good.

Precision: The precision rate is the percentage of relevant documents in the retrieved documents, which measures the accuracy of the retrieval system.

Recall: The recall rate is the percentage of all relevant documents that are retrieved, which measures the comprehensiveness of the retrieval system.

Nowadays, precision is often used as the standard for measuring whether a search engine is good. The reason is that a word with a very high frequency is not necessarily useful for distinguishing between texts.

Figure 2 shows an example of model quality evaluation. From it we obtain the following information:

{relevant} = {A, B, C, D, E, F, G, H, I, J}, 10 texts

{retrieved} = {D, E, F, L, M}, 5 texts

{relevant} ∩ {retrieved} = {D, E, F}, 3 texts

Precision = 3/5 = 60%

Recall = 3/10 = 30%
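The counts from Figure 2 can be checked with a few lines of Python. This is a minimal sketch; the function name precision_recall is hypothetical, and the two sets are the ones given in the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a set of relevant and a set of retrieved texts."""
    hits = relevant & retrieved  # texts that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = set("ABCDEFGHIJ")  # the 10 relevant texts
retrieved = set("DEFLM")      # the 5 retrieved texts
precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=60% recall=30%
```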

To build a search engine with a high precision rate, the text should first be converted into a structured representation. JSON (JavaScript Object Notation) is a lightweight data exchange format. It is based on a subset of ECMAScript and uses a completely language-independent text format to store and represent data. Its simple and clear hierarchy makes JSON an ideal data exchange language: it is easy for people to read and write, easy for machines to parse and generate, and it improves the efficiency of network transmission. The basic data structures are transformed into JSON, and then retrieval begins.

Retrieval and classifying

First, create inverted index lists to search for documents containing the keywords. With them, the retrieval system can quickly answer keyword queries.

The traditional method is based on the cosine measure, whose formula can be used to calculate the similarity between two documents.
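In its common form, the cosine measure of two document vectors is their dot product divided by the product of their lengths. The sketch below computes it on raw term counts. It is a minimal illustration under stated assumptions: the documents are already segmented into words, raw counts are used as weights (a TF-IDF weighting could be substituted), and the function name and sample word lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the term-count vectors of two word lists."""
    vec_a, vec_b = Counter(words_a), Counter(words_b)
    # Dot product over the terms of the first document (missing terms count as 0).
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical segmented documents, used only for illustration.
doc_a = ["social", "search", "engine"]
doc_b = ["search", "engine", "index"]
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.667
```

Two documents sharing two of their three words score 2/3, identical documents score 1, and documents with no words in common score 0.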

Second, we introduce a structure named the inverted index. An inverted index is an index structure that contains two hash-table index tables or two B+ tree index tables. We use it as the basis of text indexing and retrieval.

Because Chinese differs from English in having no clear separator between words, an inverted index list for Chinese has to be created differently from one for English. Figure 5 gives an example of an inverted index in Chinese. After segmentation, we obtain these lists, which solve the problem.
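A hash-table form of the inverted index over already segmented Chinese text can be sketched as follows. This is a minimal sketch: the segmented documents are hypothetical, the segmentation step is assumed to have been done beforehand, and the query is answered by intersecting posting sets, which is one simple choice rather than the ranking used by a search engine.

```python
from collections import defaultdict


def build_inverted_index(segmented_docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in segmented_docs.items():
        for word in words:
            index[word].add(doc_id)
    return index


def search(index, query_words):
    """Return the sorted ids of documents that contain every query word."""
    postings = [index.get(word, set()) for word in query_words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical documents after word segmentation.
segmented_docs = {
    1: ["社交", "网络", "信息"],
    2: ["信息", "检索"],
    3: ["社交", "信息", "检索"],
}
index = build_inverted_index(segmented_docs)
print(search(index, ["信息", "检索"]))  # [2, 3]
print(search(index, ["社交"]))          # [1, 3]
```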

Clustering filtering

Text clustering is a process of classifying text data into different data classes according to their different characteristics. The purpose is to make the distance between texts of the same category as small as possible, and the distance between texts of different categories as large as possible.

There are two main types of automatic document clustering.

Plane Dividing Method: For a sample set containing n samples, K partitions are constructed. Each partition represents a cluster.

Hierarchical Clustering Method: Hierarchical clustering decomposes the given sample set level by level. According to the direction of the hierarchical decomposition, it can be divided into agglomerative hierarchical clustering and divisive (split) hierarchical clustering.

The plane dividing method is fast, but the number of clusters has to be determined in advance, and seed selection is difficult.
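A plane dividing (partitioning) method can be illustrated with k-means, which is used here as one well-known example and is not named in the text. This is a minimal sketch under stated assumptions: texts are already represented as numeric feature vectors, the first K points are taken as seeds (a naive choice that shows why seed selection matters), and the sample points are hypothetical.

```python
def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (assignments, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive seeds: the first k points
    assignments = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        assignments = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids


# Hypothetical 2-D feature vectors forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1)]
assignments, centroids = k_means(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0]
```

The number of clusters K is passed in by the caller, which reflects the requirement that it be fixed in advance.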

Claims

We claim:

1. A method of searching and mining of social information on Internet based on Elasticsearch, in which:

The first step is data collection and data import;

The second step is to create the index.
2. The method of searching and mining of social information on Internet based on Elasticsearch according to claim 1, wherein the first step includes:

Using a crawler, which is the easiest way to collect data; the crawler is a back-end script that collects data from the Internet, which is the basis of data analysis and data mining; the crawler has the advantage that it can be reused; only one program needs to be designed to collect all the data.
3. The method of searching and mining of social information on Internet based on Elasticsearch according to claim 2, wherein there are three ways of data import:

the first way is to use a Java or Python program to format the data and then put it into the search engine; the second way is to use a plug-in; many search engines have their own plug-ins that support importing formatted data into the search engine;

the third way is to use a tool that lets data flow into the search engine automatically; such tools have many requirements, for example Logstash requires the file to be JSON, so the data should be formatted before the tool is used.