AU2019101463A4 - Method of searching and mining of social information on Internet based on Elasticsearch

Info

Publication number: AU2019101463A4
Authority: AU (Australia)
Prior art keywords: data, index, information, search, text
Legal status: Ceased (as listed by Google Patents; an assumption, not a legal conclusion)
Application number: AU2019101463A
Inventors: Songhao Li, Donghang Sui, Qixin You
Current Assignee: Individual (as listed; may be inaccurate)
Original Assignee: Individual
Priority date: 2019-11-27 (an assumption, not a legal conclusion)
Filing date: 2019-11-27
Publication date: 2020-01-23

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention lies in the field of information retrieval. It helps search and analyse social network information based on Elasticsearch. The invention consists of the following steps. First, a web crawler is used to acquire a sufficient amount of information from social media. Second, the IK analyser is used to separate each meaningful word or phrase. Third, the data is imported into Elasticsearch and an inverted index is created. By repeatedly adjusting the process of creating the inverted index, the model is tuned toward optimal performance. Finally, when keywords are entered, the system displays the relevant information in ranked order. In brief, this invention can automatically select information from a particular field that the user is interested in.
Figure 1 (text mining model): text source; text analysis (word segmentation, data processing, date processing, title recognition, POS tagging, text structure analyser); feature extraction (characteristic words and weights, keyword summary, specific information extraction); searching and classifying; clustering filtering; UI and presentation (search, browse); users.
Figure 2 (model quality evaluation): all texts; relevant texts; retrieved texts; retrieved and relevant texts.

Description

2019101463 27 Nov 2019
Title
Method of searching and mining of social information on Internet based on Elasticsearch
Field of The Invention
The invention relates to the field of information retrieval and uses Elasticsearch to search and analyse Internet social data.
Background
With the arrival of the big data era, the amount of information on the Internet keeps expanding. How Internet users can search accurately and obtain the information they actually need has become an urgent problem and an important research direction for the Internet industry. Elasticsearch (ES) is a real-time distributed search and analysis engine, an open-source full-text search service based on Lucene, and a popular enterprise search engine. ES uses Lucene as its internal engine, so a full-text search only requires calling its unified API. In addition, ES provides distributed document storage and indexes every field so that each one can be searched.
SUMMARY
The first step is data collection and data import.
Using a crawler is the easiest way to collect data. A crawler, also known as a web crawler, is a back-end script that collects data from the Internet; it is the basis of data analysis and data mining. A crawler has one key advantage: it can be reused. Once one program has been designed, it can be rerun to collect all the data needed.
For data import, there are three options.
The first is to use a Java or Python program to format the data and then load it into the search engine.
The second is to use a plug-in. Many search engines ship their own plug-ins that support importing formatted data into the search engine.
The third is to use a dedicated tool. Such tools let data flow into the search engine automatically, but they impose requirements on the input. For example, Logstash as used here requires the input file to be JSON, so the data must be formatted before the tool is used.
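The first import option (format the data with a program, then load it) can be sketched as follows. This is a minimal illustration, not the inventors' code: it turns crawled records into the newline-delimited JSON layout accepted by the Elasticsearch bulk API, where each document is preceded by an action line. The index name `social_posts`, the record fields, and the helper name `to_bulk_ndjson` are assumptions made for the example.

```python
import json

def to_bulk_ndjson(records, index_name):
    """Format records as a bulk-API payload: one action line, then one document line."""
    lines = []
    for doc_id, record in enumerate(records, start=1):
        # Action line: tells the search engine where to index the next document.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        # Source line: the document itself; ensure_ascii=False keeps Chinese text readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    # A bulk payload must end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical crawled posts (field names are assumptions).
posts = [
    {"user": "alice", "text": "Elasticsearch is built on Lucene"},
    {"user": "bob", "text": "搜索引擎 倒排索引"},
]
payload = to_bulk_ndjson(posts, "social_posts")
print(payload)
```

The resulting string could then be sent to the search engine by whichever client or tool is in use.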
The second step is to create the index.
The creation of the index library is the core technology of a search engine. Given the huge amount of Internet information obtained by the crawler, quickly finding every web page that contains the user's query words is the job of the indexer. The main function of the indexing stage is to transform the mass of crawled information into a structured form that is convenient to retrieve, and to store it in the index library. Each web page on the Internet can be treated as a document. To return results quickly and improve the user's experience, one strategy is to build a word-to-document structure in the index library, which is the inverted index. With an inverted index, we only need to compare the search term with the index terms to obtain the list of documents containing that term, so the required documents can be found quickly from the user's search term. Several common methods of generating the index library are described below.
The first is the two-pass document traversal method. As the name suggests, this strategy scans the documents twice. The first scan collects some relevant information and does not yet generate index entries; its main function is resource preparation. During the second pass, the inverted index for each document is generated. Because the document set must be scanned twice, this method is inefficient and is not commonly used.
The second is the sorting method. The two-pass traversal method builds the whole index in memory, which consumes a lot of memory and may fail when memory is insufficient. The sorting method improves on this by always allocating a fixed amount of memory. When the allocated memory is used up, the intermediate result is written to disk and the memory is cleared for the next round of indexing. Because its memory requirement is fixed, this strategy can index a document collection of any size.
The last is the merging method. The sorting method writes only the intermediate results to disk, while the dictionary stays in memory. As more documents are indexed, the dictionary grows and occupies more and more memory, leaving too little for the intermediate results. The merging method improves on this: both the intermediate results and the dictionary are written to disk, so that all of the fixed memory is available to the subsequent indexing process.
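The sorting method described above can be illustrated with a small sketch. It assumes a fixed budget of postings held in memory; each time the budget is reached, the sorted block is set aside (a Python list stands in for a file on disk), and the blocks are merged at the end into posting lists. The function names and the toy documents are hypothetical, and the dictionary handling of the merging method is not modelled.

```python
import heapq
from itertools import groupby

def build_runs(docs, max_postings_in_memory):
    """Collect (term, doc_id) pairs in a fixed-size buffer and emit sorted runs."""
    runs, buffer = [], []
    for doc_id, tokens in docs:
        for term in set(tokens):          # one posting per term per document
            buffer.append((term, doc_id))
            if len(buffer) >= max_postings_in_memory:
                runs.append(sorted(buffer))   # stands in for writing a block to disk
                buffer = []
    if buffer:
        runs.append(sorted(buffer))
    return runs

def merge_runs(runs):
    """Merge the sorted runs into an inverted index: term -> sorted list of doc ids."""
    index = {}
    for term, group in groupby(heapq.merge(*runs), key=lambda pair: pair[0]):
        index[term] = [doc_id for _, doc_id in group]
    return index

docs = [
    (1, ["search", "engine"]),
    (2, ["search", "index"]),
    (3, ["index", "engine", "search"]),
]
runs = build_runs(docs, max_postings_in_memory=3)
inverted_index = merge_runs(runs)
print(inverted_index["search"])   # [1, 2, 3]
```

Because each run is sorted before it is set aside, the final merge only needs to read the runs sequentially, which is what keeps the memory requirement fixed.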
Once the index can be created, it should be made more efficient, because optimizing the index directly improves system performance. The index is optimized here in two respects: the process of indexing data and the process of retrieving data.
(1) Index data process optimization. When importing data, we build indexes according to the characteristics of the data. Improper settings make indexing very slow, so the indexing principle of the search engine must be understood in order to tune it. Many search engines are distributed, so data is spread across nodes while the index is built, and the transaction log (translog) is used to keep the data on the nodes consistent.
The settings used are index.translog.flush_threshold_ops: 100000, index.refresh_interval: -1, and number_of_replicas: 0.
The first parameter sets the flush trigger: when the number of operations recorded in the translog reaches this value, a flush is performed. The system default given here is 5000. Flushing takes considerable time and hardware resources, so this value can be raised, or automatic flushing can be turned off and a manual flush performed when actually needed.
The second parameter is the refresh interval, that is, how often the index is refreshed on a schedule. The default value is 1s. Newly imported data cannot be retrieved until the index has been refreshed, so the index must be refreshed regularly to keep retrieval close to real time. When massive data is first imported, periodic refresh can be turned off temporarily, followed by a manual refresh once the index is built; afterwards the interval is set according to actual demand.
The third parameter is the number of index replicas. To ensure the reliability of data storage, the search engine keeps copies of the data. During index building, data is synchronized to the replicas immediately, which costs time. Therefore, when the system is indexing large amounts of data, the number of replicas can be set to 0. After indexing is complete, the replicas can be restored and their number set according to the actual situation. Through the above three steps, the efficiency of index building is improved and the indexing time reduced.
(2) Data retrieval process optimization. The speed of retrieval is closely related to the quality of the index. The index quality of a search engine is mainly related to the number of shards, replicas, and index segments. The number of shards directly affects the retrieval speed. Too few shards leave too much index data in a single shard; too many shards mean that more index files must be opened during retrieval and that communication between shards increases. A reasonable number of shards is therefore needed to improve retrieval performance: the number of shards should be set to the total data volume divided by the data volume held by a single shard. The number of replicas directly affects the stability of the index. If a shard is lost, its data can be recovered immediately from a replica, which ensures the integrity of the data. However, too many replicas lead to poor retrieval performance, so retrieval efficiency and data security must be balanced.
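As a rough sketch of the tuning described above, the settings can be expressed as plain Python dictionaries that would be passed to whatever client updates the index settings. The refresh interval and replica count are standard Elasticsearch index settings; the translog operation threshold is taken from this description and is an assumption here, since it may not be accepted by the Elasticsearch release in use. The helper names, the restore values, and the per-shard size are hypothetical.

```python
import math

# Settings applied before a large import (values taken from the description).
BULK_IMPORT_SETTINGS = {
    "index": {
        "refresh_interval": "-1",      # turn off periodic refresh during the import
        "number_of_replicas": 0,       # no replica synchronization while loading
        # From the description; assumption: may not exist in the release in use.
        "translog.flush_threshold_ops": 100000,
    }
}

def restore_settings(refresh_interval="1s", replicas=1):
    """Settings to apply once the import has finished (values are illustrative)."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}

def estimate_shard_count(total_gb, target_shard_gb):
    """Total data volume divided by the volume one shard should hold, rounded up."""
    return max(1, math.ceil(total_gb / target_shard_gb))

shards = estimate_shard_count(total_gb=200, target_shard_gb=30)
print(shards)   # 7
```

The division mirrors the rule stated in the text; a suitable per-shard size depends on the hardware and the query load, so the 30 GB figure is only a placeholder.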
DESCRIPTION OF DRAWINGS
Figure 1 is the structure of our text mining model.
Figure 2 is an example of model quality evaluation.
Figure 3 shows the differences between forward index and inverted index.
Figure 4 is the structure of forward index and inverted index.
Figure 5 is the inverted index according to the searching result in Chinese.
DESCRIPTION OF PREFERRED EMBODIMENT
Data analysis and retrieval
Text mining model
The search engine itself supports the vector space model, TF-IDF, scrolling, similarity scoring, and similar functions for text. Therefore, if collaborative-filtering data can be transformed into a vector space, building a recommendation engine on top of the search engine becomes an interesting possibility. Figure 1 shows the structure of our text mining model.
Our text mining model can be used in content-based recommendation systems, automatic news aggregation, releasing internet public opinion information, automatic question answering, machine translation, spam filtering and so forth.
Text analysis and extraction is achieved in four steps:
(1) Text analysis
(2) Feature extraction
(3) Searching and classifying
(4) Clustering filtering
After these steps we obtain the UI and presentation of the search results. Users can then search and browse, and the system feeds the results back to them.
Text Analysis
Word segmentation is the first part of text analysis. Data processing, date processing, title recognition, and POS tagging all contribute to word segmentation. However, traditional methods have difficulty segmenting Chinese text into words. Unlike English, Chinese characters can carry different meanings and emotions depending on context. Besides, different word orders may express the same meaning, even without standard syntax.
Because Chinese, unlike English, has no explicit separator between words, the first step is to use a word segmentation system to segment documents automatically into word sequences. Each document is thereby transformed into a data stream composed of a sequence of words. To analyse Chinese semantic phrases, sentences must be divided into words or N-grams. There are several methods of word segmentation.
Maximum Matching (MM) method: Take a string of 6-8 Chinese characters as the maximum string and match it against the entries in the dictionary. If no match is found, drop one character and continue matching until a corresponding word is found in the dictionary. The matching direction is left to right.
Reverse Maximum Matching (RMM) method: The matching direction is opposite to that of the MM method, from right to left. Experimental results show that for Chinese, the reverse maximum matching method is more effective than the maximum matching method.
Bi-direction Matching method: Compare the word segmentation results of the MM method and the RMM method to determine the correct segmentation.
Optimum Matching method: The words in the dictionary are ordered by their frequency in the text, with high-frequency words placed first and low-frequency words after them, so as to improve the matching speed.
All in all, segmenting Chinese text into words is dictionary-based: find the matching string of Chinese characters, then assemble the Chinese words that represent the intended meaning. According to our research, the Optimum Matching method is the most efficient, so we choose this method and use it to create the inverted index lists.
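The dictionary-based matching methods above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the IK analyser's implementation: the toy dictionary is hypothetical, the maximum string length is set to 6, unmatched single characters are emitted as their own tokens, and the rule used to choose between the two results (fewer tokens, then fewer single-character tokens, otherwise the reverse result) is one common heuristic rather than something stated in the text.

```python
def forward_max_match(text, dictionary, max_len=6):
    """MM: scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:   # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens

def reverse_max_match(text, dictionary, max_len=6):
    """RMM: scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens

def bidirectional_match(text, dictionary, max_len=6):
    """Compare MM and RMM; prefer fewer tokens, then fewer single characters."""
    fwd = forward_max_match(text, dictionary, max_len)
    rev = reverse_max_match(text, dictionary, max_len)
    cost = lambda toks: (len(toks), sum(1 for t in toks if len(t) == 1))
    return fwd if cost(fwd) < cost(rev) else rev

toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}   # hypothetical entries
sentence = "研究生命的起源"
print(forward_max_match(sentence, toy_dictionary))   # ['研究生', '命', '的', '起源']
print(reverse_max_match(sentence, toy_dictionary))   # ['研究', '生命', '的', '起源']
print(bidirectional_match(sentence, toy_dictionary)) # ['研究', '生命', '的', '起源']
```

In this example the forward scan greedily takes "研究生" and strands the character "命", while the reverse scan recovers "生命", which illustrates why the reverse method is reported to work better for Chinese.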
Feature extraction
The typical problem in information retrieval research is to locate the relevant documents in a collection based on a user query (keywords describing the required information).
There are two measures for judging the quality of a search engine.
Precision: The precision rate is the percentage of retrieved documents that are relevant, which measures the accuracy of the retrieval system.
Recall: The recall rate is the percentage of all relevant documents that are retrieved, which measures the comprehensiveness of the retrieval system.
Nowadays, precision is often used as the standard for judging whether a search engine is good. The reason is that a word that occurs very frequently is not necessarily useful for distinguishing between texts.
Figure 2 shows an example of model quality evaluation, from which we obtain:
{relevant} = {A, B, C, D, E, F, G, H, I, J}, |{relevant}| = 10
{retrieved} = {D, E, F, L, M}, |{retrieved}| = 5
{relevant} ∩ {retrieved} = {D, E, F}, |{relevant} ∩ {retrieved}| = 3
Precision = 3/5 = 60%
Recall = 3/10 = 30%
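The Figure 2 calculation can be reproduced with set operations. The following is a small sketch using the document labels from the example; the function name is hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document identifiers."""
    hits = relevant & retrieved                       # retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = set("ABCDEFGHIJ")          # 10 relevant texts
retrieved = {"D", "E", "F", "L", "M"} # 5 retrieved texts
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)   # 0.6 0.3
```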
To build a search engine with a high precision rate, the text should first be put into a structured, symbolic form. JSON (JavaScript Object Notation) is a lightweight data exchange format. It is based on a subset of ECMAScript and uses a completely language-independent text format to store and represent data. Its simple and clear hierarchy makes JSON an ideal data exchange language: it is easy for people to read and write, easy for machines to parse and generate, and it improves the efficiency of network transmission. Once the basic data structure has been transformed into JSON, retrieval can begin.
Retrieval and classifying
First, create inverted index lists to search for documents containing the keywords, so that the retrieval system can answer keyword queries quickly.
The traditional method is based on the cosine measure, which can be used to calculate the similarity between two documents.
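The description does not reproduce the formula, so the sketch below uses the standard cosine measure, cos(a, b) = (a · b) / (|a| |b|), applied to term-frequency vectors. Using raw term counts as the vector weights is an assumption made for this example; weights such as TF-IDF could be substituted.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    vec_a, vec_b = Counter(words_a), Counter(words_b)
    # Dot product over the terms the two documents share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

similarity = cosine_similarity(["search", "engine"], ["search", "index"])
print(round(similarity, 2))   # 0.5
```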
Second, we introduce a structure called the inverted index. An inverted index is an index structure made up of two hash-table index tables or two B+ tree index tables. We use it as the basis of text indexing and retrieval.
Because Chinese, unlike English, has no explicit separator between words, an inverted index list for Chinese must be created differently from one for English. Figure 5 gives an example of an inverted index in Chinese. After segmentation, these lists can be built to solve the problem.
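A minimal sketch of an inverted index over already-segmented Chinese documents is shown here. The documents, their segmentation, and the function names are hypothetical, and a Python dictionary plays the role of the hash-table index mentioned above.

```python
from collections import defaultdict

def build_inverted_index(segmented_docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in segmented_docs.items():
        for word in words:
            index[word].add(doc_id)
    return index

def search(index, query_words):
    """Return the ids of documents containing every query word."""
    if not query_words:
        return []
    postings = [index.get(word, set()) for word in query_words]
    return sorted(set.intersection(*postings))

# Hypothetical documents after word segmentation.
segmented_docs = {
    1: ["搜索", "引擎", "技术"],
    2: ["倒排", "索引", "技术"],
    3: ["搜索", "倒排", "索引"],
}
index = build_inverted_index(segmented_docs)
print(search(index, ["倒排", "索引"]))   # [2, 3]
print(search(index, ["搜索", "技术"]))   # [1]
```

For English text, splitting on whitespace would produce the word lists directly; for Chinese, the quality of the index depends on the segmentation step that precedes it.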
Clustering filtering
Text clustering is the process of grouping text data into different classes according to their characteristics. The purpose is to make the distance between texts of the same category as small as possible, and the distance between texts of different categories as large as possible.
There are two main types of automatic document clustering.
Plane Dividing (partitioning) Method: For a sample set containing n samples, K partitions are constructed. Each partition represents a cluster.
Hierarchical Clustering Method: Hierarchical clustering decomposes the given sample set level by level. According to the direction of the decomposition, it can be divided into agglomerative hierarchical clustering and divisive hierarchical clustering.
The plane dividing method is fast, but the number of clusters must be decided in advance and the result depends on the initial seeds, so seed selection is difficult.
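To illustrate the plane dividing method, the sketch below uses k-means, one well-known partitioning algorithm; the description does not name a specific algorithm, so this choice is an assumption. The points stand in for document feature vectors, and the seeds are naively taken as the first K points, which shows why both K and the seeds have to be chosen up front.

```python
def kmeans(points, k, iterations=10):
    """Partition points into k clusters; returns the cluster index of each point."""
    centroids = [list(p) for p in points[:k]]   # naive seed choice: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

# Hypothetical two-dimensional feature vectors for four documents.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = kmeans(points, k=2)
print(labels)   # [0, 0, 1, 1]
```

With both seeds drawn from the same natural group, the first iteration splits that group, and only the later update steps correct it; a poorer seed choice on harder data may not recover.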

Claims (3)

  We claim:
  1. A method of searching and mining of social information on Internet based on Elasticsearch, wherein:
    the first step is data collection and data import;
    the second step is to create an index.
  2. The method of searching and mining of social information on Internet based on Elasticsearch according to claim 1, wherein said first step includes:
    using a crawler, which is the easiest way to collect data; the crawler is a back-end script for collecting data from the Internet, which is the basis of data analysis and data mining; the crawler has the advantage that it can be reused; only one program needs to be designed to obtain all the data.
  3. The method of searching and mining of social information on Internet based on Elasticsearch according to claim 2, wherein said data import is performed in one of three ways:
    the first way is using a Java or Python program to format the data and then put it into the search engine; the second way is using a plug-in; many search engines have their own plug-ins which support importing formatted data into the search engine;
    the third way is using a tool which enables data to flow into the search engine automatically; these tools have many requirements, for example Logstash requires the file to be JSON; the data should be formatted before the tool is used.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101463A AU2019101463A4 (en) 2019-11-27 2019-11-27 Method of searching and mining of social information on Internet based on Elasticsearch

Publications (1)

Publication Number Publication Date
AU2019101463A4 true AU2019101463A4 (en) 2020-01-23

Family

ID=69166824

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101463A Ceased AU2019101463A4 (en) 2019-11-27 2019-11-27 Method of searching and mining of social information on Internet based on Elasticsearch

Country Status (1)

Country Link
AU (1) AU2019101463A4 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131449A (en) * 2020-09-21 2020-12-25 西北大学 Implementation method of cultural resource cascade query interface based on elastic search

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry