CN110825792A

CN110825792A - High-concurrency distributed data retrieval method based on golang middleware coroutine mode

Info

Publication number: CN110825792A
Application number: CN201911117727.7A
Authority: CN
Inventors: 苏学武; 杨刚; 赖冠; 龚波
Original assignee: ZHUHAI XINDEHUI INFORMATION TECHNOLOGY Co Ltd
Current assignee: ZHUHAI XINDEHUI INFORMATION TECHNOLOGY Co Ltd
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2020-02-21
Anticipated expiration: 2039-11-15
Also published as: CN110825792B

Abstract

The invention discloses a high-concurrency distributed data retrieval method based on a golang middleware coroutine mode, comprising the following specific steps: uploading the data information to be collected through the provided user interaction page, following the page's operation guide; preprocessing the uploaded data information and automatically adding the Elasticsearch index, type and type structure; adding a timed collection task for the data collection configuration on the timer configuration page; having the system automatically schedule the data collection task, which is executed through the golang middleware, and storing the data in the Elasticsearch library; performing word segmentation and semantic analysis on the collected data using Elasticsearch word segmentation and semantic analysis technology; and opening the data set to terminal users through the interface configuration page. The method introduces golang's high-concurrency coroutine technology, which speeds up data collection and data organization to a certain extent and improves collection efficiency; it also automatically removes duplicate data, which improves the data utilization rate.

Description

High-concurrency distributed data retrieval method based on golang middleware coroutine mode

Technical Field

The invention relates to the technical field of database retrieval, in particular to a high-concurrency distributed data retrieval method based on a golang middleware coroutine mode, which is used for constructing resource synchronization of a public security database.

Background

With the rapid development of the economy and of science and technology in recent years, informatization in the public security industry has also developed rapidly, but it has been accompanied by problems such as low data quality, poor processing capability, insufficient standardization, insufficient shared application and shallow professional application. How to use technology to meet the challenges posed by the growing volume and heterogeneity of data resources and by diverse, complex application requirements is the key to informatization. At present, however, each vendor of full-text search products is responsible only for its own product and adopts its own technical implementation. Because there is no unified technical approach, data extraction and external interface schemes are inefficient, interfaces are not universal, and later maintenance is not timely. In view of these problems, the applicant compared and analyzed the mainstream full-text search products on the market and found that most of them, and the technologies they use, have the following problems:

1. Retrieval functions: 1) the word hit rate is low and category retrieval is limited; 2) word-segmentation retrieval is lacking; 3) information is retrieved far more slowly than network resources grow.

2. Data cleaning and data governance: 1) data extraction is disorganized; 2) the data source is single, and the data storage method is complex, slow and not universal; 3) a unified technical approach is lacking, so external interface schemes are inefficient and interfaces are not universal.

3. Other aspects: 1) compatibility is insufficient, and the products only suit Internet-peripheral product forms; 2) the products place high demands on technical skill, are cumbersome to operate, and cannot adapt well to diverse application scenarios; 3) later maintenance and data updates are not timely, data-flow logs are lacking, and tuning places high demands on hardware.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a high-concurrency distributed data retrieval method based on a golang middleware coroutine mode. A set of simple, easy-to-use web configuration pages is designed and developed to solve the problems of a single extraction data source, complex interactive interface design, a complex and slow data storage method and difficult data storage. This effectively improves data collection efficiency and data application efficiency, ensures later maintenance, and prepares the conditions for strengthening law enforcement standards and improving law enforcement efficiency.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows.

The high-concurrency distributed data retrieval method based on the golang middleware coroutine mode comprises the following specific steps of:

A. uploading the data information to be collected through the provided user interaction page, following the page's operation guide, and then uploading the configuration information for data collection;

B. preprocessing the data information uploaded in step A to form corresponding data structure rules, and automatically adding the Elasticsearch index, type and type structure;

C. after the data collection environment of step B has been set up, adding, on the timer configuration page, a timed collection task for the designed data collection configuration;

D. based on the timed task configured in step C, having the system automatically schedule the data collection task, which is executed through the golang middleware, and storing the data in the Elasticsearch library;

E. when data enters the Elasticsearch library, performing word segmentation and semantic analysis on the collected data using the word segmentation and semantic analysis technology of Elasticsearch, to obtain the final data set to be stored;

F. opening the data set to terminal users through the interface configuration page.

As a further refinement of the technical scheme, the data information comprises text data and text data configuration information.

As a further refinement of the technical scheme, the data information comprises configured database connection information and table information.

As a further refinement of the technical scheme, in step B, the rules so formed serve as the basis for page rendering, data organization and data storage.

As a further refinement of the technical scheme, in step B, the Elasticsearch index, type and type structure are added automatically by adding a text directory, or a database and a table, and having the system's background automation program apply a set of configured data structure mappings.

As a further refinement of the technical scheme, in step C, the data collection environment is set up by combining automation with manually entered configuration.

As a further refinement of the technical scheme, step D comprises the following specific steps:

D1. landing the data to be stored into a local file on the server through golang code, storing the mapping between input source and output source in the program, and saving the related logs;

D2. comparing the landed file with the data in the mapped Elasticsearch index, filtering out illegal data, and screening out the data that needs to be stored and holding it in memory;

D3. importing the filtered data into the mapped Elasticsearch index through highly concurrent coroutines.
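Step D1 above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the JSON Lines file format, the `Row` type, the `landRows` helper and the table and index names are all hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"os"
	"path/filepath"
)

// Row is one source record; keys are column names (hypothetical shape).
type Row map[string]interface{}

// landRows writes rows to path as JSON Lines (one JSON object per line)
// and returns the number of rows written.
func landRows(path string, rows []Row) (int, error) {
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	enc := json.NewEncoder(w) // Encode appends a newline after each value
	n := 0
	for _, r := range rows {
		if err := enc.Encode(r); err != nil {
			return n, err
		}
		n++
	}
	return n, w.Flush()
}

func main() {
	// Assumed mapping from input source (table) to output target (index).
	sourceToIndex := map[string]string{"case_records": "case_records_idx"}

	dir, err := os.MkdirTemp("", "landing")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	rows := []Row{{"id": 1, "title": "first"}, {"id": 2, "title": "second"}}
	for table, index := range sourceToIndex {
		path := filepath.Join(dir, table+".jsonl")
		n, err := landRows(path, rows)
		if err != nil {
			log.Fatal(err)
		}
		// The log line records the source, the landed file and the target.
		log.Printf("landed %d rows from %s to %s (target index %s)", n, table, path, index)
	}
}
```

Landing to a file first means the comparison and import stages can be rerun without querying the source database again.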

As a further refinement of the technical scheme, in step D2, the data comparison mainly uses the kNN algorithm to classify and screen the data.

As a further refinement of the technical scheme, in step E, the word segmentation and semantic analysis technology mainly uses the jieba word segmenter, which performs word segmentation with the following algorithm:

E1. performing efficient word-graph scanning based on a prefix dictionary, and generating a directed acyclic graph (DAG) of all possible words that the Chinese characters in a sentence can form;

E2. using dynamic programming to search for the maximum-probability path and find the best segmentation based on word frequency;

E3. for unknown words, using an HMM model based on the word-forming capability of Chinese characters together with the Viterbi algorithm, with a large amount of real data, to convert pinyin into Chinese characters and to segment the characters into words.
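Steps E1 and E2 above can be sketched in Go as follows. This is a toy illustration of the prefix-dictionary DAG and the dynamic-programming search, assuming a five-word dictionary with made-up frequencies; the function names are hypothetical, and this is not the jieba source code.

```go
package main

import (
	"fmt"
	"math"
)

// buildPrefixDict copies freq and adds every proper prefix of every word
// with frequency 0. It also returns the total frequency.
func buildPrefixDict(freq map[string]int) (map[string]int, int) {
	dict := make(map[string]int, len(freq))
	total := 0
	for w, f := range freq {
		dict[w] = f
		total += f
	}
	for w := range freq {
		r := []rune(w)
		for i := 1; i < len(r); i++ {
			if p := string(r[:i]); dict[p] == 0 {
				dict[p] = 0 // mark as a known prefix
			}
		}
	}
	return dict, total
}

// buildDAG maps each start index i to the inclusive end indices j for which
// s[i..j] is a dictionary word. A lone character is the fallback.
func buildDAG(s []rune, dict map[string]int) [][]int {
	dag := make([][]int, len(s))
	for i := range s {
		for j := i; j < len(s); j++ {
			f, ok := dict[string(s[i:j+1])]
			if !ok {
				break // no dictionary word starts with this fragment
			}
			if f > 0 {
				dag[i] = append(dag[i], j)
			}
		}
		if len(dag[i]) == 0 {
			dag[i] = []int{i}
		}
	}
	return dag
}

// segment picks the maximum-probability path through the DAG, working
// from the end of the sentence back to the start.
func segment(sentence string, freq map[string]int) []string {
	dict, total := buildPrefixDict(freq)
	s := []rune(sentence)
	n := len(s)
	dag := buildDAG(s, dict)
	logTotal := math.Log(float64(total))

	score := make([]float64, n+1) // score[i]: best log-probability of s[i:]
	next := make([]int, n)        // next[i]: end index of the word chosen at i
	for i := n - 1; i >= 0; i-- {
		score[i] = math.Inf(-1)
		for _, j := range dag[i] {
			f := dict[string(s[i:j+1])]
			if f == 0 {
				f = 1 // unseen single character gets the smallest count
			}
			v := math.Log(float64(f)) - logTotal + score[j+1]
			if v > score[i] {
				score[i], next[i] = v, j
			}
		}
	}

	var words []string
	for i := 0; i < n; i = next[i] + 1 {
		words = append(words, string(s[i:next[i]+1]))
	}
	return words
}

func main() {
	// Made-up frequencies, for illustration only.
	freq := map[string]int{"北京": 50, "大学": 40, "北京大学": 30, "学生": 20, "生": 10}
	fmt.Println(segment("北京大学生", freq)) // [北京大学 生]
}
```

With these frequencies the path 北京大学 / 生 scores higher than 北京 / 大学 / 生, because a path with fewer words multiplies fewer probabilities.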

Due to the adoption of the technical scheme, the technical progress of the invention is as follows.

The invention imports various kinds of data into the database in a visual manner and forms corresponding rules; the defined rules serve as the basis for page rendering, data organization and data storage. Because the collection method is configuration-driven, business personnel can extend it horizontally through the provided configuration functions without involving developers, which meets the collection requirements of various data sources while reducing developers' workload and code coupling to a certain extent. The method effectively solves the problems of a single extraction data source, complex interactive interface design, a complex and slow data storage method and difficult data storage; it effectively improves data collection efficiency and data application efficiency, ensures timely later maintenance, and prepares the conditions for strengthening law enforcement standards and improving law enforcement efficiency.

The invention combines configuration-driven design with relational database applications to collect various kinds of data from heterogeneous data sources and to ensure the robustness of the collection method.

The method introduces golang's high-concurrency coroutine technology, which speeds up data collection and data organization to a certain extent and improves collection efficiency; it also automatically removes duplicate data, which improves the data utilization rate.

The invention uses word segmentation and semantic analysis to analyze the extracted text in depth, so as to extract element information of higher data value, provide data support for more upper-layer applications, make full use of the data, and assist both the automatic and the manual data organization of the collection method.

Drawings

FIG. 1 is a general flow diagram of the present invention.

Detailed Description

The invention will be described in further detail below with reference to the figures and specific examples.

The high-concurrency distributed data retrieval method based on a golang middleware coroutine mode is developed around the characteristics of golang: it exploits the high concurrency and coroutine advantages of the language to process data and import it into ES (Elasticsearch), so that full-text retrieval can be achieved. High concurrency is one of the factors that must be considered in the architecture design of an Internet distributed system, and generally means that the system is designed to handle many requests simultaneously. A coroutine needs only about 2 KB of memory to run, so thousands of concurrent tasks can run at the same time with a small memory footprint. More than 40,000 records can be processed per second, and a large-scale cluster can process more than 100,000 records per second.
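The coroutine-based concurrency described above can be sketched in Go as a small worker pool. This is a minimal sketch, assuming a hypothetical `Record` type and a placeholder `normalize` step; the patent does not disclose the actual processing code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Record is a hypothetical unit of collected data.
type Record struct {
	ID   int
	Text string
}

// processAll fans records out to a fixed number of goroutines and collects
// the processed results. The order of the results is not guaranteed.
func processAll(records []Record, workers int, process func(Record) Record) []Record {
	jobs := make(chan Record)
	results := make(chan Record)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				results <- process(r)
			}
		}()
	}

	// Feed the jobs, then close results once every worker has finished.
	go func() {
		for _, r := range records {
			jobs <- r
		}
		close(jobs)
	}()
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make([]Record, 0, len(records))
	for r := range results {
		out = append(out, r)
	}
	return out
}

// normalize stands in for the real per-record processing.
func normalize(r Record) Record {
	r.Text = strings.TrimSpace(r.Text)
	return r
}

func main() {
	records := make([]Record, 1000)
	for i := range records {
		records[i] = Record{ID: i, Text: "  sample text  "}
	}
	out := processAll(records, 8, normalize)
	fmt.Println("processed:", len(out))
}
```

A fixed pool bounds the number of simultaneous writes to the target store, while starting one goroutine per record would also be cheap because of the small initial stack.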

The high-concurrency distributed data retrieval method based on the golang middleware coroutine mode, described with reference to FIG. 1, comprises the following specific steps:

A. Upload the data information to be collected through the provided user interaction page, following the page's operation guide, and then upload the configuration information for data collection. The data information comprises text data and text data configuration information, or it comprises configured database connection information and table information.

B. Preprocess the data information uploaded in step A to form corresponding data structure rules, and automatically add the Elasticsearch index, type and type structure. A data structure is the way a computer stores and organizes data; it refers to a collection of data elements that have one or more specific relationships to each other.

The Elasticsearch index, type and type structure are added automatically by adding a text directory, or a database and a table, and having the system's background automation program apply a set of configured data structure mappings. The database types supported in this step are Oracle, MySQL and PostgreSQL.
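As an illustration of deriving an index structure from a configured table, the Go sketch below builds an index-creation body from column definitions. The `Column` type, the SQL-to-Elasticsearch type table and the function names are assumptions made for this example. The body uses the typeless `mappings.properties` layout of Elasticsearch 7 and later, whereas the patent text refers to the older index/type model.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Column describes one column of a source table (hypothetical config shape).
type Column struct {
	Name    string
	SQLType string // e.g. "varchar(64)", "int", "timestamp"
}

// esFieldType maps a SQL column type to an Elasticsearch field type.
// The rules below are illustrative assumptions, not the patent's rule set.
func esFieldType(sqlType string) string {
	t := strings.ToLower(sqlType)
	switch {
	case strings.HasPrefix(t, "bigint"), strings.HasPrefix(t, "int"), strings.HasPrefix(t, "number"):
		return "long"
	case strings.HasPrefix(t, "timestamp"), strings.HasPrefix(t, "date"):
		return "date"
	case strings.HasPrefix(t, "text"), strings.HasPrefix(t, "clob"):
		return "text"
	default:
		return "keyword"
	}
}

// buildMapping returns a JSON body for an index-creation request.
func buildMapping(cols []Column) ([]byte, error) {
	props := map[string]interface{}{}
	for _, c := range cols {
		props[c.Name] = map[string]string{"type": esFieldType(c.SQLType)}
	}
	body := map[string]interface{}{
		"mappings": map[string]interface{}{"properties": props},
	}
	return json.MarshalIndent(body, "", "  ")
}

func main() {
	cols := []Column{
		{Name: "id", SQLType: "bigint"},
		{Name: "title", SQLType: "varchar(200)"},
		{Name: "content", SQLType: "text"},
		{Name: "created_at", SQLType: "timestamp"},
	}
	body, err := buildMapping(cols)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```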

C. Set up the data collection environment of step B by combining automation with manually entered configuration, and then, on the timer configuration page, add a timed collection task for the designed data collection configuration.

Automation means that the system automatically builds the data structure and automatically synchronizes data. Manually entered configuration means manually configuring the data source and the data scheduling.
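A timed collection task of this kind could be driven by a ticker in Go. The sketch below is an assumption-laden illustration: `runScheduled` is a hypothetical helper, and the run limit exists only to keep the example finite, whereas a real scheduler would run until cancelled.

```go
package main

import (
	"fmt"
	"time"
)

// runScheduled calls task once per interval until it has run `runs` times.
func runScheduled(interval time.Duration, runs int, task func(run int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 1; i <= runs; i++ {
		<-ticker.C // wait for the next tick
		task(i)
	}
}

func main() {
	runScheduled(100*time.Millisecond, 3, func(run int) {
		fmt.Printf("collection run %d at %s\n", run, time.Now().Format(time.RFC3339Nano))
	})
}
```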

D. Based on the timed task configured in step C, the system automatically schedules the data collection task, which is executed through the golang middleware, and stores and merges the data into the Elasticsearch library.

The step D comprises the following specific steps:

D1. Land the data to be stored into a local file on the server through golang code, store the mapping between input source and output source in the program, and save the related logs.

D2. Compare the landed file with the data in the mapped Elasticsearch index, filter out illegal data, and screen out the data that needs to be stored and hold it in memory.

Illegal data refers to data whose format is abnormal or whose values fall outside the configured normal range.
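The filtering of illegal data can be sketched in Go as a format check plus a range check. The `Entry` fields, the date layout and the age bounds are assumptions chosen for illustration; the patent does not specify the actual validation rules.

```go
package main

import (
	"fmt"
	"time"
)

// Entry is a hypothetical record shape used only for illustration.
type Entry struct {
	Name      string
	Age       int
	BirthDate string // expected layout: 2006-01-02
}

// isLegal reports whether an entry is well formed and within range.
func isLegal(e Entry, minAge, maxAge int) bool {
	if e.Name == "" {
		return false
	}
	if _, err := time.Parse("2006-01-02", e.BirthDate); err != nil {
		return false // abnormal format
	}
	return e.Age >= minAge && e.Age <= maxAge // value within the set range
}

// filterLegal returns only the entries that pass isLegal.
func filterLegal(in []Entry, minAge, maxAge int) []Entry {
	out := make([]Entry, 0, len(in))
	for _, e := range in {
		if isLegal(e, minAge, maxAge) {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	entries := []Entry{
		{"Zhang", 34, "1990-05-01"},
		{"Li", 240, "1985-11-23"},  // age outside the range
		{"Wang", 28, "1996/02/30"}, // malformed date
	}
	for _, e := range filterLegal(entries, 0, 150) {
		fmt.Println(e.Name) // prints only "Zhang"
	}
}
```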

Data comparison means that a golang middleware coroutine queries the database data set and compares it with the Elasticsearch data set using the configured keywords. The comparison is mainly implemented with the kNN algorithm, so that duplicate data is automatically deduplicated and sorted.

kNN (k-nearest neighbors) is a basic classification and regression method. It relies on the observation that samples of the same class cluster together in feature space, so the algorithm can be used to classify and screen the data.
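A generic kNN classifier can be sketched in Go as follows. The patent does not say which features are compared, so the two-dimensional similarity scores and the "duplicate" and "new" labels here are purely hypothetical; the code shows only the nearest-neighbor vote itself.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Sample is a labelled point in feature space.
type Sample struct {
	Features []float64
	Label    string
}

// distance returns the Euclidean distance between two equal-length vectors.
func distance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// classify returns the majority label among the k training samples
// nearest to x. On a tie, the label that reached the top count first wins.
func classify(train []Sample, x []float64, k int) string {
	sorted := make([]Sample, len(train))
	copy(sorted, train)
	sort.SliceStable(sorted, func(i, j int) bool {
		return distance(sorted[i].Features, x) < distance(sorted[j].Features, x)
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	votes := map[string]int{}
	best, bestVotes := "", 0
	for _, s := range sorted[:k] {
		votes[s.Label]++
		if votes[s.Label] > bestVotes {
			best, bestVotes = s.Label, votes[s.Label]
		}
	}
	return best
}

func main() {
	// Assumed features: similarity scores against an existing record.
	train := []Sample{
		{[]float64{0.95, 0.90}, "duplicate"},
		{[]float64{0.90, 0.85}, "duplicate"},
		{[]float64{0.88, 0.95}, "duplicate"},
		{[]float64{0.10, 0.20}, "new"},
		{[]float64{0.20, 0.15}, "new"},
		{[]float64{0.05, 0.30}, "new"},
	}
	fmt.Println(classify(train, []float64{0.92, 0.88}, 3)) // duplicate
	fmt.Println(classify(train, []float64{0.15, 0.20}, 3)) // new
}
```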

E. When data enters the Elasticsearch library, word segmentation and semantic analysis are performed on the collected data using the word segmentation and semantic analysis technology of Elasticsearch; this extracts more valuable information from the collected data and improves the search hit rate and search speed. The result is the final data set to be stored.

The word segmentation and semantic analysis technology mainly uses the jieba word segmenter, which performs word segmentation with the following algorithm:

E1. Efficient word-graph scanning is performed based on a prefix dictionary, and a directed acyclic graph (DAG) of all possible words that the Chinese characters in the sentence can form is generated.

E2. Dynamic programming is used to search for the maximum-probability path and find the best segmentation based on word frequency.

E3. For unknown words, in order to convert pinyin into Chinese characters and segment characters into words, an HMM model is adopted together with the Viterbi algorithm; the algorithm computes the optimal result from a large amount of real data. Its principle is as follows:

The probability distribution of each state S_t in the random process depends only on its previous state S_{t-1}, i.e. P(S_t | S_1, S_2, S_3, …, S_{t-1}) = P(S_t | S_{t-1}).

The steps of the Viterbi algorithm are summarized as follows:

If the most probable path P (or shortest path) passes through some point, say X22, then the sub-path Q of P from the starting point S to X22 must be the shortest path between S and X22. Otherwise, replacing Q with the shortest path R from S to X22 would yield a path shorter than P, which is a contradiction. This shows that the principle of optimality is satisfied.

The path from S to E must pass through some state at time i. Assuming there are k states at time i, if the shortest paths from S to all k nodes of the i-th state are recorded, the final shortest path must pass through one of them; thus at any time only a very limited number of shortest paths need to be considered.

Combining the two points above: suppose that when moving from state i to state i+1, the shortest paths from S to every node of state i have already been found and recorded on those nodes. Then, to compute the shortest path from the starting point S to a node X_{i+1,j} of state i+1, it is only necessary to consider the shortest paths from S to all k nodes of the previous state i and the distances from those nodes to X_{i+1,j}.
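The recursion described above can be sketched in Go as follows. This is a generic Viterbi decoder over a toy two-state model whose probabilities are made up for illustration; they are not jieba's trained parameters, and the function and variable names are hypothetical.

```go
package main

import "fmt"

// viterbi returns the most probable hidden-state sequence for obs, given
// start probabilities, a transition matrix trans[from][to] and an emission
// matrix emit[state][observation].
func viterbi(obs []int, start []float64, trans, emit [][]float64) []int {
	if len(obs) == 0 {
		return nil
	}
	k := len(start)
	prob := make([][]float64, len(obs)) // prob[t][s]: best path probability ending in s at time t
	back := make([][]int, len(obs))     // back[t][s]: predecessor of s on that best path

	prob[0] = make([]float64, k)
	back[0] = make([]int, k)
	for s := 0; s < k; s++ {
		prob[0][s] = start[s] * emit[s][obs[0]]
	}

	for t := 1; t < len(obs); t++ {
		prob[t] = make([]float64, k)
		back[t] = make([]int, k)
		for s := 0; s < k; s++ {
			// Only the k best paths into the previous step are considered.
			for p := 0; p < k; p++ {
				v := prob[t-1][p] * trans[p][s] * emit[s][obs[t]]
				if v > prob[t][s] {
					prob[t][s], back[t][s] = v, p
				}
			}
		}
	}

	// Pick the best final state, then follow the back-pointers.
	last := 0
	for s := 1; s < k; s++ {
		if prob[len(obs)-1][s] > prob[len(obs)-1][last] {
			last = s
		}
	}
	path := make([]int, len(obs))
	path[len(obs)-1] = last
	for t := len(obs) - 1; t > 0; t-- {
		path[t-1] = back[t][path[t]]
	}
	return path
}

func main() {
	// Toy model: 2 hidden states, 3 observation symbols, made-up numbers.
	start := []float64{0.6, 0.4}
	trans := [][]float64{{0.7, 0.3}, {0.4, 0.6}}
	emit := [][]float64{{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}}
	fmt.Println(viterbi([]int{0, 1, 2}, start, trans, emit)) // [0 0 1]
}
```

The sketch multiplies raw probabilities for readability; on long sequences, summing log-probabilities avoids numeric underflow.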

F. The data set can be opened to terminal users through the interface configuration page.

Claims

1. The high-concurrency distributed data retrieval method based on the golang middleware coroutine mode is characterized by comprising the following specific steps of:

2. The method for highly concurrent distributed data retrieval based on the golang middleware coroutine mode as claimed in claim 1, wherein the data information comprises text data and text data configuration information.

3. The method for highly concurrent distributed data retrieval based on the golang middleware coroutine mode as claimed in claim 1, wherein the data information comprises configuration database connection information and table information.

4. The method for highly concurrent distributed data retrieval based on the golang middleware coroutine mode as claimed in claim 1, wherein in the step B, the feature rules are used as the basis for page rendering, data arrangement and data storage.

5. The highly concurrent distributed data retrieval method based on the golang middleware coroutine mode as claimed in claim 1, wherein in the step B, the automatic addition of the index, type and type structure of the Elasticsearch is performed by adding a text directory or adding a database and a table in combination with a system background automation program according to a set of configured data structure mapping.

6. The method for highly concurrent distributed data retrieval based on the golang middleware coroutine mode as claimed in claim 1, wherein in the step C, the collected data environment is arranged by combining automation and manual input configuration.

7. The method for highly concurrent distributed data retrieval based on the golang middleware coroutine mode as claimed in claim 1, wherein the step D comprises the following specific steps:

8. The method for highly concurrent distributed data retrieval based on the golang middleware coroutine mode as claimed in claim 7, wherein in the step D2, the data comparison mainly uses the kNN algorithm to classify and screen the data.

9. The high-concurrency distributed data retrieval method based on the golang middleware coroutine mode as claimed in claim 1, wherein in said step E, the word segmentation and semantic analysis technique mainly uses a jieba word segmenter to implement word segmentation by the following algorithm: