CN112269913A

CN112269913A - Enterprise-level full data intelligent search implementation method and system

Info

Publication number: CN112269913A
Application number: CN202011174923.0A
Authority: CN
Inventors: 倪时龙; 张怀刚; 罗建新; 陈颖华; 郑敏; 钱新红
Original assignee: Fujian Zefu Software Co ltd
Current assignee: Fujian Zefu Software Co ltd
Priority date: 2020-10-28
Filing date: 2020-10-28
Publication date: 2021-01-26

Abstract

An enterprise-level full data intelligent search implementation method and system. The system comprises an access layer, a model layer, an algorithm layer, a component layer, a service layer and a display layer. The system design follows the "four unified" principles: unified leadership, unified planning, unified standards and unified construction. The construction of the demonstration application fully considers the security protection, fault tolerance and anti-interference capability of the system, ensures long-term stable, safe, reliable and efficient operation, and provides good compatibility and extensibility. It follows a customer-centered design concept, provides a consistent and user-friendly experience, meets actual business needs to the greatest extent, and is convenient to operate, functionally complete and friendly in interface. The application design adopts an internationally advanced technical route, makes full use of technology compatible with legacy and heterogeneous systems, and protects the existing IT investment of State Grid Corporation. It conforms to international and national general standards and supports various hardware platforms.

Description

Enterprise-level full data intelligent search implementation method and system

Technical Field

The invention relates to the technical field of information, in particular to a method and a system for realizing enterprise-level full data intelligent search.

Background

Information technologies such as cloud computing, big data, the Internet of Things and mobile applications are developing rapidly. The company has carried out a large amount of research and application work, which brings new opportunities for changing the production and management modes of enterprises. During the 13th Five-Year Plan period, the company's business support is being upgraded from partially supporting business operations to comprehensively assisting analysis and decision-making. This requires strengthening the comprehensive analysis and mining of data across business fields, strengthening the analysis of semi-structured and unstructured data, and building more regular, self-learning data analysis. The information system must provide storage, fast computation, and deep analysis and mining of massive data of various types, together with rich information visualization capability, so as to minimize uncertainty, randomness and subjectivity in the decision process, enhance the rationality, scientific basis and responsiveness of decisions, and improve the benefit and efficiency of decision-making, thereby leading the development of the intelligent enterprise in a more centralized, more intelligent and more interactive direction. Accelerating data flow and improving data utilization efficiency, with fast and accurate data retrieval, efficient management and effective utilization for applications, has therefore become a new challenge.

Based on their research focus and service performance in different periods, search engines can be divided into two generations:

First-generation search engines appeared in 1994, represented by Yahoo, InfoSeek, AltaVista and others. They used manual or semi-manual indexing methods and keyword-based meta search techniques, with the goal of finding as many web pages as possible. Such engines generally indexed fewer than 1 million web pages, rarely re-collected pages or refreshed the index, and were very slow, typically requiring a wait of 10 seconds or longer. In implementation they basically adopted mature technologies from information retrieval (IR), networking and databases, and amounted to applications of existing technologies on the WWW.

Second-generation search engine systems appeared in 1996. Most adopt a distributed scheme (multiple microcomputers working together) to increase data scale, response speed and the number of users. They generally maintain an index database of about 50 million web pages and can respond to 10 million user retrieval requests each day. The development direction is that index databases continue to grow, with typical commercial search engines maintaining on the order of tens or even hundreds of millions of web pages.

1) Existing deficiencies and drawbacks

Existing search engines still have a number of defects, mainly in the following aspects:

2) Logical operators

The query functions provided by existing search engines are quite limited, and most search engines only provide the most basic Boolean connections among keywords. For example, Yahoo only provides AND and OR operations, and once a logical operator is selected, it must be applied to all keywords. The Open Text Index allows users to use different Boolean operators, but allows only 4 operators, which must be evaluated in order of occurrence. A query language as expressive as SQL cannot be used in existing search engines.

3) Queries using keywords only

Existing search engines only allow a query to be composed of a set of keywords and logical operators. Keyword retrieval is a blind match that cannot fully satisfy the user's requirements, while natural language understanding is a very difficult task that is still under study.

4) Inability to retrieve historical information

Every search starts from scratch; the user cannot further refine the results of a previous query.

5) Simple result representation method

Most search engines return only a long list of search results, typically several pages long. The list may contain thousands of links to Web sites. Users may not have the patience to go through all of them, so they select only a small portion and discard the rest, and may lose much useful information as a result.

6) Limitation of a single engine

As the amount of information on the Web grows, no single search engine can cover the whole network. The capability of the indexing robot, the size of the index database, the system maintenance overhead and similar factors all limit the recall of a single search engine. Users must therefore try multiple search engines to find the information they want. At worst, the engines overlap and the user repeatedly finds the same piece of information. Some solutions, such as meta search engines and distributed search engines, have appeared. In addition, it is reported that a major commercial search engine receives 15,000 to 20,000 queries per minute, which puts great pressure on the index server.

7) Difficulty in providing effective personalized services for users

Because different users have different interests and hobbies, the retrieval results they require are also user-specific. Existing search engines cannot provide effective personalized service for an individual user, which greatly increases the time the user spends finding useful information.

Disclosure of Invention

(I) Technical problem to be solved

Aiming at the defects of the prior art, the invention provides a method and a system for realizing enterprise-level full data intelligent search.

(II) Technical scheme

To achieve the above purpose, the invention provides the following technical scheme: an enterprise-level full data intelligent search system, comprising:

an access layer: realizing the collection of unstructured data sources from the unstructured data platform, the enterprise portal and knowledge management, including their permission information;

a model layer: establishing a permission model, a business model, an interest domain model and a similarity model, and providing a modeling basis for data analysis;

an algorithm layer: using data mining and analysis algorithms to establish the analysis processes of feature value modeling, data analysis and retrieval, model evaluation and high-dimensional visualization;

a component layer: developing common components to support calls from upper-layer services, realizing data relationship analysis and association based on a permission-aware data retrieval component, a named entity recognition component, an automatic tagging component and the like, and providing basic components for the comprehensive utilization of data;

a service layer: applying the models of the model layer and the algorithms of the algorithm layer to data, and encapsulating the components as required to form common services that support the business;

a display layer: displaying the retrieval results, the retrieval results covering the enterprise portal system, the knowledge management system and the "three intensifications and five large systems" business systems.

As an improvement of the invention, the service layer comprises a cross-business retrieval service, a document association retrieval service, a related recommendation service and an automatic pushing service.

As an improvement of the invention, a corpus is provided in the access layer. The corpus is processed and constructed by combining manual and automated program methods according to the requirements of the business field and the condition of the data. On the manual side, the relevant business data is sorted according to the requirements of the business scenario and classified accordingly. On the automated side, word segmentation together with machine learning feature modeling and pattern analysis is mainly used to characterize the corpus of a given field, so that different types of corpora are established.

As an improvement of the invention, the algorithms in the algorithm layer comprise: association rule and sequence pattern algorithms, classification and prediction pattern algorithms, cluster analysis pattern algorithms and heterogeneous analysis pattern algorithms.

The invention further provides an enterprise-level full data intelligent search implementation method, which comprises the following steps:

accessing: realizing the collection of unstructured data sources from the unstructured data platform, the enterprise portal and knowledge management, including their permission information;

modeling: establishing a permission model, a business model, an interest domain model and a similarity model, and providing a modeling basis for data analysis;

algorithm: using data mining and analysis algorithms to establish the analysis processes of feature value modeling, data analysis and retrieval, model evaluation and high-dimensional visualization;

components: developing common components to support calls from upper-layer services, realizing data relationship analysis and association based on a permission-aware data retrieval component, a named entity recognition component, an automatic tagging component and the like, and providing basic components for the comprehensive utilization of data;

service: applying the models of the model layer and the algorithms of the algorithm layer to data, and encapsulating the components as required to form common services that support the business;

displaying: displaying the retrieval results.

As an improvement of the invention, the accessing step comprises construction of a corpus, which specifically comprises two steps:

step 1, collecting unstructured data sources from the unstructured data platform, the enterprise portal and knowledge management, including their permission information, analyzing the collected unstructured data, manually classifying the data, and storing the data into the corpus;

step 2, processing and constructing a corpus of the electric power dictionary through manual and automated program methods: on the manual side, sorting the relevant business data according to business scenario requirements and classifying it accordingly; on the automated side, mainly using word segmentation together with machine learning feature modeling and pattern analysis to characterize the corpus of a given field;

and manually sorting and classifying the external dictionary and storing it into the corpus.
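As a non-limiting illustration of the corpus construction described above, the following minimal Python sketch groups documents into corpus categories using a manually classified dictionary and a rough word segmentation. The dictionary contents, the category names, the sample documents and the function names `segment` and `build_corpus` are assumptions made for illustration only; the disclosure does not specify a segmentation tool or a data layout.

```python
import re
from collections import defaultdict

# Hypothetical manually classified dictionary: term -> category.
DOMAIN_DICTIONARY = {
    "transformer": "equipment",
    "overhaul": "maintenance",
    "budget": "finance",
}


def segment(text):
    """Very rough word segmentation: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_corpus(documents):
    """Group document ids by the categories of the dictionary terms they contain."""
    corpus = defaultdict(set)
    for doc_id, text in documents.items():
        for token in segment(text):
            category = DOMAIN_DICTIONARY.get(token)
            if category is not None:
                corpus[category].add(doc_id)
    return {category: sorted(ids) for category, ids in corpus.items()}


# Hypothetical collected documents.
documents = {
    "d1": "Transformer overhaul plan for 2020",
    "d2": "Annual budget for transformer purchases",
}
corpus = build_corpus(documents)
print(corpus)
```

In practice the manual classification would come from business experts and the segmentation from a proper word segmentation component; the sketch only shows how the two can be combined.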

As an improvement of the invention, the method further comprises a step of preprocessing the data of the corpus, in which the data is processed by a word segmentation component, a filtering component and a user literary composition component.
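The preprocessing step can be sketched as follows, purely as an illustration. The sketch assumes that the third component behaves like a user-defined phrase list whose entries are kept whole during segmentation; this reading, the stop-word list, the phrase list and the function name `preprocess` are all assumptions and are not taken from the disclosure.

```python
import re

STOP_WORDS = {"the", "of", "and", "for", "a", "to"}  # illustrative stop-word list
USER_TERMS = {"big overhaul"}                         # hypothetical user-defined phrases


def preprocess(text):
    """Segment text, keep user-defined phrases whole, and drop stop words."""
    text = text.lower()
    tokens = []
    # Extract user-defined phrases first so that they are not split apart.
    for phrase in sorted(USER_TERMS, key=len, reverse=True):
        if phrase in text:
            tokens.append(phrase)
            text = text.replace(phrase, " ")
    # Segment the remaining text into simple alphanumeric tokens.
    tokens.extend(re.findall(r"[a-z0-9]+", text))
    # Filtering component: remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The construction of the big overhaul system")
print(tokens)
```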

As an improvement of the invention, the algorithm step comprises a word similarity algorithm, a document similarity algorithm, user behavior analysis and project characteristic analysis.
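The disclosure does not state which similarity measure is used. As one common choice, document similarity can be sketched with cosine similarity over raw term counts; the function name `cosine_similarity` and the sample token lists below are assumptions for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = ["overhaul", "plan", "transformer"]
doc2 = ["overhaul", "plan", "budget"]
doc3 = ["holiday", "notice"]

print(cosine_similarity(doc1, doc2))  # two of three terms shared
print(cosine_similarity(doc1, doc3))  # no shared terms
```

A weighted representation such as TF-IDF could replace the raw counts without changing the structure of the computation.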

As an improvement of the invention, the modeling step comprises fuzzy retrieval, an interest domain and a business relationship graph.

(III) Advantageous effects

Compared with the prior art, the invention provides an enterprise-level full data intelligent search system with the following beneficial effects. The system design follows the "four unified" principles: unified leadership, unified planning, unified standards and unified construction. The construction of the demonstration application fully considers the security protection, fault tolerance and anti-interference capability of the system, ensures long-term stable, safe, reliable and efficient operation, and provides good compatibility and extensibility. It follows a customer-centered design concept, provides a consistent and user-friendly experience, meets actual business needs to the greatest extent, and is convenient to operate, functionally complete and friendly in interface. The application design adopts an internationally advanced technical route, makes full use of technology compatible with legacy and heterogeneous systems, and protects the existing IT investment of State Grid Corporation. It conforms to international and national general standards, supports various hardware platforms, and has good openness and portability. A standard open platform interface is adopted to support data exchange and sharing with other systems, which facilitates maintenance, expansion and interconnection.

Drawings

FIG. 1 is a schematic diagram of the system of the present invention;

FIG. 2 is a flow chart of the present invention;

FIG. 3 is a case flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1 to FIG. 3, an enterprise-level full data intelligent search system includes:

The technical route is selected within the overall architecture of State Grid. Following the principles of localization and of reducing repeated construction, a mode combining in-house development with mature open source software is selected to build the system rapidly while ensuring that the architecture is advanced and stable.

After investigating the mature open source full-text retrieval products in the industry, it was found that ElasticSearch is an open source, distributed, RESTful search engine built on Lucene. It is designed for distributed computing and can achieve real-time, stable, reliable and fast search.
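For illustration, a search request to such an engine can be expressed as a JSON body in the Elasticsearch Query DSL. The sketch below only builds the request body and does not contact a server. The `bool`, `multi_match` and `terms` clauses are standard Query DSL constructs, while the field names `title`, `body`, `attachment_text` and `allowed_roles`, as well as the function name `build_search_body`, are hypothetical and not taken from the disclosure.

```python
import json


def build_search_body(keywords, user_roles, size=10):
    """Build a Query DSL body: full-text match plus a non-scoring role filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match; "title^2" boosts matches in the title.
                "must": [
                    {
                        "multi_match": {
                            "query": keywords,
                            "fields": ["title^2", "body", "attachment_text"],
                        }
                    }
                ],
                # Filter context: restricts hits without affecting the score.
                "filter": [{"terms": {"allowed_roles": user_roles}}],
            }
        },
    }


body = build_search_body("overhaul plan", ["staff", "maintenance"])
payload = json.dumps(body, indent=2)
print(payload)
```

Placing the permission restriction in the filter context keeps relevance ranking driven only by the text match, which fits a design in which permissions decide visibility rather than order.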

displaying: displaying the retrieval results.

A one-stop retrieval service is constructed, providing data classification and pushing functions. A batch transmission interface to the unstructured platform is developed to realize automatic extraction, classification and clustering of unstructured data. In combination with enterprise-level knowledge management, unstructured data is automatically collected into knowledge: classification, clustering and knowledge pushing from unstructured data to knowledge management are realized in the question-and-answer bar, the knowledge construction module and the new knowledge recommendation module of knowledge management, which improves the quality of knowledge, supports the conversion of experience into knowledge, and lays a foundation for cross-business associated use of data.

The system and the method provide different retrieval modes, specifically as follows:

(1) Comprehensive retrieval

A one-stop search service over all company data is constructed to realize centralized retrieval of data from different businesses and to solve the problem that multi-source information is searched separately across systems and businesses. Comprehensive retrieval expands only the breadth and depth of retrieval and realizes one-to-one exact retrieval.

In the search box and other corresponding visual search interfaces, the user can perform exact retrieval, fuzzy retrieval and combined retrieval under certain logical relations, based on single words and phrases. The user can freely select the retrieval depth, and retrieval can be restricted to specific ranges such as title, body text, attachments and other key attributes. For example, if the user enters the term "three intensifications and five large systems" (三集五大), the system returns a list of results whose title, body text, attachments or other key attributes contain the four characters "三集五大".
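As a non-limiting sketch of the three retrieval modes described above (exact, fuzzy and combined, each optionally restricted to selected fields), the following Python code searches a small in-memory document list. The documents, the field names, the similarity threshold and the function names are assumptions for illustration; fuzzy matching here uses the standard library `difflib.SequenceMatcher`, which is a stand-in and not the method of the disclosure.

```python
from difflib import SequenceMatcher

FIELDS = ("title", "body", "attachment")

# Hypothetical indexed documents.
DOCS = [
    {"id": 1, "title": "Overhaul plan", "body": "Schedule for transformer overhaul", "attachment": ""},
    {"id": 2, "title": "Budget notice", "body": "Annual budget", "attachment": "overhall cost table"},
]


def exact_search(term, fields=FIELDS):
    """Return ids of documents containing the term in any selected field."""
    term = term.lower()
    return [d["id"] for d in DOCS if any(term in d[f].lower() for f in fields)]


def fuzzy_search(term, fields=FIELDS, threshold=0.8):
    """Return ids of documents with a word that is close to the term."""
    term = term.lower()
    hits = []
    for d in DOCS:
        words = [w for f in fields for w in d[f].lower().split()]
        if any(SequenceMatcher(None, term, w).ratio() >= threshold for w in words):
            hits.append(d["id"])
    return hits


def combined_search(terms, operator="AND", fields=FIELDS):
    """Combine exact searches for several terms with AND or OR."""
    result_sets = [set(exact_search(t, fields)) for t in terms]
    if not result_sets:
        return []
    if operator == "AND":
        merged = set.intersection(*result_sets)
    else:
        merged = set.union(*result_sets)
    return sorted(merged)


print(exact_search("overhaul"))                       # exact match only
print(fuzzy_search("overhaul"))                       # also tolerates the misspelling "overhall"
print(exact_search("budget", fields=("attachment",)))  # search restricted to attachments
print(combined_search(["overhaul", "budget"], "OR"))
```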

(2) Relevance retrieval

Whereas comprehensive retrieval retrieves the scattered information points needed by the user from massive information and returns the results in the form of knowledge points, relevance retrieval links the scattered information points from multiple angles through certain relevance relations and returns the results to the user in the form of a relationship network. In each retrieval, the user not only obtains the comprehensive retrieval results but may also learn new facts or new relations, which can prompt a series of new search queries and make the retrieval deeper and broader. Relevance retrieval is a fuzzy, extended search that realizes one-to-many retrieval of related meanings.

On the basis of comprehensive retrieval, a relation graph of the information points is constructed, forming a complete knowledge network relating the search source and the results. A sequence is formed according to the degree of relation between pieces of information, and the relations are displayed layer by layer, from inner to outer, in a graphical visualization. For example, if the user enters the term "three intensifications and five large systems" (三集五大), the system returns a list of results whose title, body text, attachments or other key attributes contain the four characters "三集五大". At the same time, related results are returned for the topics covered by the term, such as the intensive management of human resources, finance and materials, and the large planning, large construction, large production, large overhaul and large marketing systems; related information such as "twelve and four sets" may also be returned.
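The layer-by-layer display by degree of relation can be illustrated with a breadth-first traversal of a relation graph. The graph contents and the function name `relation_layers` below are hypothetical; how the relations are actually derived is not specified in the disclosure.

```python
# Hypothetical relation graph: node -> directly related nodes.
RELATIONS = {
    "three intensifications and five large systems": [
        "human resources",
        "finance",
        "large overhaul",
    ],
    "large overhaul": ["overhaul plan", "finance"],
    "finance": ["budget"],
}


def relation_layers(graph, source, max_degree=2):
    """Return related nodes grouped by degree of relation (1, 2, ...)."""
    seen = {source}
    layers = []
    frontier = [source]
    for _ in range(max_degree):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        if not next_frontier:
            break
        layers.append(next_frontier)
        frontier = next_frontier
    return layers


layers = relation_layers(RELATIONS, "three intensifications and five large systems")
for degree, nodes in enumerate(layers, start=1):
    print(degree, nodes)
```

Each inner list corresponds to one ring of the visualization, with the search source at the center.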

On the basis of comprehensive retrieval, a context map of the information points is constructed, forming the complete evolution of the information relating the search source and the results. A sequence is formed according to key attributes of the information, such as degree of relation or time, and the items are displayed one by one, from far to near, in a graphical visualization. For example, if the user enters the term "large overhaul system", the system returns a list of results whose title, body text, attachments or other key attributes contain that five-character term, together with a relation graph related to the large overhaul system. In addition, the system returns the key milestones in the construction of the large overhaul system along a time axis, in the form of a chronicle of events. The context map is provided only for some event-type search sources.
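A context map ordered along the time axis can be sketched as a filter followed by a sort on date. The event records, including their titles and dates, are invented for illustration, and the function name `context_map` is an assumption.

```python
from datetime import date

# Hypothetical event records; titles and dates are illustrative only.
EVENTS = [
    {"title": "Pilot deployment", "topic": "large overhaul system", "date": date(2012, 6, 1)},
    {"title": "Construction plan issued", "topic": "large overhaul system", "date": date(2011, 3, 15)},
    {"title": "Budget approved", "topic": "finance", "date": date(2011, 1, 10)},
]


def context_map(events, topic):
    """Return (ISO date, title) pairs for one topic, oldest first."""
    selected = [e for e in events if e["topic"] == topic]
    selected.sort(key=lambda e: e["date"])
    return [(e["date"].isoformat(), e["title"]) for e in selected]


chronicle = context_map(EVENTS, "large overhaul system")
for when, title in chronicle:
    print(when, title)
```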

The system also has an automatic pushing function. Retrieval can be divided into active retrieval and passive retrieval: comprehensive retrieval and relevance retrieval are active retrieval initiated by the user, while automatic pushing is passive retrieval initiated by the system toward the user. The system can push hot-spot information and important company information in real time without any retrieval operation by the user. The system also lets the user preselect topics of interest and automatically pushes information that falls within the preselected range.
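The two push triggers described above (hot-spot information for everyone, and information within a user's preselected range) can be sketched as a simple selection rule. The view-count threshold, the tag representation and the function name `select_pushes` are assumptions for illustration; the disclosure does not define how hot-spot information is identified.

```python
def select_pushes(items, user_interests, hot_threshold=100):
    """Return ids of items to push: hot items, or items matching user interests."""
    pushes = []
    for item in items:
        is_hot = item["views"] >= hot_threshold           # hot-spot information
        matches = bool(set(item["tags"]) & user_interests)  # preselected focus
        if is_hot or matches:
            pushes.append(item["id"])
    return pushes


# Hypothetical newly indexed items.
items = [
    {"id": "n1", "tags": {"overhaul"}, "views": 5},
    {"id": "n2", "tags": {"holiday"}, "views": 500},
    {"id": "n3", "tags": {"holiday"}, "views": 3},
]

print(select_pushes(items, {"overhaul"}))
```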

Through project construction, a one-stop intelligent search based on the unstructured data management platform is built. It provides users at all levels of the company with an entry point to unstructured data information resources that is business-oriented, integrated, intelligent, proactive and personalized, explores technical methods for mining the value of unstructured big data at the information retrieval layer, reduces the construction cost of other business systems, and improves the construction benefit of the unstructured data management platform. This specifically comprises the following three business targets:

1. A comprehensive search service over all company data is constructed to realize centralized search of data from different businesses and to solve the problem that multi-source information is searched separately across systems and businesses. Comprehensive retrieval expands only the breadth and depth of search and realizes one-to-one exact search.

2. Comprehensive retrieval returns results to the user in the form of knowledge points; on this basis, relevance retrieval links the scattered information points from multiple angles through certain relevance relations and returns the results to the user in the form of a relationship network, making the search deeper and broader.

3. The current situation of purely user-driven retrieval is changed, realizing the shift from "people finding data" to "data finding people": comprehensive retrieval and relevance retrieval are active retrieval initiated by the user, while automatic pushing is passive retrieval initiated by the system toward the user.

The overall flow is illustrated below by a simple example, with reference to FIG. 3:

1. Data acquisition: the data collected for enterprise search includes, in addition to web pages, various databases (structured, semi-structured and unstructured), electronic documents, texts, multimedia and the like. The heterogeneous data is extracted and integrated, and the data is cleaned.

2. Data modeling: the acquired data undergoes preprocessing (word segmentation, stop-word removal, function-word filtering and the like), feature representation, feature selection and feature weight calculation. A knowledge graph, a user interest model and a similarity model are established using text mining analysis algorithms, providing model support for data retrieval and display.

3. User request: word segmentation and semantic understanding are performed on the keywords or phrases entered by the user, and the recognized terms are passed to the search engine together with the user's permission information.

4. Search engine processing: the retrieval request is first filtered by permission, the index results that satisfy the permission are retrieved, and by default the results are ranked from high to low relevance.

5. Result display: based on the data models, the retrieval results are displayed visually, for example as a business association graph or an interest domain graph.
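Steps 3 and 4 above can be sketched as follows: the query is split into terms, documents the user is not permitted to see are filtered out first, and the remaining hits are ranked by a relevance score. The index contents, the role-based permission representation and the term-count score are simplifying assumptions for illustration, not the scoring used by an actual search engine.

```python
# Hypothetical index entries with role-based permissions.
INDEX = [
    {"id": "a", "text": "overhaul plan for transformer", "roles": {"maintenance"}},
    {"id": "b", "text": "overhaul budget plan", "roles": {"finance"}},
    {"id": "c", "text": "overhaul notice", "roles": {"maintenance", "finance"}},
]


def search(query, user_roles):
    """Permission-filter the index, then rank hits by term-count relevance."""
    terms = query.lower().split()  # stand-in for word segmentation
    results = []
    for doc in INDEX:
        if not (doc["roles"] & user_roles):
            continue  # permission filtering happens before scoring
        words = doc["text"].split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            results.append((score, doc["id"]))
    # Highest score first; ties broken by id for a stable order.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [doc_id for _, doc_id in results]


print(search("overhaul plan", {"maintenance"}))
print(search("overhaul plan", {"finance"}))
```

The same query returns different result lists for different roles, which is the intended effect of filtering by permission before ranking.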

To realize the final goal of cross-business, strongly associated and intelligent one-stop search of enterprise information, the system is developed step by step, in separate points and stages. The work target of the current project phase is mainly to build the system framework of the one-stop search engine and to complete the research and development of important functions such as cross-business retrieval, strong association and automatic pushing. Typical scenarios in the field of informatization construction are sorted out, and the application effect is verified through trial operation.

New functions: development is completed for 4 newly added function modules, namely comprehensive retrieval (5 secondary modules, including cross-business-system retrieval, exact retrieval and fuzzy retrieval), relevance retrieval (2 secondary modules: constructing the relation graph of information points and constructing the context map of information points), automatic pushing (5 secondary modules, including real-time information pushing, the user interest model and the recommendation algorithm), and the knowledge collection and one-stop retrieval service pilot (2 secondary modules: knowledge collection and one-stop retrieval service).

1) Pilot deployment and implementation at the company headquarters is completed.

2) Data integration of 4 systems (cooperative office, IRS, knowledge management and portal) is completed, realizing centralized retrieval of related materials in the informatization construction process and improving the working efficiency of staff.

3) Single sign-on integration with the portal is completed, making retrieval convenient for users.

4) Integration of the unstructured platform with unified permissions is completed.

The system functions are mainly divided into the following 6 blocks:

System and data integration: at this stage, each integrated business system is required to sort out its business model and permission model and to provide a unified search interface, so that the business systems are integrated and adapted.

Comprehensive retrieval: a one-stop search service over all company data is constructed to realize centralized retrieval of data from different businesses and to solve the problem that multi-source information is searched separately across systems and businesses. Comprehensive retrieval expands only the breadth and depth of retrieval and realizes one-to-one exact retrieval.

Relevance retrieval: whereas comprehensive retrieval retrieves the scattered information points needed by the user from massive information and returns the results in the form of knowledge points, relevance retrieval links the scattered information points from multiple angles through certain relevance relations and returns the results to the user in the form of a relationship network. In each retrieval, the user not only obtains the comprehensive retrieval results but may also learn new facts or new relations, which can prompt a series of new search queries and make the retrieval deeper and broader. Relevance retrieval is a fuzzy, extended search that realizes one-to-many retrieval of related meanings.

Automatic pushing: retrieval can be divided into active retrieval and passive retrieval; comprehensive retrieval and relevance retrieval are active retrieval initiated by the user, while automatic pushing is passive retrieval initiated by the system toward the user.

Retrieval application: analysis is performed according to interest domain models based on user identity information and the like, and the recommendation results are displayed through knowledge management. After logging in to the portal system, the user can enter one or more keywords in the search interface of the portal system to initiate retrieval. Encryption and decryption algorithms prevent the user information from being stolen or tampered with during transmission, and after logging in to the portal the user can access full-text retrieval without entering a user name and password again.

System management: management of users and logging tasks is realized.
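The single sign-on hand-off described under retrieval application can be illustrated with a signed token passed from the portal to the search system. The disclosure does not name the encryption and decryption algorithms used; the sketch below swaps in an HMAC-SHA256 signature, which detects tampering but does not hide the payload, so confidentiality would additionally rely on encryption or an encrypted transport. The shared secret, the token layout and the function names are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-shared-secret"  # hypothetical key shared by portal and search system


def issue_token(username):
    """Portal side: encode the user name and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps({"user": username}).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token):
    """Search side: return the user name if the signature is valid, else None."""
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # no separator, so not a token in the expected layout
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # payload or signature was altered in transit
    return json.loads(base64.urlsafe_b64decode(payload))["user"]


token = issue_token("zhang.san")
print(verify_token(token))        # valid token: user name is recovered
print(verify_token("x" + token))  # altered token: rejected
```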

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An enterprise-level full-data intelligent search system, comprising:

2. The system of claim 1, wherein the service layer comprises a cross-business retrieval service, a document association retrieval service, a related recommendation service, and an automatic push service.

3. The system according to claim 1, wherein a corpus is provided in the access layer; the corpus is processed and constructed by combining manual and automated program methods according to business field requirements and data conditions; on the manual side, the relevant business data is sorted according to business scenario requirements and classified accordingly; on the automated side, word segmentation together with machine learning feature modeling and pattern analysis is mainly used to characterize the corpus of a given field, so that different types of corpora are established.

4. The system of claim 1, wherein the algorithms in the algorithm layer comprise: association rule and sequence pattern algorithms, classification and prediction pattern algorithms, cluster analysis pattern algorithms and heterogeneous analysis pattern algorithms.

5. An enterprise-level full data intelligent search implementation method is characterized by comprising the following steps:

displaying: displaying the retrieval results.

6. The method for implementing enterprise-level full data intelligent search according to claim 5, wherein the accessing step comprises construction of a corpus, which specifically comprises two steps:

7. The method of claim 6, further comprising a step of preprocessing the data of the corpus, in which the data is processed by a word segmentation component, a filtering component and a user literary composition component.

8. The method of claim 5, wherein the algorithm steps include word similarity algorithm, document similarity algorithm, user behavior analysis, and project feature analysis.

9. The method of claim 5, wherein the model step comprises fuzzy search, interest domain and business relationship graph.