CN113051898A - Word meaning accumulation and word segmentation method, tool and system for structured data searched by natural language

Info

Publication number
CN113051898A
Authority
CN
China
Prior art keywords
dimension
word
dictionary
virtual
data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911372759.1A
Other languages
Chinese (zh)
Inventor
余宙
杨永智
陈文佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Abbott Technology Co Ltd
Original Assignee
Beijing Abbott Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Abbott Technology Co Ltd
Priority to CN201911372759.1A
Publication of CN113051898A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a word meaning accumulation and word segmentation method, tool and system for structured data searched by natural language. Virtual dimension data are defined in a configuration library and the dictionary is updated accordingly. In the configuration library, each type of entity has a virtual dimension table comprising an entity name column and dimension columns, and the virtual dimension data are defined in that table. When structured data are searched, the natural language input by the user is first segmented using a personal dictionary; words not recognized by the personal dictionary are then segmented using a system dictionary, so that the natural language input is translated into a database query language. When more than N personal dictionaries define the same word sense for the same word, that word sense is synchronized from the personal dictionaries to the system dictionary. The invention makes word meaning accumulation accurate and fast.

Description

Word meaning accumulation and word segmentation method, tool and system for structured data searched by natural language
Technical Field
The invention relates to a word meaning accumulation and word segmentation method, tool and system for structured data searched by natural language.
Background
Searching structured data with Natural Language Processing (NLP) technology means translating a natural language query (Query) into a database query language (SQL) and querying the data contained in each table of a database. It is one of the key technologies under development for general-purpose search engines and for vertical database querying in various industries. Users differ in their background knowledge and in their familiarity with the table structure of the database being searched, and people tend to abbreviate and simplify. As a result, compared with the standardized names derived from the database that a search engine can resolve, the natural language vocabulary that users enter is very likely to be non-standard. The search success rate is therefore low, user frustration and confusion accumulate, and user stickiness is eventually lost.
At present, when an input word cannot be recognized while searching structured data with natural language processing technology, the vocabulary the search engine can recognize is refined and expanded through semantic accumulation, which improves the engine's semantic recognition capability. Semantic accumulation means adding an unrecognized word to the dictionary and specifying which element of the database the word refers to. Common semantic accumulation methods are:
1. Synonym matching. When the word that the search engine cannot recognize corresponds to a field name or a field value of a table, synonyms are added for that field name or field value. For example, a synonym table is created with fields such as main word, synonym and label, and entries for all dimension names (such as company short name), dimension values (such as Kweichow Moutai) and index names (such as net profit) that may be searched are generated under the main word field of the table. In the initial stage, a batch of synonyms for the main words is collected and imported in bulk into the corresponding synonym field. When a user searches with a synonym, the search engine looks up the corresponding main word in the dictionary table and searches with that main word. In other words, the semantic recognition capability of the search engine is expanded by widening the range of synonyms of the main words. In daily use, the search system keeps collecting user feedback and keeps adding synonyms for the main words.
2. Adding derivative calculation fields. A derivative calculation field is defined by a formula composed of basic fields, functions, operators and the like. First, the name of the derivative calculation field to be created, the basic fields it depends on, and its calculation formula are determined. A derivative calculation field table is then created, and this information (among other things) is imported into it. Dictionary words are generated for the derivative calculation field names in the table, so that when a user searches for a derivative field name, the search engine applies the corresponding formula, computes the result in real time and outputs it. In daily use, the search system can keep adding derivative calculation fields.
3. Adding professional terms. A professional term generally refers to a word or phrase formed by combining, summarizing and refining several dimensions, indexes or conditions, such as a term that groups net profit, basic earnings, return on net assets, undistributed profit, gross profit margin on sales, etc.
The above method has the following disadvantages:
1. For both synonym matching and derivative calculation fields, existing search systems have only one lexicon, so the sense of every word in it must be undisputed. For example, adding "profit" as a synonym for "net profit" requires confirming that most users regard the two as synonyms. If some users, say one fifth, consider "profit" a synonym for "profit margin" or "gross profit", the synonym cannot be added to the lexicon, because word senses must be uniform across the entire network.
2. Semantic accumulation depends on the working efficiency of system-level maintainers. If a user's request to add a word is not processed by the back-end maintainers in time, the user cannot continue working.
3. The semantics configured by existing methods all depend on data already in the database. For data with dimension attributes, the semantic recognition capability cannot be expanded by adding synonyms, defining derivative calculation formulas or defining professional terms in the existing dictionary. For example, a user searches for "Blossoms Say" concept stocks ("Blossoms Say" is a variety show). Under the concept field (dimension name), the database contains field values (dimension values) such as "artificial intelligence", "blockchain" and "pork", but no value "Blossoms Say". In this case "Blossoms Say" cannot be explained through synonyms, professional terms or similar means based on the data already in the database.
Disclosure of Invention
The invention provides a method, a tool and a system for word meaning accumulation and word segmentation of structured data for natural language search.
In order to solve the technical problem, the invention provides a word sense accumulation method for structured data of natural language search, which defines virtual dimension data in a configuration library and updates a dictionary for segmenting a keyword string submitted by a user; the virtual dimension data refers to dimension data which is not included by an entity in the structured data serving as a search object, and the dimension data comprises a dimension name and a dimension value; in the configuration library, each type of entity has a virtual dimension table, the virtual dimension table comprises an entity name column and a dimension column, and virtual dimension data is defined in the virtual dimension table.
Preferably, the method for defining the virtual dimension data comprises: when the unrecognized word has the dimension attribute in the structured data, determining an entity corresponding to the unrecognized word; if the unidentified word is a dimension name, establishing a dimension column named by the dimension name in a virtual dimension table, and assigning a dimension value to the entity under the dimension name; and if the unrecognized word is the dimension value, determining the dimension name corresponding to the dimension value, and then adding the unrecognized word serving as the dimension value to the dimension name of the entity name meeting the condition in the virtual dimension table.
Preferably, the search system includes a system dictionary and a personal dictionary; the personal dictionary is a dictionary defined by a specific user, and corresponds to a specific user ID; the system dictionary is a dictionary shared by unspecified users; the defined virtual dimension data is updated to the personal dictionary.
Preferably, word senses are also defined in the configuration library and updated to the personal dictionary in one or more of the following ways:
Method 1: creating or supplementing a synonym table;
Method 2: creating or supplementing a glossary;
Method 3: creating or supplementing a derivative calculation field table.
Preferably, a word segmentation method for structured data searched by natural language comprises the foregoing word sense accumulation method. When structured data are searched, the natural language input by the user is first segmented using the personal dictionary; words not recognized by the personal dictionary are then segmented using the system dictionary, so that the natural language input is translated into a database query language.
Preferably, the word segmentation method for structured data searched by natural language comprises the foregoing word sense accumulation method; when more than N personal dictionaries define the same word sense for the same word, that word sense is synchronized from the personal dictionaries to the system dictionary.
The invention also provides a word meaning accumulation tool for searching the structured data by the natural language, which comprises a virtual dimension data definition device, a word meaning accumulation module and a word meaning storage module, wherein the virtual dimension data definition device is used for defining virtual dimension data in a configuration library and further updating a dictionary; the virtual dimension data refers to dimension data which is not included by an entity in the structured data serving as a search object, and the dimension data comprises a dimension name and a dimension value; in the configuration library, each type of entity has a virtual dimension table, the virtual dimension table comprises an entity name column and a dimension column, and virtual dimension data is defined in the virtual dimension table.
Preferably, the virtual dimension data definition device comprises an unidentified word entity determining unit, a virtual dimension table editing unit and a virtual dimension table storage unit; the unidentified word entity determining unit is used for determining the entity of the unidentified word when the unidentified word has the dimension attribute in the structured data; the virtual dimension table editing unit is used for editing the virtual dimension table, if the unrecognized word is a dimension name, a dimension column named by the dimension name is newly built in the virtual dimension table, and a dimension value is assigned to the entity under the dimension name; if the unrecognized word is the dimension value, determining the dimension name corresponding to the dimension value, and then adding the unrecognized word serving as the dimension value to the dimension name of the entity name meeting the condition in the virtual dimension table; and the virtual dimension table storage unit is used for storing the virtual dimension table in a configuration library.
Preferably, the search system includes a system dictionary, a personal dictionary, and a dictionary generating device; the personal dictionary is a dictionary defined by a specific user, and corresponds to a specific user ID; the system dictionary is a dictionary shared by unspecified users; the dictionary generating means updates the defined virtual dimension data to the personal dictionary.
Preferably, the tool further comprises one or more of a synonym table definition device, a glossary definition device and a derivative calculation field table definition device; the synonym table definition device is used for creating or supplementing a synonym table; the glossary definition device is used for creating or supplementing a glossary; the derivative calculation field table definition device is used for creating or supplementing a derivative calculation field table; the created or supplemented synonym table, glossary or derivative calculation field table is updated to the personal dictionary by the dictionary generating device.
The invention also provides a word segmentation system for searching structured data by natural language, which comprises the word meaning accumulation tool; in the process of searching the structured data, the natural language input by the user is firstly segmented by using the personal dictionary, the words which are not recognized by the personal dictionary are segmented by using the system dictionary, and thus the natural language input by the user is translated into the database query language.
Compared with the prior art, the invention has the following notable advantages:
(1) For dimension data whose semantics cannot be accumulated by adding synonyms, defining derivative calculation formulas or defining professional terms in the existing dictionary, the invention accumulates word senses for dimension data vocabulary through user-defined virtual dimension data, in cases where the dimension information of the existing data in the structured database cannot or need not be changed.
(2) The invention provides both a personal dictionary and a system dictionary. Users can accumulate semantics in their personal dictionary promptly and conveniently according to their actual business needs, without waiting for system-level maintainers to add the corresponding words to the system dictionary. For each individual user, this improves both the speed and the precision of word sense accumulation.
(3) Because the personal dictionary and the system dictionary coexist, a given user can still use the semantics accumulated for a given word even if different users have accumulated different semantics for the same word. That is, semantic accumulation in a personal dictionary is not constrained by other users across the network, and the semantics accumulated in a personal dictionary do not affect other users, so word sense accumulation does not require a forced network-wide consensus.
(4) When more than a certain number of personal dictionaries have accumulated the same semantics, the system automatically synchronizes that semantic accumulation to the system dictionary. This increases the speed of semantic accumulation in the system dictionary, and the threshold rule makes the semantics accumulated there better match the needs of most users across the network, which improves the overall accuracy of semantic accumulation in the system dictionary.
(5) Word segmentation uses the personal dictionary first and the system dictionary second. This supports personalized, high-precision semantic accumulation and, because the personal dictionary takes priority, yields more accurate and more comprehensive segmentation results.
(6) Based on the above beneficial effects, the word segmentation method and tool segment natural language search sentences more accurately and help maintain user stickiness.
Drawings
FIG. 1 is a flow chart of the word sense accumulation and word segmentation method and system for structured data oriented to natural language search according to the present invention.
FIG. 2 is a schematic diagram of the search system of the present invention.
FIG. 3 is a diagram illustrating word sense accumulation in a personal dictionary according to the present invention.
FIG. 4 is a schematic diagram of the word sense accumulation tool of the present invention.
FIG. 5 is a schematic view of a search results interface in an example of the invention.
FIG. 6 is a diagram of a custom dimension interface in an example of the invention.
FIG. 7 is a diagram of a custom dimension editing interface in an example of the invention.
FIG. 8 is a diagram of a re-search results interface in an example of the invention.
FIG. 9 is a diagram of a synonym interface in an example of the present invention.
FIG. 10 is a diagram of a professional term configuration interface in an example of the invention.
FIG. 11 is a diagram of a personal dictionary synchronization to system dictionary interface in an example of the present invention.
Detailed Description
It is easily understood that those skilled in the art can conceive various embodiments of the present invention according to its technical solution without departing from its essential spirit. Therefore, the following detailed description and the accompanying drawings merely illustrate the technical aspects of the present invention, and should not be construed as the whole of the present invention or as limiting its technical aspects. Rather, these embodiments are provided so that this disclosure will be thorough and complete. The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments serve to explain the innovative concepts of the invention.
Generally, words that cannot be recognized by a search engine can be classified into the following four categories:
(1) the unrecognizable word is a synonym of a recognizable word; for example, "PE" is a synonym of "price-to-earnings ratio";
(2) the unrecognizable word is a professional term. A professional term generally refers to a word or phrase formed by combining, summarizing and refining several dimensions, indexes or conditions, such as a term that groups net profit, basic earnings, return on net assets, asset-liability ratio, undistributed profit and net profit margin on sales;
(3) the unrecognizable word is a derivative calculation field, such as average price = sum(transaction amount) / sum(transaction volume);
(4) the unrecognizable word is a new dimension name or dimension value in the structured data, such as "Blossoms Say" mentioned in the Background.
In view of the above situation, an embodiment of the present invention proposes a semantic accumulation method for searching structured data in a natural language, in which semantic accumulation is performed by using different methods for the above four types, respectively.
For the case where the unrecognizable word is a synonym of a recognizable word, a synonym table is created or supplemented in the configuration library and updated to the dictionary used for word segmentation. The synonym table has fields such as main word, synonym and label, and entries for all dimension names (such as company short name), dimension values (such as Kweichow Moutai) and index names (such as net profit) that may be searched are generated under its main word field. In the initial stage, a batch of synonyms for the main words is collected and imported in bulk into the corresponding synonym field. When a user searches with a synonym, the search engine looks up the corresponding main word in the dictionary table and searches with that main word. In other words, the semantic recognition capability of the search engine is expanded by widening the range of synonyms of the main words. In daily use, the search system keeps collecting user feedback and keeps adding synonyms for the main words.
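The synonym lookup described above can be sketched as follows. This is only a minimal illustration: the row layout, the function names `build_synonym_index` and `resolve`, and the sample synonym "net income" are assumptions made for the example, not part of the disclosed system.

```python
# Hypothetical synonym table: each row has a main word, its synonyms and a label.
SYNONYM_TABLE = [
    {"main_word": "net profit", "synonyms": ["profit", "net income"], "label": "index name"},
    {"main_word": "company short name", "synonyms": ["company"], "label": "dimension name"},
]


def build_synonym_index(table):
    """Map every synonym, and each main word itself, to its main word."""
    index = {}
    for row in table:
        index[row["main_word"].lower()] = row["main_word"]
        for synonym in row["synonyms"]:
            index[synonym.lower()] = row["main_word"]
    return index


def resolve(word, index):
    """Return the main word for a searched word, or None if it is unrecognized."""
    return index.get(word.lower())


index = build_synonym_index(SYNONYM_TABLE)
```

Here `resolve("profit", index)` yields the main word "net profit", which the search engine would then use for the actual query; a word absent from the table stays unrecognized and can be reported for later accumulation.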
If the unrecognized word is a professional term, the term is added to the term table of the configuration library (the table is created first if it does not exist); that is, a word or phrase that does not exist in the database is defined using a combination of words that already exist in the dictionary. When the user later searches for the term, the search engine finds the corresponding definition and searches with it. In daily use, the search system can keep collecting and adding terms.
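One possible way to apply such a term table is to substitute each term with its definition before segmentation. The sketch below assumes this substitution approach; the term "profitable company", its definition and the function name `expand_terms` are hypothetical and chosen only for illustration.

```python
# Hypothetical term table: term -> description built from words the dictionary already knows.
TERM_TABLE = {
    "profitable company": "net profit greater than 0",
}


def expand_terms(query, term_table):
    """Replace each defined term in the query with its definition."""
    # Longest terms first, so a long term is not broken up by a shorter term it contains.
    for term in sorted(term_table, key=len, reverse=True):
        query = query.replace(term, term_table[term])
    return query
```

After expansion, the query contains only words that the existing dictionary can already segment, so no change to the database is needed.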
If the unrecognized word is a derivative calculation field, establishing a derivative calculation field table in the configuration library, and defining a derivative calculation formula by using basic fields, functions, operators and the like required by the derivative calculation word. And generating words for the derived calculation field names in the table, so that when a user searches for a certain derived field name, the search engine corresponds to the calculation formula to perform real-time calculation and output results. The search system may continually add derivative calculation fields during daily use.
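A derivative calculation field table can be sketched as a mapping from a field name to the basic fields it depends on and a formula. The example below assumes a derived field named "average price" computed as total transaction amount divided by total transaction volume; the column names `amount` and `volume` and all function names are hypothetical.

```python
def average_price(rows):
    """average price = sum(transaction amount) / sum(transaction volume)."""
    total_amount = sum(row["amount"] for row in rows)
    total_volume = sum(row["volume"] for row in rows)
    if total_volume == 0:
        return None  # the formula is undefined when nothing was traded
    return total_amount / total_volume


# Hypothetical derivative calculation field table: name -> dependencies and formula.
DERIVED_FIELDS = {
    "average price": {"depends_on": ["amount", "volume"], "formula": average_price},
}


def evaluate_derived_field(name, rows):
    """Look up the formula for a derived field name and compute it in real time."""
    return DERIVED_FIELDS[name]["formula"](rows)
```

When the segmenter recognizes "average price" in a query, the engine would call `evaluate_derived_field` on the rows selected by the rest of the query rather than reading a stored column.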
If the unrecognizable word is data with dimension attributes in the structured data, such as a dimension name or a dimension value, because semantic accumulation cannot be performed by the three methods, and in order to accumulate semantics without changing the structured data, the embodiment defines virtual dimension data in the configuration library, which is used for segmenting a keyword string submitted by a user for query; the virtual dimension data refers to dimension data which is not included by an entity in the structured data serving as a search object, and the dimension data includes a dimension name and a dimension value. In the configuration library, for each user, each entity type has a virtual dimension table, which includes a name column of a specific instance below the entity and all dimension columns that need to be customized.
The method for defining the virtual dimension data comprises the following steps:
S100, when the unrecognized word has a dimension attribute in the structured data, determining the entity to which the unrecognized word belongs;
S200, if the unrecognized word is a dimension name, creating a dimension column with that name in the virtual dimension table and assigning dimension values to the entities under that dimension name;
S300, if the unrecognized word is a dimension value, determining the dimension name corresponding to that value, and then adding the unrecognized word as a dimension value under that dimension name for the entity names that meet the condition in the virtual dimension table.
By the aid of the user-defined virtual dimension data, word meaning accumulation of dimension data vocabularies is achieved on the premise that existing data dimension information of the structured database cannot be or does not need to be changed.
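The virtual dimension table and the steps S200 and S300 can be sketched as follows. The class and method names, the entity type "bank", the entity names and the dimension "bank type" are all assumptions made for illustration; the structured database itself is never modified.

```python
class VirtualDimensionTable:
    """One table per entity type: an entity name column plus user-defined dimension columns."""

    def __init__(self, entity_type, entity_names):
        self.entity_type = entity_type
        self.columns = []  # dimension names defined so far
        # entity name -> {dimension name: set of dimension values}
        self.rows = {name: {} for name in entity_names}

    def add_dimension_name(self, dimension_name):
        """S200: create a dimension column named after the unrecognized word."""
        if dimension_name not in self.columns:
            self.columns.append(dimension_name)

    def add_dimension_value(self, dimension_name, value, entity_names):
        """S300: add the value under the dimension for the entities that qualify."""
        self.add_dimension_name(dimension_name)
        for name in entity_names:
            self.rows.setdefault(name, {}).setdefault(dimension_name, set()).add(value)

    def entities_with(self, dimension_name, value):
        """Entity names carrying the given virtual dimension value."""
        return sorted(name for name, dims in self.rows.items()
                      if value in dims.get(dimension_name, set()))

    def dictionary_words(self):
        """Dimension names and values to be added to the segmentation dictionary."""
        words = set(self.columns)
        for dims in self.rows.values():
            for values in dims.values():
                words |= values
        return words
```

For instance, a user could create `VirtualDimensionTable("bank", ["Bank A", "Bank B", "Bank C"])` and call `add_dimension_value("bank type", "rural commercial bank", ["Bank A", "Bank C"])`; the words returned by `dictionary_words()` would then be pushed into the dictionary used for segmentation.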
In one embodiment of the invention, a search system includes a system dictionary and a personal dictionary; the personal dictionary is a dictionary defined by a specific user, and corresponds to a specific user ID; the system dictionary is a dictionary common to unspecified users. The synonym table, the term table, the derivative calculation field table and the virtual dimension table used for accumulating word senses by the four methods are updated to the personal dictionary by the dictionary generating device after being defined in the configuration library. In the process of searching the structured data, the natural language input by the user is firstly segmented by using the personal dictionary, the words which are not recognized by the personal dictionary are segmented by using the system dictionary, and thus the natural language input by the user is translated into the database query language.
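The two-stage segmentation can be sketched as below. The embodiment does not specify the matching algorithm, so this sketch substitutes simple forward maximum matching over whitespace-separated tokens; the dictionaries map a phrase to its meaning, and every name and sample entry is hypothetical.

```python
def segment(tokens, dictionary, max_len=4):
    """Forward maximum matching over a token list.

    Returns (phrase, meaning) pairs; meaning is None for unrecognized tokens.
    """
    result, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                result.append((phrase, dictionary[phrase]))
                i += n
                break
        else:
            # No phrase starting at this token is in the dictionary.
            result.append((tokens[i], None))
            i += 1
    return result


def segment_two_stage(query, personal, system):
    """Segment with the personal dictionary first, then pass the leftovers to the system dictionary."""
    final, pending = [], []  # pending: run of tokens the personal dictionary missed
    for phrase, meaning in segment(query.split(), personal):
        if meaning is None:
            pending.append(phrase)
        else:
            final.extend(segment(pending, system))
            pending = []
            final.append((phrase, meaning))
    final.extend(segment(pending, system))
    return final
```

Because the personal dictionary is consulted first, a user who has defined "profit" as "gross profit" gets that meaning even if the system dictionary maps "profit" to "net profit".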
Because the search system has two lexicons, a system dictionary and a personal dictionary, users can accumulate semantics in their personal dictionary promptly and conveniently according to their actual business needs, without relying on system-level maintainers to add the corresponding words to the system dictionary. For each individual user this increases both the speed and the precision of word sense accumulation. Meanwhile, because the two lexicons coexist, a given user can still use the semantics accumulated for a given word even if different users have accumulated different semantics for the same word. That is, semantic accumulation in a personal dictionary is not constrained by other users across the network, and the semantics accumulated in a personal dictionary do not affect other users, so word sense accumulation does not require a forced network-wide consensus.
In one embodiment of the invention, when more than a certain number of personal dictionaries have accumulated the same semantics for a word, the system automatically synchronizes that semantic accumulation to the system dictionary. For example, if three or five users have each defined "profit" as a synonym for "net profit" in their personal dictionaries, the system automatically synchronizes this definition to the system dictionary. The figure of three or five users is the synchronization threshold, which can be set reasonably according to the total number of users or the number of users who have defined the word in question. This increases the speed of semantic accumulation in the system dictionary, and the threshold rule makes the semantics accumulated there better match the needs of most users across the network, which improves the overall accuracy of semantic accumulation in the system dictionary.
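The threshold rule can be sketched as a count over all personal dictionaries. The function name and data layout are hypothetical, and the choice not to overwrite a sense already present in the system dictionary is an assumption of this sketch; the embodiment does not say how such conflicts are handled.

```python
from collections import Counter


def sync_to_system(personal_dictionaries, system_dictionary, threshold):
    """Copy a (word, sense) pair to the system dictionary once more than
    `threshold` personal dictionaries define that word with that sense."""
    counts = Counter()
    for dictionary in personal_dictionaries.values():
        for word, sense in dictionary.items():
            counts[(word, sense)] += 1
    for (word, sense), n in counts.items():
        # Assumption: an existing system-level sense is never overwritten.
        if n > threshold and word not in system_dictionary:
            system_dictionary[word] = sense
    return system_dictionary
```

With a threshold of 2, three users who each define "profit" as "net profit" trigger synchronization, while a single user who defines it as "gross profit" affects only that user's own searches.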
In one embodiment of the invention, a word segmentation method for searching structured data in natural language is provided, which comprises the word sense accumulation method; in the process of searching the structured data, the natural language input by the user is firstly segmented by using the personal dictionary, the words which are not recognized by the personal dictionary are segmented by using the system dictionary, and thus the natural language input by the user is translated into the database query language.
In an embodiment of the present invention, a word sense accumulation tool for searching structured data in natural language is further provided, which includes one or more of a synonym table definition device, a glossary definition device, a derivative calculation field table definition device, and a virtual dimension data definition device;
the synonym table definition device is used for creating or supplementing a synonym table;
the glossary definition device is used for newly building or supplementing a glossary;
the derivative calculation field table definition device is used for creating or supplementing a derivative calculation field table;
the virtual dimension data definition device is used for defining virtual dimension data in the configuration library so as to update the dictionary; the virtual dimension data refers to dimension data which is not included by entities in the structured data serving as a search object, and the dimension data comprises a dimension name and/or a dimension value; in the configuration library, each type of entity has a virtual dimension table, the virtual dimension table comprises an entity name column and a dimension column, and virtual dimension data is defined in the virtual dimension table.
As one embodiment, the virtual dimension data definition apparatus includes an unrecognized word entity determination unit, a virtual dimension table editing unit, and a virtual dimension table storage unit;
the unidentified word entity determining unit is used for determining the entity of the unidentified word when the unidentified word has the dimension attribute in the structured data;
the virtual dimension table editing unit is used for editing the virtual dimension table, if the unrecognized word is a dimension name, a dimension column named by the dimension name is newly built in the virtual dimension table, and a dimension value is assigned to the entity under the dimension name;
if the unrecognized word is the dimension value, determining the dimension name corresponding to the dimension value, and then adding the unrecognized word serving as the dimension value to the dimension name of the entity name meeting the condition in the virtual dimension table;
and the virtual dimension table storage unit is used for storing the virtual dimension table in a configuration library.
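The editing logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class `VirtualDimensionTable`, the functions `define_dimension_name` and `define_dimension_value`, and the entity names "Company A" and "Company B" are hypothetical, and persistence to the configuration library (the storage unit) is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualDimensionTable:
    """Virtual dimension table for one entity type (hypothetical layout)."""
    entity_type: str
    dimensions: set = field(default_factory=set)
    # rows[entity_name][dimension_name] -> set of dimension values
    rows: dict = field(default_factory=dict)

    def add_dimension(self, name):
        """Create a dimension column; adding an existing column is a no-op."""
        self.dimensions.add(name)

    def assign(self, entity_name, dimension_name, value):
        """Assign a dimension value to one entity under a dimension column."""
        if dimension_name not in self.dimensions:
            raise KeyError(f"unknown dimension: {dimension_name}")
        entity_row = self.rows.setdefault(entity_name, {})
        entity_row.setdefault(dimension_name, set()).add(value)

    def entities_with(self, dimension_name, value):
        """Return the entities that carry the given value under the dimension."""
        return sorted(
            entity for entity, row in self.rows.items()
            if value in row.get(dimension_name, set())
        )


def define_dimension_name(table, word, values_by_entity):
    """Case 1: the unrecognized word is a dimension name."""
    table.add_dimension(word)
    for entity_name, value in values_by_entity.items():
        table.assign(entity_name, word, value)


def define_dimension_value(table, word, dimension_name, entity_names):
    """Case 2: the unrecognized word is a dimension value."""
    # Assumption of this sketch: the column is created if it does not exist yet.
    table.add_dimension(dimension_name)
    for entity_name in entity_names:
        table.assign(entity_name, dimension_name, word)


table = VirtualDimensionTable("company")
define_dimension_value(
    table, "agricultural business", "third-level industry",
    ["Company A", "Company B"],
)
define_dimension_name(
    table, "advertisement placement target", {"Company A": "blossoming"},
)
```

Keeping a set of values per cell is a design choice of the sketch; it lets one entity carry several values under the same virtual dimension.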
In one embodiment of the invention, another word sense accumulation tool for searching structured data in natural language is provided, wherein the search system comprises a system dictionary, a personal dictionary and a dictionary generating device; the personal dictionary is a dictionary defined by a specific user, and corresponds to a specific user ID; the system dictionary is a dictionary shared by unspecified users; the virtual dimension data defined by the virtual dimension data definition means is updated into the personal dictionary by the dictionary generation means. The composition and the working process of the virtual dimension data definition device are as described in the foregoing embodiments, and are not described again.
In an embodiment of the present invention, a word segmentation system for searching structured data in natural language is provided, which includes the word sense accumulation tool described in the foregoing embodiments; in the process of searching the structured data, the natural language input by the user is firstly segmented by using the personal dictionary, the words which are not recognized by the personal dictionary are segmented by using the system dictionary, and thus the natural language input by the user is translated into the database query language.
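The two-stage segmentation can be illustrated with the Python sketch below. It assumes a greedy forward longest-match segmenter, which the text does not specify, and the function names `longest_match` and `segment` are hypothetical. Only the spans that the personal dictionary leaves unrecognized are passed to the system dictionary.

```python
def longest_match(text, words):
    """Greedy forward longest-match segmentation.

    Returns a list of (span, matched) pairs; characters that match no
    dictionary word are grouped into unmatched spans.
    """
    max_len = max((len(w) for w in words), default=0)
    spans, pending, i = [], "", 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in words:
                if pending:
                    spans.append((pending, False))
                    pending = ""
                spans.append((text[i:i + length], True))
                i += length
                break
        else:
            # No dictionary word starts here; keep the character as unmatched.
            pending += text[i]
            i += 1
    if pending:
        spans.append((pending, False))
    return spans


def segment(query, personal_words, system_words):
    """Segment with the personal dictionary first, then the system dictionary."""
    result = []
    for span, matched in longest_match(query, personal_words):
        if matched:
            result.append((span, "personal"))
            continue
        # Only what the personal dictionary did not recognize reaches
        # the system dictionary.
        for sub, sub_matched in longest_match(span, system_words):
            sub = sub.strip()
            if sub:
                result.append((sub, "system" if sub_matched else None))
    return result


tokens = segment(
    "agricultural business ROE > 15%",
    personal_words={"agricultural business"},
    system_words={"ROE", ">", "15%"},
)
```

A token whose source is `None` is recognized by neither dictionary and would be the candidate for a user-defined dimension, synonym, term, or derivative calculation field. The sketch matches on raw characters and ignores word boundaries, which is a simplification.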
The invention is further illustrated below by way of an example.
User A searches for "net information difference is greater than 2% and the reject ratio is less than 1.5% for the agricultural business" and finds that "agricultural business" is not recognized, as shown in fig. 5. An agricultural business is one type of bank, alongside national commercial banks, city businesses and the like; the classification in the database is presumably not fine-grained enough, so the problem is solved by means of a "user-defined dimension".
As shown in fig. 6, "agricultural business" is clearly a dimension value that belongs under the "third-level industry" dimension of the "company" entity. Because the database already has the "third-level industry" dimension, it can be selected directly; otherwise, the dimension would need to be newly created. That is, the user determines which type of entity the dimension belongs to, and then selects an existing dimension or creates a new one.
As shown in fig. 7, the abbreviated names of the entities of the selected type and all values under the selected dimension are displayed by default, and a filtering function is provided. For example, the user filters for all companies whose "third-level industry" is "city businesses", finds those that should be subdivided as "agricultural business", and selects these companies one by one, whereupon a virtual dimension value "agricultural business" is automatically added to their "third-level industry"; the user clicks to confirm, and the customization is complete. That is, the specific entities of that entity type to which the dimension value should be assigned are selected, and the result is stored in the personal dictionary.
As shown in fig. 8, the search is automatically re-triggered, and "agricultural business" is now understood. "Agricultural business" is recognized through segmentation with the personal dictionary, and the other words are recognized through segmentation with the system dictionary.
As shown in fig. 9, to investigate these agricultural businesses further, a calculation formula "(target price-current price)/current price/adaptation rate" is added to the search box. After searching, "current price" is not recognized, so "current price" is added as a synonym of "closing price" by means of a user-defined synonym. That is, a synonym is defined and stored in the personal dictionary.
As shown in fig. 10, once all words are recognized, the whole search sentence has become long. For later convenience, "net information difference is greater than ..." is customized as the term "good loan assets", and "(target price-current price)/..." is customized as the calculation index "the value of buy". That is, a term is defined and stored in the personal dictionary, and a derivative calculation field is defined and stored in the personal dictionary.
As shown in fig. 11, user A then searches for "the value of buy of agricultural businesses with good loan assets in the third quarter of this year", and searching is much more efficient. For users other than A, however, "good loan assets", "agricultural business" and "the value of buy" are still unrecognized words. When user B and user C also define "good loan assets" as a term with the definition "net information difference is greater than 2% and the reject ratio is less than 1.5%", the term is synchronized to the system dictionary, and for all users of the system a search for "good loan assets" becomes equivalent to a search for "net information difference is greater than 2% and the reject ratio is less than 1.5%". That is, when the same word has been given the same semantics by enough users (three in this example), it is synchronized to the system dictionary.
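The per-user visibility described in this example can be sketched as a lookup that consults the personal dictionary of the current user before the shared system dictionary. The names `Entry` and `DictionaryStore` and the entry kinds are hypothetical, chosen only for illustration; the definitions are stored as plain text and are not evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    kind: str        # "synonym", "term" or "derived_field"
    definition: str  # main word, condition text, or formula text


class DictionaryStore:
    """Personal dictionaries keyed by user ID, plus one shared system dictionary."""

    def __init__(self):
        self.system = {}    # word -> Entry, shared by all users
        self.personal = {}  # user_id -> {word: Entry}

    def define(self, user_id, word, entry):
        """Store a user-defined word sense in that user's personal dictionary."""
        self.personal.setdefault(user_id, {})[word] = entry

    def lookup(self, user_id, word):
        """Personal dictionary first; fall back to the system dictionary."""
        entry = self.personal.get(user_id, {}).get(word)
        if entry is None:
            entry = self.system.get(word)
        return entry


store = DictionaryStore()
store.define("A", "current price", Entry("synonym", "closing price"))
store.define("A", "good loan assets", Entry(
    "term",
    "net information difference is greater than 2% "
    "and the reject ratio is less than 1.5%",
))
store.define("A", "the value of buy", Entry(
    "derived_field",
    "(target price-current price)/current price/adaptation rate",
))
```

With this layout, `store.lookup("A", "good loan assets")` returns the stored term, while the same lookup for any other user returns `None` until the word sense reaches the system dictionary.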
The invention is further illustrated below by way of another example.
The user enters "blossoming sponsor ROE > 15%" in the search engine input box, wanting to find companies that meet the condition, where ROE is the English abbreviation for net asset profitability. In the initial search result, "blossoming sponsor" is not recognized, i.e., it is not found in the dictionary; ROE is recognized incorrectly, being understood as the full name of a security, "moran micro". For such unrecognized or misrecognized words, the dictionary of the search engine can be updated by the following method, improving its semantic recognition capability.
Step one, after judging that "blossoming" is a word with a dimension attribute, the user selects "blossoming" in the semantic analysis and selects "user-defined dimension" in the pop-up menu.
Step two, determining that the entity type to which "blossoming" belongs is "company" rather than "individual".
Step three, judging that "blossoming" is a dimension value rather than a dimension name.
Step four, since no existing dimension fits, selecting "new dimension" and setting the dimension name to "advertisement placement target".
Step five, searching the Internet for companies that sponsor "blossoming", such as Haitian flavour industry, Hailan home and Shanghai group. These companies are assigned the value "blossoming" under "advertisement placement target", and the result is stored in the virtual dimension table.
Step six, an automatic re-search is triggered; "blossoming" is now recognized, while "sponsor" and "ROE" are still not correctly recognized.
Step seven, clicking "sponsor", selecting "configure synonym", and entering "advertisement" in the configuration input box, i.e., setting "advertisement" as the synonym of "sponsor". The input is associatively matched to the just-stored virtual dimension "advertisement placement target", which the user selects and confirms; the relation is saved in the synonym table.
Step eight, an automatic re-search is triggered; only "ROE" remains incorrectly recognized, and "configure synonym" is selected for it.
Step nine, entering "net asset profitability" in the configuration input box, i.e., setting ROE as a synonym of "net asset profitability" (the main word). The suggestions "net asset profitability-weighting" and "net asset profitability-amortization" appear; the latter is selected and saved in the synonym table.
Step ten, an automatic re-search is triggered; all word senses are now correctly recognized and the search result is obtained: the companies that meet the condition are Haitian flavour industry and Hailan home.
In this example, steps three and four define a virtual dimension, "advertisement placement target". This solves the problem that a word (entity dimension) cannot be defined on the basis of the existing data description when the physical library does not contain it. Subsequently, more known advertisement placement targets of known companies can be added under this dimension, gradually enriching the data under it.
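How a query that uses the virtual dimension might be translated into a database query can be sketched with Python's built-in `sqlite3` module. Everything in the sketch is an assumption made for illustration: the table names `company` and `company_virtual_dimension`, the placement of the physical library and the configuration library in one in-memory database, and the company names and ROE figures, which are invented placeholders rather than data from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical library: the structured data being searched (made-up values).
    CREATE TABLE company (name TEXT PRIMARY KEY, roe REAL);
    INSERT INTO company VALUES
        ('Company A', 0.21), ('Company B', 0.18),
        ('Company C', 0.09), ('Company D', 0.30);

    -- Configuration library: virtual dimension table of the "company" entity,
    -- with an entity name column and one virtual dimension column.
    CREATE TABLE company_virtual_dimension (
        entity_name TEXT,
        "advertisement placement target" TEXT
    );
    INSERT INTO company_virtual_dimension VALUES
        ('Company A', 'blossoming'),
        ('Company B', 'blossoming'),
        ('Company C', 'blossoming');
""")


def search(connection, ad_target, min_roe):
    """Companies with the given virtual dimension value and ROE above min_roe."""
    sql = """
        SELECT c.name
        FROM company AS c
        JOIN company_virtual_dimension AS v ON v.entity_name = c.name
        WHERE v."advertisement placement target" = ? AND c.roe > ?
        ORDER BY c.name
    """
    return [row[0] for row in connection.execute(sql, (ad_target, min_roe))]


result = search(conn, "blossoming", 0.15)
```

The join shows why the virtual dimension needs no change to the physical library: the dimension lives only in the configuration table and is combined with the physical data at query time.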
In steps eight and nine, "net asset profitability-amortization" is defined as the main word of ROE, whereas most people would probably take ROE to mean "net asset profitability-weighting". Because this correspondence exists only in the user's personal dictionary, other users are not affected, which solves the compatibility problem of users' non-uniform understanding.
If more than 3 users select "ROE" and "net asset profitability-amortization" as a synonym relation, the system automatically synchronizes the relation to the system dictionary, so that for all users a search for ROE is taken as a search for "net asset profitability-amortization"; this removes the dependence on maintenance personnel to handle such cases in time. Of course, it is quite possible that more people have selected "net asset profitability-weighting", in which case ROE has two senses for all users; the technique for resolving such word ambiguity is not described here.
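The threshold-based synchronization can be sketched as follows. The class `SenseRegistry` and its fields are hypothetical; the threshold N is treated as a configurable parameter, and a word sense is copied to the system dictionary once more than N personal dictionaries define it. The system dictionary keeps a set of senses per word, so that two senses of the same word can coexist, as noted above.

```python
from collections import defaultdict


class SenseRegistry:
    """Counts identical word senses across personal dictionaries."""

    def __init__(self, threshold):
        self.threshold = threshold          # the N of "more than N"
        self.personal = defaultdict(dict)   # user_id -> {word: sense}
        self.system = defaultdict(set)      # word -> set of synchronized senses

    def define(self, user_id, word, sense):
        """Record a personal definition and synchronize if the threshold is passed."""
        self.personal[user_id][word] = sense
        supporters = sum(
            1 for senses in self.personal.values() if senses.get(word) == sense
        )
        if supporters > self.threshold:
            self.system[word].add(sense)
        return supporters


registry = SenseRegistry(threshold=3)
for user_id in ("U1", "U2", "U3"):
    registry.define(user_id, "ROE", "net asset profitability-amortization")
synced_after_three = set(registry.system.get("ROE", set()))

registry.define("U4", "ROE", "net asset profitability-amortization")
synced_after_four = set(registry.system.get("ROE", set()))
```

With `threshold=3` the relation is synchronized only when the fourth user defines it; a deployment that wants three users to suffice would set the threshold to 2.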
The invention automatically generates the relationship between tables by analyzing and comparing all the field values of the tables in the relational database; and displaying the inter-table relation of the database in a model suggestion table mode according to the obtained incidence relation. The invention aims to obtain the relationship between tables of the database through the data association analysis of the unknown database, so that a user can clearly know the table structure of the unknown database, and the use and the utilization of the database are facilitated. The display device displays the analysis information of the relationships among the tables and displays the relationships among the tables of the database in the form of the model suggestion table.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes described in a single embodiment or with reference to a single figure, for the purpose of streamlining the disclosure and aiding in the understanding of various aspects of the invention by those skilled in the art. However, the present invention should not be construed such that the features included in the exemplary embodiments are all the essential technical features of the patent claims.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
It should be understood that the devices, modules, units, components, etc. included in the system of one embodiment of the present invention may be adaptively changed to be provided in an apparatus or system different from that of the embodiment. The different devices, modules, units or components comprised by the system of an embodiment may be combined into one device, module, unit or component or may be divided into a plurality of sub-devices, sub-modules, sub-units or sub-components.
The means, modules, units or components in the embodiments of the present invention may be implemented in hardware, or may be implemented in software running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that embodiments in accordance with the present invention may be practiced using a microprocessor or Digital Signal Processor (DSP). The present invention may also be embodied as a computer program product or computer-readable medium for performing a portion or all of the methods described herein.

Claims (16)

1. A word sense accumulation method for structured data searched by natural language is characterized in that virtual dimension data are defined in a configuration library, and a dictionary is updated to be used for segmenting a keyword string submitted by a user; the virtual dimension data refers to dimension data which is not included by an entity in the structured data serving as a search object, and the dimension data comprises a dimension name and a dimension value; wherein,
in the configuration library, each type of entity has a virtual dimension table, the virtual dimension table comprises an entity name column and a dimension column, and virtual dimension data is defined in the virtual dimension table.
2. The method for word sense accumulation for structured data oriented to natural language searches of claim 1 in which the method for defining the virtual dimension data is:
when the unrecognized word has the dimension attribute in the structured data, determining an entity corresponding to the unrecognized word;
if the unrecognized word is a dimension name, establishing a dimension column named by the dimension name in a virtual dimension table, and assigning a dimension value to the entity under the dimension name;
and if the unrecognized word is the dimension value, determining the dimension name corresponding to the dimension value, and then adding the unrecognized word serving as the dimension value to the dimension name of the entity name meeting the condition in the virtual dimension table.
3. The word sense accumulation method for structured data oriented natural language search of claim 1 or 2 wherein the search system comprises a system dictionary and a personal dictionary; wherein,
the personal dictionary is a dictionary defined by a specific user, and corresponds to the ID of the specific user;
the system dictionary is a dictionary shared by unspecified users;
the defined virtual dimension data is updated to the personal dictionary.
4. The method of claim 3, wherein word senses are defined in the configuration repository and updated to the personal dictionary by one or more of the following methods:
method one: creating or supplementing a synonym table;
method two: creating or supplementing a glossary;
method three: creating or supplementing a derivative calculation field table.
5. A word sense accumulation method for structured data of natural language search is characterized in that a search system comprises a system dictionary and a personal dictionary; wherein,
the personal dictionary is a dictionary defined by a specific user, and corresponds to the ID of the specific user;
the system dictionary is a dictionary shared by unspecified users;
defining word senses in the configuration library and updating to the personal dictionary in one or more of the following ways:
method one: creating or supplementing a synonym table;
method two: creating or supplementing a glossary;
method three: creating or supplementing a derivative calculation field table;
method four: defining virtual dimension data; the virtual dimension data refers to dimension data which is not included by entities in the structured data serving as a search object, and the dimension data comprises a dimension name and/or a dimension value; wherein,
in the configuration library, each type of entity has a virtual dimension table, the virtual dimension table comprises an entity name column and a dimension column, and virtual dimension data is defined in the virtual dimension table.
6. The method for word sense accumulation for structured data oriented to natural language searches of claim 5 wherein the method for defining the virtual dimension data is:
when the unrecognized word has the dimension attribute in the structured data, determining an entity corresponding to the unrecognized word;
if the unrecognized word is a dimension name, establishing a dimension column named by the dimension name in a virtual dimension table, and assigning a dimension value to the entity under the dimension name;
and if the unrecognized word is the dimension value, determining the dimension name corresponding to the dimension value, and then adding the unrecognized word serving as the dimension value to the dimension name of the entity name meeting the condition in the virtual dimension table.
7. A word segmentation method for searching structured data in natural language, which is characterized by comprising the word sense accumulation method of any one of claims 3 to 6;
in the process of searching the structured data, the natural language input by the user is firstly segmented by using the personal dictionary, the words which are not recognized by the personal dictionary are segmented by using the system dictionary, and thus the natural language input by the user is translated into the database query language.
8. A word segmentation method for searching structured data in natural language, which is characterized by comprising the word sense accumulation method of any one of claims 3 to 6;
when more than N personal dictionaries define the same word sense for the same word, the word sense of the word is synchronized from the personal dictionary to the system dictionary.
9. A word sense accumulation tool for searching structured data in natural language is characterized by comprising a virtual dimension data definition device, a word sense accumulation module and a word sense search module, wherein the virtual dimension data definition device is used for defining virtual dimension data in a configuration library and further updating a dictionary; the virtual dimension data refers to dimension data which is not included by an entity in the structured data serving as a search object, and the dimension data comprises a dimension name and a dimension value; wherein,
in the configuration library, each type of entity has a virtual dimension table, the virtual dimension table comprises an entity name column and a dimension column, and virtual dimension data is defined in the virtual dimension table.
10. The word sense accumulation tool for structured data oriented natural language search of claim 9 wherein the virtual dimension data definition means comprises an unrecognized word entity determining unit, a virtual dimension table editing unit, and a virtual dimension table storage unit;
the unrecognized word entity determining unit is used for determining the entity of the unrecognized word when the unrecognized word has the dimension attribute in the structured data;
the virtual dimension table editing unit is used for editing the virtual dimension table, if the unrecognized word is a dimension name, a dimension column named by the dimension name is newly built in the virtual dimension table, and a dimension value is assigned to the entity under the dimension name;
if the unrecognized word is the dimension value, determining the dimension name corresponding to the dimension value, and then adding the unrecognized word serving as the dimension value to the dimension name of the entity name meeting the condition in the virtual dimension table;
and the virtual dimension table storage unit is used for storing the virtual dimension table in a configuration library.
11. The word sense accumulation tool for structured data oriented natural language search of claim 9 or 10, wherein the search system comprises a system dictionary, a personal dictionary, and dictionary generating means; wherein,
the personal dictionary is a dictionary defined by a specific user, and corresponds to the ID of the specific user;
the system dictionary is a dictionary shared by unspecified users;
the dictionary generating means updates the defined virtual dimension data to the personal dictionary.
12. The word sense accumulation tool for structured data oriented natural language searches of claim 11 further comprising one or more of a synonym table definition means, a glossary definition means, and a derived computation field table definition means;
the synonym table definition device is used for creating or supplementing a synonym table;
the glossary definition device is used for newly building or supplementing a glossary;
the derivative calculation field table definition device is used for creating or supplementing a derivative calculation field table;
the newly created or supplemented synonym table, glossary, or derivative calculation field table is updated to the personal dictionary by the dictionary generating means.
13. A word sense accumulation tool for searching structured data in natural language is characterized in that a search system comprises a system dictionary, a personal dictionary and a dictionary generating device; wherein,
the personal dictionary is a dictionary defined by a specific user, and corresponds to the ID of the specific user;
the system dictionary is a dictionary shared by unspecified users;
the system also comprises one or more of a synonym table definition device, a term table definition device, a derivative calculation field table definition device and a virtual dimension data definition device; wherein,
the synonym table definition device is used for creating or supplementing a synonym table;
the glossary definition device is used for newly building or supplementing a glossary;
the derivative calculation field table definition device is used for creating or supplementing a derivative calculation field table;
the virtual dimension data definition device is used for defining virtual dimension data in the configuration library; the virtual dimension data refers to dimension data which is not included by an entity in the structured data serving as a search object, and the dimension data comprises a dimension name and a dimension value; wherein,
in a configuration library, each type of entity is provided with a virtual dimension table, the virtual dimension table comprises an entity name column and a dimension column, and virtual dimension data are defined in the virtual dimension table;
the newly created or supplemented synonym table, glossary or derivative calculation field table, or the defined virtual dimension data is updated to the personal dictionary by the dictionary generating device.
14. The word sense accumulation tool for structured data oriented natural language search of claim 13 wherein the virtual dimension data definition means comprises an unrecognized word entity determining unit, a virtual dimension table editing unit, and a virtual dimension table storage unit;
the unrecognized word entity determining unit is used for determining the entity of the unrecognized word when the unrecognized word has the dimension attribute in the structured data;
the virtual dimension table editing unit is used for editing the virtual dimension table, if the unrecognized word is a dimension name, a dimension column named by the dimension name is newly built in the virtual dimension table, and a dimension value is assigned to the entity under the dimension name;
if the unrecognized word is the dimension value, determining the dimension name corresponding to the dimension value, and then adding the unrecognized word serving as the dimension value to the dimension name of the entity name meeting the condition in the virtual dimension table;
and the virtual dimension table storage unit is used for storing the virtual dimension table in a configuration library.
15. A word sense accumulation tool for natural language search structured data, comprising the word sense accumulation tool of any one of claims 11-14;
when more than N personal dictionaries define the same word sense for the same word, the word sense of the word is synchronized from the personal dictionary to the system dictionary.
16. A word segmentation system for searching structured data in natural language, comprising the word sense accumulation tool of claim 15;
in the process of searching the structured data, the natural language input by the user is firstly segmented by using the personal dictionary, the words which are not recognized by the personal dictionary are segmented by using the system dictionary, and thus the natural language input by the user is translated into the database query language.
CN201911372759.1A 2019-12-27 2019-12-27 Word meaning accumulation and word segmentation method, tool and system for structured data searched by natural language Pending CN113051898A (en)

Publications (1)

Publication Number Publication Date
CN113051898A true CN113051898A (en) 2021-06-29

Family

ID=76506062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911372759.1A Pending CN113051898A (en) 2019-12-27 2019-12-27 Word meaning accumulation and word segmentation method, tool and system for structured data searched by natural language

Country Status (1)

Country Link
CN (1) CN113051898A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination