CN114138945A - Entity identification method and device in data analysis - Google Patents

Entity identification method and device in data analysis

Publication number
CN114138945A
Authority
CN
China
Prior art keywords
matching
entity
dictionary
natural language
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210058350.8A
Other languages
Chinese (zh)
Other versions
CN114138945B (en
Inventor
田有朋
刘海波
李俊
黄亚东
王小卫
朱文嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210058350.8A
Publication of CN114138945A
Application granted
Publication of CN114138945B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

An embodiment of the specification provides an entity identification method and device in data analysis. The method comprises: acquiring a natural language text input by a user, wherein the natural language text expresses the user's data analysis requirement on target data; acquiring entity dictionaries of multiple categories constructed based on a historical natural language corpus set and the target data, wherein the multiple categories are related to data dimensions and/or data analysis intents of the target data; and, for the characters included in the natural language text, matching the characters against the words included in the entity dictionaries of the multiple categories, and taking each matching result as a recognized entity of the corresponding category. The method can meet the requirements of accuracy and interpretability in data analysis.

Description

Entity identification method and device in data analysis
Technical Field
One or more embodiments of the present description relate to the field of computers, and more particularly, to a method and apparatus for entity identification in data analysis.
Background
Currently, data analysis requirements on databases are flexible and numerous. To satisfy a data analysis requirement, a professional must convert it into a Structured Query Language (SQL) statement that a computer can understand; the computer then performs the corresponding data analysis on the database by executing the SQL statement.
Because the number of professionals is limited, the data analysis requirements of the many non-professional users usually have to be converted into SQL statements by professionals. This process often involves a long wait, so the requirements cannot be met quickly. It is therefore desirable for a computer to receive natural language text in which a user expresses a data analysis requirement, perform entity recognition on that text, and understand the requirement based on the recognized entities.
In the field of data analysis, results are required to be 100% accurate. Correspondingly, entity identification in data analysis is required to be 100% accurate, and the identification results must be interpretable. Entity identification methods in the prior art cannot meet these requirements for accuracy and interpretability.
Disclosure of Invention
One or more embodiments of the present specification describe a method and apparatus for entity identification in data analysis that can meet the requirements of accuracy and interpretability in data analysis.
In a first aspect, a method for identifying an entity in data analysis is provided, and the method includes:
acquiring a natural language text input by a user, wherein the natural language text is used for expressing the data analysis requirement of the user on target data;
acquiring entity dictionaries of multiple categories constructed based on a historical natural language corpus set and the target data, wherein the multiple categories are related to data dimensions and/or data analysis intents of the target data;
and, for the characters included in the natural language text, matching the characters against the words included in the entity dictionaries of the plurality of categories, and taking each matching result as a recognized entity of the corresponding category.
In one possible embodiment, the data analysis requirement includes querying a first range of the target data and performing statistical analysis in a first manner on the first range of the target data.
In one possible embodiment, the obtaining a plurality of categories of entity dictionaries constructed based on a set of historical natural language corpora and the target data includes:
acquiring a global dictionary constructed based on a historical natural language corpus set;
acquiring a proprietary dictionary constructed based on metadata information and data information of a target database to which the target data belongs; the global dictionary and the proprietary dictionary together constitute an entity dictionary for the plurality of categories.
In one possible embodiment, the plurality of categories include at least one of a time category, a unit category, an intention category, a dimension category, and a dimension value category; the dimension category corresponds to a field name in a target database to which the target data belongs, and the dimension value category corresponds to a specific value of a field in the target database.
Further, each word in the proprietary dictionary is stored in a triplet form that includes a name of a data table, a category name, and a field name.
In one possible embodiment, the matching of the characters against the words included in the entity dictionaries of the plurality of categories includes:
performing, over multiple rounds of iteration, matching of a current character against the words in the entity dictionaries of the plurality of categories; in each round, the current character is matched against the words in the entity dictionaries; if the matching succeeds, the current round ends; if the matching fails, the current character is combined with the next character and the combined character string is matched against the words in the entity dictionaries, and this is repeated until the matching succeeds, whereupon the current round ends.
Further, the matching the combined character string with the words included in the entity dictionary includes:
if the combined character string is identical to a target word in an entity dictionary, confirming that the target word is an exact matching result of the character string;
if the combined character string coincides with only part of a target word in an entity dictionary and the character string is a prefix of the target word, confirming that the target word is a prefix matching result of the character string; and if the character string has both an exact matching result and a prefix matching result, selecting the exact matching result as the final matching result.
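The iterative matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary contents, the names `ENTITY_DICT`, `match_string`, and `recognize`, the Chinese strings (assumed renderings of the translated examples), and the policy of extending the string for as long as it still has an exact or prefix match are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy dictionary: word -> category.
ENTITY_DICT: Dict[str, str] = {
    "北京市": "dimension_value",   # "Beijing city"
    "男": "dimension_value",       # "male"
    "城市": "dimension",           # "city"
    "交易金额": "dimension",       # "transaction amount"
    "昨天": "time",                # "yesterday"
}

Match = Tuple[str, str, str]  # (dictionary word, category, "exact" | "prefix")


def match_string(s: str, dictionary: Dict[str, str]) -> Optional[Match]:
    """Match one string; an exact match is preferred over a prefix match."""
    if s in dictionary:
        return s, dictionary[s], "exact"
    for word, category in dictionary.items():
        if word.startswith(s):
            return word, category, "prefix"
    return None


def recognize(text: str, dictionary: Dict[str, str]) -> List[Match]:
    """Scan the text, growing the current string one character at a time."""
    entities: List[Match] = []
    i = 0
    while i < len(text):
        best: Optional[Match] = None
        end = i
        j = i + 1
        # Assumed policy: keep combining characters while a match still exists.
        while j <= len(text):
            hit = match_string(text[i:j], dictionary)
            if hit is None:
                break
            best, end = hit, j
            j += 1
        if best is None:
            i += 1          # no dictionary word starts here; move on
        else:
            entities.append(best)
            i = end         # continue after the matched span
    return entities
```

With this toy dictionary, `recognize("北京", ENTITY_DICT)` reports "北京市" as a prefix match, while a string that is itself a dictionary word is always reported as an exact match.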
Further, in each iteration, before matching the current character with a word included in the entity dictionary, the method further includes:
determining whether consecutive numerals exist in the natural language text;
and if consecutive numerals exist, treating the consecutive numerals as a single character, using them as the current character, and performing the matching of the current character against the words in the entity dictionary.
Further, performing the matching of the current character against the words in the entity dictionary with the consecutive numerals as the current character includes:
if the consecutive numerals include Chinese numerals and are accompanied by a Chinese unit, converting the Chinese numerals into Arabic numerals;
combining the Arabic numerals with the Chinese unit, and then matching the result against the words in the entity dictionary.
Further, matching the combination of the Arabic numerals and the Chinese unit against the words included in the entity dictionary includes:
combining the Arabic numerals with the Chinese unit and performing numeric generalization to obtain a first generalization result, so that the specific numeric value does not affect the matching;
and matching the first generalization result against the words included in the entity dictionary.
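A rough sketch of the numeral handling described above is given below. Everything specific in it is an assumption made for illustration: the unit list `UNITS`, the placeholder `<NUM>`, the generalized dictionary entry in `TIME_PATTERNS`, and the simple numeral converter, which only covers values built from the digits and the multipliers ten, hundred, and thousand.

```python
import re
from typing import List, Optional, Tuple

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CN_MULTIPLIERS = {"十": 10, "百": 100, "千": 1000}

UNITS = ("天", "周", "月", "年")  # hypothetical units: day, week, month, year
CN_NUM_WITH_UNIT = re.compile(
    "([零一二两三四五六七八九十百千]+)(" + "|".join(UNITS) + ")"
)

# Hypothetical generalized dictionary entry: "last <NUM> days" -> time category.
TIME_PATTERNS = {"最近<NUM>天": "time"}


def chinese_to_int(numeral: str) -> int:
    """Convert a simple Chinese numeral such as 三十 (30) to an int."""
    total, digit = 0, 0
    for ch in numeral:
        if ch in CN_DIGITS:
            digit = CN_DIGITS[ch]
        else:
            # A bare multiplier such as 十 counts as 1 x 10.
            total += (digit or 1) * CN_MULTIPLIERS[ch]
            digit = 0
    return total + digit


def to_arabic(text: str) -> str:
    """Rewrite Chinese numerals that carry a unit as Arabic numerals."""
    return CN_NUM_WITH_UNIT.sub(
        lambda m: f"{chinese_to_int(m.group(1))}{m.group(2)}", text
    )


def generalize(text: str) -> str:
    """Replace every digit run with a placeholder so the value is ignored."""
    return re.sub(r"[0-9]+", "<NUM>", text)


def match_generalized(text: str) -> Optional[Tuple[str, str, List[int]]]:
    """Return (generalized form, category, extracted numbers), or None."""
    arabic = to_arabic(text)
    key = generalize(arabic)
    category = TIME_PATTERNS.get(key)
    if category is None:
        return None
    numbers = [int(n) for n in re.findall(r"[0-9]+", arabic)]
    return key, category, numbers
```

Because the dictionary stores only the generalized form, one entry covers "last 7 days", "last thirty days", and so on, while the concrete number is kept alongside the match for later use.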
Further, the method further comprises:
if the matching result indicates that the consecutive numerals correspond to an entity of the time category, determining whether the high-order part of the time in the matching result is complete;
and if the high-order part of the time is determined to be incomplete, completing it according to the current time.
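One way to read the completion step is sketched below, assuming that the "high-order part" means the year and then the month of a date. The function name `complete_time` and its parameters are hypothetical, and the reference date is passed in explicitly so the behaviour is reproducible.

```python
from datetime import date
from typing import Optional


def complete_time(year: Optional[int], month: Optional[int], day: int,
                  today: date) -> date:
    """Fill missing high-order fields (year, then month) from `today`."""
    if year is None:
        year = today.year      # assumption: a missing year is the current year
    if month is None:
        month = today.month    # assumption: a missing month is the current month
    return date(year, month, day)
```

For example, with a current date of 2020-09-05, a matched "May 21" with no year would be completed to 2020-05-21 under these assumptions.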
In a second aspect, an entity identification apparatus in data analysis is provided, the apparatus comprising:
a first obtaining unit, configured to obtain a natural language text input by a user, wherein the natural language text is used for expressing the data analysis requirement of the user on target data;
a second obtaining unit, configured to obtain entity dictionaries of multiple categories, which are constructed based on a historical natural language corpus set and the target data, and are related to data dimensions and/or data analysis intents of the target data;
and a matching unit, configured to, for the characters included in the natural language text obtained by the first obtaining unit, match the characters against the words included in the entity dictionaries of the plurality of categories obtained by the second obtaining unit, and take each matching result as a recognized entity of the corresponding category.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and device provided by the embodiments of this specification, a natural language text input by a user is first obtained, the natural language text expressing the user's data analysis requirement on target data; entity dictionaries of multiple categories, constructed based on a historical natural language corpus set and the target data, are then obtained, the multiple categories being related to data dimensions and/or data analysis intents of the target data; finally, for the characters included in the natural language text, the characters are matched against the words included in the entity dictionaries of the multiple categories, and each matching result is taken as a recognized entity of the corresponding category. Thus, when faced with a natural language text input by a user, the embodiments of this specification do not perform entity recognition with a conventional deep learning method. Instead, they match the characters of the text against the words in the entity dictionaries and take the matching results as recognized entities of the corresponding categories. Because the entity dictionaries are constructed based on the historical natural language corpus set and the target data, and the categories are related to data dimensions and/or data analysis intents of the target data, the recognized entities and their categories help in understanding the user's data analysis requirement, and the recognition process can meet the requirements of accuracy and interpretability in data analysis.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 is a schematic diagram illustrating an implementation scenario of another embodiment disclosed in the present specification;
FIG. 3 illustrates a flow diagram of a method for entity identification in data analysis, according to one embodiment;
FIG. 4 illustrates a diagram of the abstraction process of an entity dictionary, according to one embodiment;
FIG. 5 illustrates a process diagram for building an entity dictionary, according to one embodiment;
FIG. 6 illustrates a process diagram for building a global dictionary and a proprietary dictionary, according to one embodiment;
FIG. 7 illustrates a matching process diagram according to one embodiment;
FIG. 8 shows a schematic diagram of a matching process according to another embodiment;
FIG. 9 shows a schematic block diagram of an entity recognition apparatus in data analysis according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. This implementation scenario involves entity identification in data analysis. Data analysis refers to analyzing a large amount of collected data with appropriate statistical methods, and summarizing, understanding, and digesting the data so as to make the fullest use of it. Data analysis is the process of studying and summarizing data in detail to extract useful information and form conclusions. In the embodiments of this specification, the storage manner of the data to be analyzed is not limited: the data may be stored in a database, or in another form such as an Excel table. A database comprises a plurality of data tables, each data table comprises a plurality of fields, the fields correspond to columns, and each field has a field name and a column of field values. Referring to fig. 1, in order to meet a user's data analysis requirement quickly, an embodiment of this specification proposes a solution in which a computer receives a natural language text input by the user and performs entity recognition on it, so as to understand the data analysis requirement based on the recognized entities. For example, the user inputs the natural language text "transaction amount of men in each city yesterday". The user may type the text directly, or may speak, in which case the speech is converted into the text. The text expresses the user's data analysis requirement on the target data.
Taking the case where the data to be analyzed is stored in a target database as an example, entity recognition on this natural language text yields a plurality of entities, including: the time-category entity "yesterday (2020-09-04)", where "2020-09-04", i.e., September 4, 2020, is the specific date corresponding to "yesterday"; "city", which corresponds to a field name in the target database; "male (sex)", where "male" corresponds to a specific field value under the field named "sex" in the target database; and "transaction amount", which corresponds to a field name in the target database. Based on the recognized entities, the computer can understand the user's data analysis requirement and then give a data analysis result.
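How a relative expression such as "yesterday" might be resolved to the concrete date shown above can be sketched in a few lines. The offset table `RELATIVE_DAY_OFFSETS` and the function `resolve_relative_day` are hypothetical names, and the reference date is supplied by the caller.

```python
from datetime import date, timedelta

# Hypothetical offsets (in days) for a few relative-day expressions.
RELATIVE_DAY_OFFSETS = {"today": 0, "yesterday": -1}


def resolve_relative_day(expression: str, today: date) -> str:
    """Render a relative-day expression together with its concrete date."""
    resolved = today + timedelta(days=RELATIVE_DAY_OFFSETS[expression])
    return f"{expression} ({resolved.isoformat()})"
```

Called with "yesterday" and a reference date of 2020-09-05, this produces the form used in the example, "yesterday (2020-09-04)".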
The embodiments of this specification support natural language queries against a database, so that data in the database can be queried and analyzed through natural language. Natural language query is implemented on the basis of entity recognition, that is, recognizing entities with specific meanings (such as time entities) in a natural language text and converting a character sequence into an entity sequence.
The data analysis result in the embodiments of this specification may be presented in various forms, including but not limited to text, charts, and voice. For example, fig. 1 presents the "transaction amount of men in each city yesterday" as a pie chart. The four cities presented are Beijing, Nanjing, Guangzhou, and Hangzhou. The pie chart includes four sectors, each corresponding to the transaction amount of one city: 2.00 for Beijing, 8.00 for Nanjing, 6.00 for Guangzhou, and 6.00 for Hangzhou. The area of each sector visually reflects the size of that city's transaction amount.
The entity recognition method provided by the embodiments of this specification constructs entity dictionaries of multiple categories and matches each character in the natural language text input by the user against the words in the entity dictionaries, and can thereby meet the data analysis field's requirements for both accuracy and interpretability.
Fig. 2 is a schematic view of an implementation scenario of another embodiment disclosed in this specification. This implementation scenario involves entity identification in data analysis. Referring to FIG. 2, the user enters the natural language text "the top ten of the payment amount in Beijing in the last thirty days", and the target database includes the field names user, city, amt, and time. The user field has two values, 001 and 002; the city field has two values, Beijing city and Hangzhou city; the amt field has two values, 20 and 10; and the time field has two values, 20200521 and 20200522. It will be appreciated that databases typically store large amounts of data, and the figure depicts only part of the target database by way of example. Entity recognition yields four entities: "0501-0530", whose category is Time, i.e., the time category; "Beijing", whose category is Col_Value, i.e., the dimension value category, with city as the corresponding field name; "payment amount", whose category is Measure, i.e., the dimension category, with amt as the corresponding field name; and "Top(10, desc)", whose category is Intent, i.e., the intention category, which denotes the top 10 in descending order. In the embodiments of this specification, every entity recognized from the natural language text has a corresponding category, and these categories help reflect the user's data analysis requirement.
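The recognition result for the FIG. 2 example can be pictured as a small list of typed records. The sketch below is only one possible representation: the `Entity` class, its field names, and the helper `by_category` are hypothetical, while the category labels and field names are taken from the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entity:
    text: str                     # normalized entity text
    category: str                 # Time, Col_Value, Measure, or Intent
    field: Optional[str] = None   # database field name, where one applies


# The four entities of the FIG. 2 example.
FIG2_ENTITIES: List[Entity] = [
    Entity("0501-0530", "Time"),
    Entity("Beijing", "Col_Value", "city"),
    Entity("payment amount", "Measure", "amt"),
    Entity("Top(10, desc)", "Intent"),
]


def by_category(entities: List[Entity]) -> Dict[str, List[Entity]]:
    """Group recognized entities by category for downstream interpretation."""
    grouped: Dict[str, List[Entity]] = {}
    for entity in entities:
        grouped.setdefault(entity.category, []).append(entity)
    return grouped
```

Keeping the category and field name with each entity is what makes the result interpretable: every recognized span can be traced back to a dictionary category and, where applicable, a database field.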
Fig. 3 shows a flow diagram of an entity identification method in data analysis according to an embodiment, which may be based on the implementation scenarios shown in fig. 1 or fig. 2. As shown in fig. 3, the entity identification method in data analysis in this embodiment includes the following steps: step 31, acquiring a natural language text input by a user, wherein the natural language text is used for expressing the data analysis requirement of the user on target data; step 32, acquiring entity dictionaries of multiple categories constructed based on a historical natural language corpus set and the target data, wherein the multiple categories are related to data dimensions and/or data analysis intents of the target data; and step 33, for the characters included in the natural language text, matching the characters against the words included in the entity dictionaries of the plurality of categories, and taking each matching result as a recognized entity of the corresponding category. Specific implementations of these steps are described below.
First, in step 31, a natural language text input by a user is obtained, and the natural language text is used for expressing the data analysis requirement of the user on target data. The target data may be stored in any storage manner. When the target data is stored in a database, different databases generally have different field names and field values, and the data analysis requirements they face differ accordingly. For example, a first database has the field names name, age, sex, identification number, and education level, while a second database has the field names user number, sex, and transaction amount; since the two have different fields, they generally face different data analysis requirements.
In one example, the data analysis requirements include querying a first range of the target data and performing a first manner of statistical analysis on the first range of the target data.
It is understood that a small range of data to be analyzed can be determined from a large range of stored data by determining the data analysis requirement, for example, the target data is stored in a target database, the target database comprises a plurality of data tables, each data table comprises a plurality of fields, at least one data table can be selected from the plurality of data tables, and the data of at least one field can be selected from each data table in the at least one data table for analysis. In addition, there are various ways of statistical analysis, such as sorting, summing, averaging, etc., and one or more specific ways of statistical analysis may be determined by determining the data analysis requirements.
Referring to the implementation scenario shown in fig. 1, the natural language text input by the user is "transaction amount of men in each city yesterday", and the data analysis requirement it expresses includes: looking up, in the target database, the individual transaction amounts of users whose sex is male on the specific date denoted by "yesterday", and summing those amounts by city, so as to obtain the total transaction amount of men in each city and facilitate comparison across cities.
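The requirement just described amounts to a filter followed by a grouped sum. The sketch below illustrates that reading on a handful of made-up rows; the row layout, the field names, the values in `ROWS`, and the function `total_by_city` are all hypothetical and are not taken from the target database.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List

# Hypothetical rows standing in for the target data.
ROWS: List[dict] = [
    {"city": "Beijing", "sex": "male", "date": date(2020, 9, 4), "amount": 1.5},
    {"city": "Beijing", "sex": "male", "date": date(2020, 9, 4), "amount": 0.5},
    {"city": "Beijing", "sex": "female", "date": date(2020, 9, 4), "amount": 3.0},
    {"city": "Nanjing", "sex": "male", "date": date(2020, 9, 4), "amount": 8.0},
    {"city": "Nanjing", "sex": "male", "date": date(2020, 9, 3), "amount": 1.0},
]


def total_by_city(rows: List[dict], day: date, sex: str) -> Dict[str, float]:
    """Sum the amounts of rows matching the date and sex, grouped by city."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        if row["date"] == day and row["sex"] == sex:   # query the first range
            totals[row["city"]] += row["amount"]       # aggregate by dimension
    return dict(totals)
```

Here the time entity and the dimension value entity select the rows, the dimension entity "city" supplies the grouping key, and "transaction amount" names the column being summed.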
Then, in step 32, entity dictionaries of a plurality of categories constructed based on the historical natural language corpus set and the target data are obtained, the plurality of categories being related to data dimensions and/or data analysis intents of the target data. Taking the case where the target data is stored in a target database as an example, the entities involved in the data analysis requirements of different databases (that is, the entities included in the entity dictionaries) have both common and database-specific aspects: the common entities can be obtained by analyzing the historical natural language corpus set, and the database-specific entities can be obtained by analyzing the target database.
FIG. 4 illustrates a diagram of the abstraction process of an entity dictionary, according to one embodiment. Referring to fig. 4, take the natural language text "trend of each payment channel of last three days" as an example. The text covers three types of expression: "last three days" is a time range expression, "each payment channel" is a dimension grouping expression, and "trend" is a query intention expression. Each type of expression can be abstracted into a dictionary of at least one category. For example, time range expressions include single-point expressions such as "yesterday" and "fifties" as well as span expressions such as "last year", "last week", and "numbers 1 to 5"; by analyzing the regularities of time range expressions, an entity dictionary of the time category can be constructed. Dimension grouping expressions include dimensions such as gender, city, and channel, and any combination of them; by analyzing their regularities, an entity dictionary of the dimension category can be constructed, where a dimension corresponds to a field name of the database. Query intention expressions include expressions of aggregation logic such as trend, summary, ranking, proportion, and period-over-period comparison (ring ratio); by analyzing their regularities, an entity dictionary of the intention category can be constructed.
FIG. 5 shows a diagram of a construction process of an entity dictionary, according to one embodiment. Referring to fig. 5, the establishment of the entity dictionary of multiple categories based on the historical natural language corpus set and the target database can be completed at one time, that is, the establishment of the entity dictionary is completed by taking the corpora included in the historical natural language corpus set and the metadata information (meta) and the data information (data) corresponding to the target database as input data and through several steps of data acquisition, entity relation processing and dictionary establishment. It will be appreciated that entity relationship processing is actually the process of abstracting the raw data obtained, and dictionary construction assigns corresponding categories to the abstracted entities or words.
The embodiments of this specification may also construct the entity dictionaries in two stages according to the source of the input data: one stage constructs entity dictionaries from the historical natural language corpus set, and the other constructs entity dictionaries from the target data.
In one example, the obtaining a plurality of categories of entity dictionaries constructed based on a set of historical natural language corpora and the target data includes:
acquiring a global dictionary constructed based on a historical natural language corpus set;
acquiring a proprietary dictionary constructed based on metadata information and data information of a target database to which the target data belongs; the global dictionary and the proprietary dictionary together constitute an entity dictionary for the plurality of categories.
In this example, the historical natural language corpus set contains a large number of natural language corpora that were used to query or analyze databases, though not necessarily the target database. By analyzing these corpora and identifying their regularities, an entity dictionary of the corresponding category, namely the global dictionary, can be constructed. Likewise, by analyzing the metadata information and data information of the target database and identifying their regularities, an entity dictionary of the corresponding category, namely the proprietary dictionary, can be constructed. The global dictionary and the proprietary dictionary complement each other and together represent the data dimensions and/or data analysis intents of the target data.
FIG. 6 illustrates a process diagram for building a global dictionary and a proprietary dictionary, according to one embodiment. Referring to fig. 6, with the large number of natural language corpora in the historical natural language corpus set as input data, a global dictionary can be obtained through a series of processes such as pattern extraction, transformation, and generalization. The global dictionary may correspond to entity dictionaries of at least one category, for example the entity dictionary of the time category mentioned above, and may further include an entity dictionary of the unit category, an entity dictionary of the symbol category, an entity dictionary of the function category, and the like. With the metadata information (meta) and data information (data) of the target database as input data, an entity dictionary of the dimension category (dimension dictionary for short) can be constructed from the metadata information through steps such as metadata processing, entity relation construction, and local loading; and an entity dictionary of the dimension value category (dimension value dictionary for short) can be constructed from the data information through steps such as automatic ETL (extract, transform, load), entity relation construction, and remote synchronization. The dimension value category corresponds to specific values of fields.
In one example, the plurality of categories include at least one of a time category, a unit category, an intent category, a dimension category, and a dimension value category; the dimension category corresponds to a field name in a target database to which the target data belongs, and the dimension value category corresponds to a specific value of a field in the target database.
Further, each word in the proprietary dictionary is stored in a triplet form that includes a name of a data table, a category name, and a field name.
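As a minimal sketch of how such triplets might be assembled from a single table, the following assumes hypothetical names (`Entry`, `build_proprietary_dictionary`, and the category labels `"dimension"` and `"dimension_value"`) that do not appear in the specification. Field names become dimension entries and distinct field values become dimension value entries.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    table: str     # name of the data table
    category: str  # category name (hypothetical labels, see lead-in)
    field: str     # field name


def build_proprietary_dictionary(table, fields, rows):
    """Map each word to the list of triplets under which it is stored."""
    dictionary = {}
    # Metadata side: every field name is a word of the dimension category.
    for field in fields:
        dictionary.setdefault(field, []).append(Entry(table, "dimension", field))
    # Data side: every distinct field value is a word of the dimension value category.
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None:
                continue
            entry = Entry(table, "dimension_value", field)
            bucket = dictionary.setdefault(str(value), [])
            if entry not in bucket:
                bucket.append(entry)
    return dictionary


rows = [
    {"city": "Beijing City", "sex": "male"},
    {"city": "Beijing City", "sex": "female"},
]
proprietary = build_proprietary_dictionary("users", ["city", "sex"], rows)
```

Keeping a list of triplets per word is a design choice of this sketch: it lets the same word appear under several tables or fields without overwriting earlier entries.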
In the embodiments of this specification, the entity dictionaries of the multiple categories can be constructed automatically by a machine, and can be rebuilt in real time after the data information or metadata information of the underlying data changes, so that startup and updating take place at the millisecond level, which is highly efficient.
Finally, in step 33, for the characters included in the natural language text, matching processing of the characters against the words included in the entity dictionaries of the plurality of categories is performed, and the matching result is taken as the identified entity in the corresponding category. It is understood that a match can succeed in two ways: either a character or character combination in the natural language text is completely consistent with a word in the entity dictionary, or it is partially consistent with such a word.
FIG. 7 illustrates a matching process diagram according to one embodiment. Referring to FIG. 7, the user inputs "Beijing". One word included in the entity dictionary is "Beijing City", which is stored in the form "Beijing City, city"; it is understood that "Beijing City" is a dimension value and "city" is the dimension, also called the field name, corresponding to that dimension value. Another word included in the entity dictionary is "male", which is stored in the form "male, sex"; here "male" is the dimension value and "sex" is the corresponding dimension, or field name. In the embodiments of this specification, words matching the user input can be searched for along a tree structure: starting from the root node, the first character (meaning "north") is matched first, then "Beijing", and finally "Beijing City".
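The tree-structured lookup can be illustrated with a character trie. This is only a sketch under assumptions: the class and method names are hypothetical, and the Chinese strings are an assumed original-language form of the translated examples ("北京市" for "Beijing City", "男" for "male").

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.entries = []   # non-empty only where a dictionary word ends


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, entry):
        """Store a word character by character, attaching its entry at the end."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.entries.append(entry)

    def walk(self, text):
        """Yield (prefix, is_complete_word) for each prefix of text found in the tree."""
        node = self.root
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                return
            yield text[: i + 1], bool(node.entries)


trie = Trie()
trie.insert("北京市", "city")  # "Beijing City", stored with its field name
trie.insert("男", "sex")       # "male", stored with its field name
steps = list(trie.walk("北京市"))
```

Walking from the root visits one node per character, so the cost of a lookup depends on the length of the input rather than on the size of the dictionary.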
In one example, the performing of the matching process of the characters with the words included in the entity dictionaries of the plurality of categories includes:
executing, in order and through multiple rounds of iteration, matching processing of the current character against the words in the entity dictionaries of the multiple categories; in each round of iteration, the current character is matched against the words in the entity dictionary; if the matching succeeds, the current round of iteration ends; if the matching does not succeed, the current character is combined with the next character and the combined character string is matched against the words in the entity dictionary, and this is repeated until the matching succeeds, at which point the current round of iteration ends.
This example is one concrete manner of the matching process; when the matching of the characters against the words included in the entity dictionaries of the plurality of categories is actually performed, it need not be performed sequentially in order.
Further, the matching the combined character string with the words included in the entity dictionary includes:
if the combined character string is completely consistent with the target word in the entity dictionary, confirming that the target word is an accurate matching result of the character string;
if the combined character string is partially consistent with a target word in the entity dictionary and the character string forms a prefix of the target word, confirming that the target word is a prefix matching result of the character string; and if the character string has both an accurate matching result and a prefix matching result, selecting the accurate matching result as the final matching result.
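The priority rule between the two kinds of result can be sketched as follows. The function name `match_string` and the return shape are assumptions made for illustration, and the dictionary is reduced to a plain set of words.

```python
def match_string(candidate, words):
    """Return ("exact", [word]), ("prefix", [words...]) or None.

    An exact (accurate) match wins whenever one exists; otherwise every
    dictionary word that starts with the candidate is a prefix match.
    """
    if candidate in words:
        return ("exact", [candidate])
    prefix_hits = sorted(w for w in words if w.startswith(candidate))
    if prefix_hits:
        return ("prefix", prefix_hits)
    return None


words = {"北京", "北京市", "男"}  # assumed original-language dictionary words
```

For instance, `match_string("北京", words)` reports an exact match even though "北京" is also a prefix of "北京市", while `match_string("北", words)` reports both longer words as prefix matches.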
Further, in each iteration, before matching the current character with a word included in the entity dictionary, the method further includes:
judging whether continuous numbers exist in the natural language text or not;
and if the continuous numbers exist, processing the continuous numbers as a single character, using the continuous numbers as the current character, and executing the matching of the current character and the words in the entity dictionary.
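Treating a run of consecutive numbers as a single character amounts to a small tokenization step. The sketch below assumes a hypothetical helper `tokenize` and an assumed set of Chinese numeral characters; the specification does not prescribe either.

```python
ARABIC_DIGITS = set("0123456789")
CHINESE_NUMERALS = set("零一二两三四五六七八九十百千")  # assumed numeral set


def is_number_char(ch):
    return ch in ARABIC_DIGITS or ch in CHINESE_NUMERALS


def tokenize(text):
    """Split text into characters, keeping each run of consecutive numbers whole.

    Arabic and Chinese numerals form separate runs so that each run can be
    normalized on its own afterwards.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_number_char(ch):
            same_kind = ARABIC_DIGITS if ch in ARABIC_DIGITS else CHINESE_NUMERALS
            j = i
            while j < len(text) and text[j] in same_kind:
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            tokens.append(ch)
            i += 1
    return tokens
```

With this helper, `tokenize("2020年销量")` keeps "2020" as one token and leaves the remaining characters separate.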
FIG. 8 shows a schematic diagram of a matching process according to another embodiment. Referring to FIG. 8, a variable start is first assigned the value 0. It is then determined whether start is smaller than query.length, where query is the query sentence, that is, the natural language text input by the user, and query.length is the number of characters it includes. If start is smaller than query.length, consecutive numbers are treated as a single character, and it is determined whether the current character is a number. If the current character is not a number, exact matching is performed. After the exact matching, or when the current character is determined to be a number, the next character is combined to perform prefix matching, and it is determined whether the prefix matching succeeds. If the prefix matching does not succeed, the next character continues to be combined for prefix matching. If the prefix matching succeeds, the candidate words of the current round are processed, with the exact matching result taken preferentially, and the value of start is reset so as to skip the characters matched in the current round. It is then determined again whether start is smaller than query.length; if so, the same processing flow as described above is performed, and if start is not smaller than query.length, the flow ends. When the candidate words are processed, for example, an exactly matched input is truncated to the matched word, e.g., to [male], whereas [Beijing], which misses an exact match but hits [Beijing City] by prefix, is filled in to [Beijing City]. In addition, if a plurality of words are matched, all of them may be used as matching results, or only the several top-ranked words may be used as matching results.
Further, performing the matching of the current character with the words included in the entity dictionary, with the consecutive number as the current character, includes:
if the consecutive number includes Chinese numerals and carries a Chinese unit, converting the Chinese numerals in the consecutive number into Arabic numerals;
combining the Arabic numerals with the Chinese unit and then matching the result against the words in the entity dictionary.
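The loop over start can be sketched as below. This is one possible reading rather than the flow of FIG. 8 itself: it is a greedy variant that keeps extending the candidate while it is still a prefix of some dictionary word, then prefers an exact match over a prefix match. The names `next_end` and `recognize`, the result tuples, and the Chinese sample words are assumptions.

```python
DIGITS = "0123456789"


def next_end(query, i):
    """Index just past the 'single character' at i; a digit run counts as one."""
    if query[i] in DIGITS:
        while i < len(query) and query[i] in DIGITS:
            i += 1
        return i
    return i + 1


def recognize(query, words):
    """Return (matched_text, dictionary_word, kind) tuples in input order."""
    results = []
    start = 0
    while start < len(query):
        exact = None
        prefix = None
        end = start
        while end < len(query):
            end = next_end(query, end)  # combine the next character
            candidate = query[start:end]
            hits = sorted(w for w in words if w.startswith(candidate))
            if not hits:
                break  # no dictionary word continues this candidate
            if candidate in words:
                exact = (candidate, candidate, "exact")
            else:
                # Fill the partial input in to the first word it prefixes.
                prefix = (candidate, hits[0], "prefix")
        chosen = exact or prefix  # the exact result is taken preferentially
        if chosen:
            results.append(chosen)
            start += len(chosen[0])  # skip the characters matched this round
        else:
            start = next_end(query, start)
    return results


words = {"北京市", "男", "2020年"}
entities = recognize("北京男", words)
```

Here "北京" misses an exact match but prefix-hits "北京市", so it is filled in, and "男" is matched exactly in the following round.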
For example, the natural language text includes "5 month thirty-one" (that is, May thirty-first, with the day written in Chinese numerals), and the processing yields "5 month 31", which is a character string formed by combining Arabic numerals and Chinese units.
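A sketch of this normalization is given below, assuming that the original-language input is "5月三十一". The helper names, the numeral tables, and the list of units are hypothetical, and the converter only handles values below ten thousand.

```python
import re

DIGIT_VALUES = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
PLACE_VALUES = {"十": 10, "百": 100, "千": 1000}
CHINESE_UNITS = "年月日号"  # assumed set of Chinese time units
NUMERAL_RUN = re.compile("[零一二两三四五六七八九十百千]+")


def chinese_to_int(numerals):
    """Convert a Chinese numeral string such as "三十一" to an int (below 10000)."""
    total = 0
    current = 0
    for ch in numerals:
        if ch in DIGIT_VALUES:
            current = DIGIT_VALUES[ch]
        else:
            # A bare place value such as "十" counts as one times that place.
            total += (current or 1) * PLACE_VALUES[ch]
            current = 0
    return total + current


def normalize_numbers(text):
    """Rewrite Chinese numeral runs as Arabic numerals when a Chinese unit is present."""
    if not any(unit in text for unit in CHINESE_UNITS):
        return text
    return NUMERAL_RUN.sub(lambda m: str(chinese_to_int(m.group())), text)
```

The unit check is deliberately coarse: it looks for a unit anywhere in the text, so a production version would need a stricter rule to avoid rewriting numeral characters that are part of ordinary words.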
Further, the matching with the words included in the entity dictionary after combining the arabic numerals and the chinese units includes:
combining Arabic numerals and Chinese units, and performing digital generalization processing to obtain a first generalization result so as to ignore the influence of specific numerals;
and matching the first generalization result with words included in the entity dictionary.
For example, the first generalization result corresponding to "5 month 31" is "N month N", which can be successfully prefix-matched with the word "N month N day" in the entity dictionary.
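Digit generalization can be sketched with a regular-expression substitution. The function name `generalize` and the Chinese pattern "N月N号" are assumptions about the original-language form of the example; the extracted numbers are returned alongside the pattern so that they can be restored after matching.

```python
import re

ARABIC_RUN = re.compile("[0-9]+")


def generalize(text):
    """Replace each run of Arabic digits with N and return (pattern, numbers)."""
    numbers = [int(n) for n in ARABIC_RUN.findall(text)]
    pattern = ARABIC_RUN.sub("N", text)
    return pattern, numbers


pattern, numbers = generalize("5月31")
is_prefix_hit = "N月N号".startswith(pattern)
```

Because the specific digits are replaced by a placeholder, a single dictionary word covers every date of the same shape.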
Further, the method further comprises:
if the matching result indicates that the consecutive number corresponds to an entity of the time category, judging whether the high-order time units of the matching result are complete;
and if the high-order time units of the matching result are judged to be incomplete, completing them according to the current time.
For example, "N month N day" lacks the high-order time unit (the year) and needs to be completed to the form "N year N month N day"; after completion it becomes, for example, May 31, 2020, which can then be converted to a computer-recognizable form, e.g., "Timespot(2020-05-31)".
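Completing the missing year from the current time can be sketched with the standard `datetime` module. The function names are hypothetical, and the "Timespot(...)" output format simply mirrors the example string given in the text.

```python
from datetime import date


def complete_high_order(month, day, today, year=None):
    """Fill in a missing year from the current date and return a full date."""
    if year is None:
        year = today.year
    return date(year, month, day)


def to_timespot(d):
    """Render a date in the computer-recognizable form used in the example."""
    return f"Timespot({d.isoformat()})"


# The current time is passed in explicitly so that the result is reproducible.
completed = complete_high_order(5, 31, today=date(2020, 6, 15))
rendered = to_timespot(completed)
```

Passing the current date as a parameter, rather than calling `date.today()` inside the function, keeps the completion step easy to test.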
According to the method provided by the embodiments of this specification, a natural language text input by a user is first obtained, where the natural language text expresses the user's data analysis requirement on target data. Entity dictionaries of multiple categories, constructed based on a historical natural language corpus set and the target data, are then acquired, where the multiple categories are related to data dimensions and/or data analysis intents of the target data. Finally, for the characters included in the natural language text, matching processing of the characters against the words included in the entity dictionaries of the multiple categories is performed, and the matching result is taken as the identified entity in the corresponding category. As can be seen from the above, when faced with a natural language text input by a user, the embodiments of this specification do not perform entity recognition with a conventional deep learning method. Instead, they match the characters of the text against the words of the entity dictionaries of the multiple categories. Because these dictionaries are constructed from the historical natural language corpus set and the target data, and their categories are related to the data dimensions and/or data analysis intents of the target data, the identified entities and their categories help in understanding the user's data analysis requirement, and the recognition process can meet the accuracy and interpretability requirements of data analysis.
According to an embodiment of another aspect, an entity identification apparatus in data analysis is also provided, and the apparatus is used for executing the method provided by the embodiments of this specification. FIG. 9 shows a schematic block diagram of an entity identification apparatus in data analysis according to one embodiment. As shown in FIG. 9, the apparatus 900 includes:
a first obtaining unit 91, configured to obtain a natural language text input by a user, where the natural language text is used to express a data analysis requirement of the user on target data;
a second obtaining unit 92, configured to obtain entity dictionaries of multiple categories, which are constructed based on a set of historical natural language corpora and the target data, and are related to data dimensions and/or data analysis intents of the target data;
a matching unit 93 configured to perform a matching process of characters with words included in the entity dictionaries of the plurality of categories acquired by the second acquiring unit 92 for the characters included in the natural language text acquired by the first acquiring unit 91, and take the matching result as an entity in the identified corresponding category.
Optionally, as an embodiment, the data analysis requirement includes querying a first range of the target data and performing a first manner of statistical analysis on the first range of the target data.
Optionally, as an embodiment, the second obtaining unit 92 includes:
a first acquisition subunit, configured to acquire a global dictionary constructed based on a historical natural language corpus set;
a second acquisition subunit, configured to acquire a proprietary dictionary constructed based on metadata information and data information of a target database to which the target data belongs; the global dictionary and the proprietary dictionary together constitute the entity dictionaries of the plurality of categories.
Optionally, as an embodiment, the plurality of categories include at least one of a time category, a unit category, an intention category, a dimension category, and a dimension value category; the dimension category corresponds to a field name in a target database to which the target data belongs, and the dimension value category corresponds to a specific value of a field in the target database.
Further, each word in the proprietary dictionary is stored in a triplet form that includes a name of a data table, a category name, and a field name.
Optionally, as an embodiment, the matching unit 93 is specifically configured to sequentially perform, in a multiple-round iterative manner, matching processing between a current character and words included in the entity dictionaries of the multiple categories; and in each iteration, matching the current character with the words in the entity dictionary, if the matching is successful, ending the iteration of the current round, if the matching is unsuccessful, combining the current character with the next character, matching the combined character string with the words in the entity dictionary, and ending the iteration of the current round until the matching is successful.
Further, the matching unit 93 includes:
the first matching subunit is used for confirming that the target word is an accurate matching result of the character string if the combined character string is completely consistent with the target word in the entity dictionary;
the second matching subunit is used for confirming that the target word is a prefix matching result of the character string if the combined character string is partially consistent with a target word in the entity dictionary and the character string forms a prefix of the target word; and if the character string has both an accurate matching result and a prefix matching result, selecting the accurate matching result as the final matching result.
Further, the matching unit 93 includes:
a judging subunit, configured to, in each iteration, judge whether there are consecutive digits in the natural language text before matching a current character with a word included in an entity dictionary;
and the processing subunit is configured to, if the judging subunit judges that there is a consecutive number, treat the consecutive number as a single character, treat the consecutive number as a current character, and perform the matching between the current character and a word included in the entity dictionary.
Further, the processing subunit includes:
a conversion module for converting the Chinese numerals in the consecutive number into Arabic numerals if the consecutive number includes Chinese numerals and carries a Chinese unit;
and the matching module is used for combining the Arabic numbers obtained by the conversion module with the Chinese units and then matching the Arabic numbers with the words in the entity dictionary.
Further, the matching module is specifically configured to combine the arabic numerals and the chinese units, and then perform digital generalization processing to obtain a first generalization result so as to ignore the influence of specific numerals; and matching the first generalization result with words included in the entity dictionary.
Further, the processing subunit further includes:
the judging module is used for judging whether the time high order of the matching result is complete or not if the matching result obtained by the matching module shows the entity of the time category corresponding to the continuous number;
and the completion module is used for completing the time high position of the matching result according to the current time if the judgment module judges that the time high position of the matching result is incomplete.
With the apparatus provided in this specification, the first obtaining unit 91 first obtains a natural language text input by a user, where the natural language text expresses the user's data analysis requirement on target data. The second obtaining unit 92 then obtains entity dictionaries of multiple categories, constructed based on the historical natural language corpus set and the target data, where the multiple categories are related to data dimensions and/or data analysis intents of the target data. Finally, for the characters included in the natural language text, the matching unit 93 performs matching processing of the characters against the words included in the entity dictionaries of the multiple categories, and takes the matching result as the identified entity in the corresponding category. As can be seen from the above, when faced with a natural language text input by a user, the embodiments of this specification do not perform entity recognition with a conventional deep learning method. Instead, they match the characters of the text against the words of the entity dictionaries of the multiple categories. Because these dictionaries are constructed from the historical natural language corpus set and the target data, and their categories are related to the data dimensions and/or data analysis intents of the target data, the identified entities and their categories help in understanding the user's data analysis requirement, and the recognition process can meet the accuracy and interpretability requirements of data analysis.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 3 or fig. 8.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 3 or fig. 8.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The specific embodiments above further describe the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (24)

1. A method of entity identification in data analysis, the method comprising:
acquiring a natural language text input by a user, wherein the natural language text is used for expressing the data analysis requirement of the user on target data;
acquiring entity dictionaries of multiple categories constructed based on a historical natural language corpus set and the target data, wherein the multiple categories are related to data dimensions and/or data analysis intents of the target data;
and executing matching processing of the characters and words included in the entity dictionaries of the plurality of categories for the characters included in the natural language text, and taking a matching result as an entity in the identified corresponding category.
2. The method of claim 1, wherein the data analysis requirements include querying a first range of the target data and performing a first manner of statistical analysis on the first range of the target data.
3. The method of claim 1, wherein the obtaining a plurality of categories of entity dictionaries constructed based on a set of historical natural language corpora and the target data comprises:
acquiring a global dictionary constructed based on a historical natural language corpus set;
acquiring a proprietary dictionary constructed based on metadata information and data information of a target database to which the target data belongs; the global dictionary and the proprietary dictionary together constitute an entity dictionary for the plurality of categories.
4. The method of claim 1, wherein the plurality of categories include at least one of a time category, a unit category, an intent category, a dimension category, and a dimension value category; the dimension category corresponds to a field name in a target database to which the target data belongs, and the dimension value category corresponds to a specific value of a field in the target database.
5. The method of claim 3, wherein each word in the proprietary dictionary is stored in a triplet form, the triplet including a name of a data table, a category name, and a field name.
6. The method of claim 1, wherein the performing matching processing of characters with words included in the entity dictionaries of the plurality of categories includes:
sequentially executing matching processing of the current character and the words in the entity dictionaries of the multiple categories in sequence in a multi-round iteration mode; and in each iteration, matching the current character with the words in the entity dictionary, if the matching is successful, ending the iteration of the current round, if the matching is unsuccessful, combining the current character with the next character, matching the combined character string with the words in the entity dictionary, and ending the iteration of the current round until the matching is successful.
7. The method of claim 6, wherein the matching the combined character string with words included in an entity dictionary comprises:
if the combined character string is completely consistent with the target word in the entity dictionary, confirming that the target word is an accurate matching result of the character string;
if the combined character string is consistent with a target word part in an entity dictionary and the character string belongs to a prefix part of the target word, confirming that the target word is a prefix matching result of the character string; and if the character string has an accurate matching result and also has a prefix matching result, selecting the accurate matching result as a final matching result.
8. The method of claim 6, wherein, prior to matching the current character with words included in the entity dictionary in each iteration, further comprising:
judging whether continuous numbers exist in the natural language text or not;
and if the continuous numbers exist, processing the continuous numbers as a single character, using the continuous numbers as the current character, and executing the matching of the current character and the words in the entity dictionary.
9. The method of claim 8, wherein performing the matching of the current character with the words included in the entity dictionary, with the consecutive number as the current character, includes:
if the consecutive number includes Chinese numerals and carries a Chinese unit, converting the Chinese numerals in the consecutive number into Arabic numerals;
the Arabic numerals are combined with the Chinese units and then matched with words in the entity dictionary.
10. The method of claim 9, wherein the combining the arabic numerals with the chinese units to match words included in the entity dictionary comprises:
combining Arabic numerals and Chinese units, and performing digital generalization processing to obtain a first generalization result so as to ignore the influence of specific numerals;
and matching the first generalization result with words included in the entity dictionary.
11. The method of claim 10, wherein the method further comprises:
if the matching result shows the entity of the time category corresponding to the continuous number, judging whether the time high order of the matching result is complete;
and if the time high order of the matching result is judged to be incomplete, the time high order of the matching result is filled according to the current time.
12. An entity identification apparatus in data analysis, the apparatus comprising:
a first acquisition unit, configured to acquire a natural language text input by a user, wherein the natural language text is used for expressing the data analysis requirement of the user on target data;
a second obtaining unit, configured to obtain entity dictionaries of multiple categories, which are constructed based on a historical natural language corpus set and the target data, and are related to data dimensions and/or data analysis intents of the target data;
and a matching unit configured to perform matching processing of characters and words included in the entity dictionaries of the plurality of categories acquired by the second acquisition unit for the characters included in the natural language text acquired by the first acquisition unit, and take a matching result as an entity in the identified corresponding category.
13. The apparatus of claim 12, wherein the data analysis requirements include querying a first range of the target data and performing a first manner of statistical analysis on the first range of the target data.
14. The apparatus of claim 12, wherein the second obtaining unit comprises:
a first acquisition subunit, configured to acquire a global dictionary constructed based on a historical natural language corpus set;
a second acquisition subunit, configured to acquire a proprietary dictionary constructed based on metadata information and data information of a target database to which the target data belongs; the global dictionary and the proprietary dictionary together constitute the entity dictionaries of the plurality of categories.
15. The apparatus of claim 12, wherein the plurality of categories include at least one of a time category, a unit category, an intent category, a dimension category, and a dimension value category; the dimension category corresponds to a field name in a target database to which the target data belongs, and the dimension value category corresponds to a specific value of a field in the target database.
16. The apparatus of claim 14, wherein each word in the proprietary dictionary is stored in a triplet form, the triplet including a name of a data table, a category name, and a field name.
17. The apparatus according to claim 12, wherein the matching unit is specifically configured to sequentially perform, in an order through multiple rounds of iteration, matching processing of the current character with words included in the entity dictionaries of the multiple categories; and in each iteration, matching the current character with the words in the entity dictionary, if the matching is successful, ending the iteration of the current round, if the matching is unsuccessful, combining the current character with the next character, matching the combined character string with the words in the entity dictionary, and ending the iteration of the current round until the matching is successful.
18. The apparatus of claim 17, wherein the matching unit comprises:
the first matching subunit is used for confirming that the target word is an accurate matching result of the character string if the combined character string is completely consistent with the target word in the entity dictionary;
the second matching subunit is used for confirming that the target word is a prefix matching result of the character string if the combined character string is consistent with the target word part in the entity dictionary and the character string belongs to the prefix part of the target word; and if the character string has an accurate matching result and also has a prefix matching result, selecting the accurate matching result as a final matching result.
19. The apparatus of claim 17, wherein the matching unit comprises:
a judging subunit, configured to, in each iteration, judge whether there are consecutive digits in the natural language text before matching a current character with a word included in an entity dictionary;
and the processing subunit is configured to, if the judging subunit judges that there is a consecutive number, treat the consecutive number as a single character, treat the consecutive number as a current character, and perform the matching between the current character and a word included in the entity dictionary.
20. The apparatus of claim 19, wherein the processing subunit comprises:
a conversion module for converting the Chinese numerals in the consecutive number into Arabic numerals if the consecutive number includes Chinese numerals and carries a Chinese unit;
and the matching module is used for combining the Arabic numbers obtained by the conversion module with the Chinese units and then matching the Arabic numbers with the words in the entity dictionary.
21. The apparatus according to claim 20, wherein the matching module is specifically configured to combine the arabic numerals and the chinese units, and then perform a digital generalization process to obtain a first generalization result, so as to ignore the influence of the specific numerals; and matching the first generalization result with words included in the entity dictionary.
22. The apparatus of claim 21, wherein the processing subunit further comprises:
the judging module is used for judging whether the time high order of the matching result is complete or not if the matching result obtained by the matching module shows the entity of the time category corresponding to the continuous number;
and the completion module is used for completing the time high position of the matching result according to the current time if the judgment module judges that the time high position of the matching result is incomplete.
23. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
24. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-11.
CN202210058350.8A 2022-01-19 2022-01-19 Entity identification method and device in data analysis Active CN114138945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210058350.8A CN114138945B (en) 2022-01-19 2022-01-19 Entity identification method and device in data analysis


Publications (2)

Publication Number Publication Date
CN114138945A true CN114138945A (en) 2022-03-04
CN114138945B CN114138945B (en) 2022-06-14

Family

ID=80381760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210058350.8A Active CN114138945B (en) 2022-01-19 2022-01-19 Entity identification method and device in data analysis

Country Status (1)

Country Link
CN (1) CN114138945B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device
US20170364503A1 (en) * 2016-06-17 2017-12-21 Abbyy Infopoisk Llc Multi-stage recognition of named entities in natural language text based on morphological and semantic features
CN107608968A (en) * 2017-09-22 2018-01-19 深圳市易图资讯股份有限公司 Chinese word cutting method, the device of text-oriented big data
US20180210883A1 (en) * 2017-01-25 2018-07-26 Dony Ang System for converting natural language questions into sql-semantic queries based on a dimensional model
CN108491373A (en) * 2018-02-01 2018-09-04 北京百度网讯科技有限公司 A kind of entity recognition method and system
CN111368547A (en) * 2020-03-09 2020-07-03 中国平安人寿保险股份有限公司 Entity identification method, device, equipment and storage medium based on semantic analysis
CN112464667A (en) * 2020-11-18 2021-03-09 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
CN112800201A (en) * 2021-01-28 2021-05-14 杭州汇数智通科技有限公司 Natural language processing method and device and electronic equipment
CN113051898A (en) * 2019-12-27 2021-06-29 北京阿博茨科技有限公司 Word meaning accumulation and word segmentation method, tool and system for structured data searched by natural language

Also Published As

Publication number Publication date
CN114138945B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN110543517B (en) Method, device and medium for realizing complex query of mass data based on elastic search
CN100562870C (en) Translating equipment and interpretation method
CN108182972B (en) Intelligent coding method and system for Chinese disease diagnosis based on word segmentation network
US11132372B2 (en) Method and apparatus for precise positioning of scholar based on mining of scholar's scientific research achievement
CN108182207B (en) Intelligent coding method and system for Chinese surgical operation based on word segmentation network
CN104866593A (en) Database searching method based on knowledge graph
CN102063482B (en) High-efficiency contact searching method of handheld device
CN104657439A (en) Generation system and method for structured query sentence used for precise retrieval of natural language
CN101794307A (en) Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
CN109147767A (en) Digit recognition method, device, computer equipment and storage medium in voice
CN103970798A (en) Technology for searching and matching data
CN111190920A (en) Data interactive query method and system based on natural language
CN115794833A (en) Data processing method, server and computer storage medium
CN114138945B (en) Entity identification method and device in data analysis
CN112148735B (en) Construction method for structured form data knowledge graph
CN117290376A (en) Two-stage Text2SQL model, method and system based on large language model
CN114090722B (en) Method and device for automatically completing query content
CN111797279A (en) Data storage method and device
JPH10232877A (en) Collation device for character string and data base system
CN115422180A (en) Data verification method and system
JP2008059389A (en) Vocabulary candidate output system, vocabulary candidate output method, and vocabulary candidate output program
JP2003331214A (en) Character recognition error correction method, device and program
CN114218935B (en) Entity display method and device in data analysis
CN112733528B (en) Code matching method, device and equipment for medical data and storage medium
CN111158500A (en) Method and device for improving input efficiency by using wildcard

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant