CN111309851A - Entity word storage method and device and electronic equipment - Google Patents

Entity word storage method and device and electronic equipment

Publication number
CN111309851A
Authority
CN
China
Prior art keywords
text
stored
text character
entity word
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010091208.4A
Other languages
Chinese (zh)
Other versions
CN111309851B (en
Inventor
许晏铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Internet Security Software Co Ltd
Original Assignee
Beijing Kingsoft Internet Security Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Internet Security Software Co Ltd
Priority to CN202010091208.4A
Publication of CN111309851A
Application granted
Publication of CN111309851B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides an entity word storage method, an entity word storage device and electronic equipment. The method comprises: obtaining entity word data to be stored; determining, based on a pre-established correspondence between entity word types and type text characters, the type text character corresponding to the entity word type to be stored, as the target type text character; adding the target type text character before the starting text character of the entity word to be stored to obtain a text character string to be stored; and storing the text character string to be stored into a pre-established dictionary tree.

Description

Entity word storage method and device and electronic equipment
Technical Field
The present invention relates to the field of data storage technologies, and in particular, to a method and an apparatus for storing entity words, and an electronic device.
Background
With the spread of the internet and the development of internet technology, people increasingly turn to the internet in daily life and work, either to find solutions to problems they encounter or to find detailed information and purchase links for items they like. For example, a user interested in travel may search for "Is the KeNX base fried chicken on Australian X street good?".
Technically, the content a user searches for is called a query. Identifying the entity words in the query helps the downstream Natural Language Processing (NLP) modules understand the user's search intent. An entity word in a query is an entity with a specific meaning in the text, including names of people, places and organizations, proper nouns and the like, as well as expressions of time, quantity, currency, ratios and other numerical values.
Conventionally, entity words in a query are identified by entity word matching. In brief, an entity word database is built in advance by collecting entity words; when the entity words in a query need to be identified, the words contained in the query are looked up in that database. For example, for the query "Is the KeNX base fried chicken on Australian X street good?", the entity words "Australian X street", "KeNX base" and "fried chicken" are found in the database, so the entity words of the query are "Australian X street", "KeNX base" and "fried chicken".
In the process of implementing the invention, the inventor found that the prior art has at least the following problem:
a query may contain entity words of several entity word types. In the query above, for example, the entity word "Australian X street" belongs to the location type, "KeNX base" belongs to the brand type and the food type, and "fried chicken" belongs to the food type. Different analyses need different entity word types; for example, a study of users' brand preferences may need only the entity words of the brand type in the query. The prior art identifies the entity words of all entity word types in the query, so the identified entity words must then be searched again to obtain those of the specified entity word type, which is inefficient.
Disclosure of Invention
The embodiment of the invention aims to provide an entity word storage method to improve the efficiency of identifying entity words of specified entity word types in a query. The specific technical scheme is as follows:
the embodiment of the invention provides an entity word storage method, which comprises the following steps:
acquiring entity word data to be stored, wherein the entity word data comprises entity words to be stored and entity word types to which the entity words to be stored belong, and the entity word types to which the entity words to be stored belong are used as the entity word types to be stored;
determining type text characters corresponding to the entity word types to be stored as target type text characters based on a corresponding relation between the entity word types and the type text characters established in advance;
adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string as the text character string to be stored;
and storing the text character string to be stored into a pre-established dictionary tree, wherein child nodes of a root node in the dictionary tree are used for storing type text characters.
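The four steps above can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the nested-dict trie, the `TYPE_CHARS` mapping, the function name and the sample words are all assumptions made for the example.

```python
# Hypothetical correspondence between entity word types and type text characters.
TYPE_CHARS = {"location": "A", "brand": "B", "food": "C", "clothing": "D"}

def store_entity_word(trie, word, word_type):
    """Prefix the word with its type character and insert it into a nested-dict trie."""
    to_store = TYPE_CHARS[word_type] + word   # type character goes before the first character
    node = trie
    for ch in to_store:
        node = node.setdefault(ch, {})        # children of the root end up holding type characters
    return to_store

trie = {}
store_entity_word(trie, "paris", "location")
store_entity_word(trie, "pizza", "food")
```

Because the type character is always the first character of the stored string, every child of the root is a type character, and each type has its own subtree.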
Further, the storing the text character string to be stored into a pre-established dictionary tree includes:
determining the sub-text character string of the text character string to be stored that is already stored in the dictionary tree, as the stored text character string, and determining the sub-text character string of the text character string to be stored that is not yet stored in the dictionary tree, as the unstored text character string;
and storing the text characters of the unstored text character string one by one, starting from its starting text character, in the order in which they appear in the unstored text character string.
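A small sketch of this split, assuming a nested-dict trie and hypothetical helper names (`split_stored`, `store_remaining`); the stored part is taken to be the longest prefix already present in the tree.

```python
def split_stored(trie, text):
    """Return (stored_prefix, unstored_suffix) of text relative to a nested-dict trie."""
    node, i = trie, 0
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
    return text[:i], text[i:]

def store_remaining(trie, text):
    """Store only the characters of text that the trie does not hold yet."""
    stored, unstored = split_stored(trie, text)
    node = trie
    for ch in stored:      # walk down the part that already exists
        node = node[ch]
    for ch in unstored:    # append the rest in composition order
        node[ch] = {}
        node = node[ch]
    return stored, unstored
```

Storing "Aab" into an empty tree and then "Aac" only adds the single character "c" on the second call, since "Aa" is already present.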
Further, a starting text character of the unstored text character string is used as an unstored starting text character;
the sequentially storing the text characters from the starting text character of the unstored text character string, according to the composition order of the text characters in the unstored text character string, comprises:
determining, in the text character string to be stored, the offset parameter of the text character immediately preceding the unstored starting text character, as the target offset parameter;
determining the storage position of the unstored starting text character, as the target storage position, based on the target offset parameter;
storing the unstored starting text character at the target storage position;
and updating the unstored text character string and the unstored starting text character, and returning to the step of determining the offset parameter of the text character immediately preceding the unstored starting text character, until every text character in the unstored text character string has been stored.
Further, the determining the offset parameter of the text character immediately preceding the unstored starting text character, as the target offset parameter, includes:
taking the predetermined offset parameter as the target offset parameter in the case that the text character preceding the unstored starting text character already has a predetermined offset parameter; or,
determining the unused offset parameter with the smallest numerical value as the target offset parameter in the case that the text character preceding the unstored starting text character has no predetermined offset parameter.
Further, the determining the storage position of the unstored starting text character, as the target storage position, based on the target offset parameter includes:
calculating the sum of the target offset parameter and the coding numerical value of the unstored starting text character as a preprocessing numerical value;
determining the storage index of the unstored starting text character, as the target storage index, based on the preprocessing numerical value;
and determining the storage position associated with the target storage index as the target storage position, based on the pre-established association relationship between storage indexes and storage positions.
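The offset-based placement described above can be sketched as below. Several things here are assumptions rather than statements from the source: the coding numerical value is taken to be the Unicode code point (`ord`), the association between storage index and storage position is modeled as a plain dict, and offsets are handed out in steps of a hypothetical `ALPHABET_SIZE` so that two different (offset, character) pairs can never produce the same sum. The source does not say how such collisions are avoided.

```python
ALPHABET_SIZE = 0x110000  # assumption: one slot per possible Unicode code point
ROOT = -1                 # hypothetical index for the root node, which stores no character

class OffsetTrie:
    def __init__(self):
        self.slots = {}                    # storage index -> (character, parent index)
        self.offsets = {ROOT: 0}           # node index -> offset parameter used for its children
        self._next_offset = ALPHABET_SIZE  # smallest offset value not yet handed out

    def _target_offset(self, prev):
        """Reuse the previous character's offset, or assign the smallest unused one."""
        if prev not in self.offsets:
            self.offsets[prev] = self._next_offset
            self._next_offset += ALPHABET_SIZE
        return self.offsets[prev]

    def store(self, text):
        """Store text character by character; return the index of its last character."""
        prev = ROOT
        for ch in text:
            index = self._target_offset(prev) + ord(ch)  # offset + coding value
            self.slots.setdefault(index, (ch, prev))     # already-stored characters are kept
            prev = index
        return prev
```

With this layout, looking up a child is a single addition and one dict access, which is the usual motivation for offset-indexed tries.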
Further, the method further comprises:
after the terminal text character of the text character string to be stored has been stored, adding an entity word identification node as a child of the node where the terminal text character is located, wherein the entity word identification node indicates that the text character contained in its parent node is a terminal text character.
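As an illustration of the identification node, the sketch below uses a nested-dict trie and a hypothetical sentinel key `END` that is added as a child of the node holding the last character. The names and the sample strings are assumptions for the example.

```python
END = object()  # hypothetical marker standing in for the entity word identification node

def insert(trie, text):
    node = trie
    for ch in text:
        node = node.setdefault(ch, {})
    node[END] = True   # child marker: the parent holds a terminal text character

def is_entity_word(trie, text):
    node = trie
    for ch in text:
        if ch not in node:
            return False
        node = node[ch]
    return END in node  # a bare prefix has no marker child
```

The marker is what separates a stored word from a mere prefix of a longer stored word: after inserting only "Cfried chicken", the string "Cfried" is a path in the tree but not an entity word.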
Further, each storage position in the dictionary tree is located in a plurality of storage blocks divided in advance.
The embodiment of the invention provides a method for identifying entity words in a text segment, which comprises the following steps:
acquiring text segment data to be recognized, wherein the text segment data comprises a text segment to be recognized and a specified entity word type, and the specified entity word type is determined based on received entity word type selection operation;
determining the type text character corresponding to the specified entity word type, as the specified type text character, based on a pre-established corresponding relationship between the entity word types and the type text characters;
and recognizing the text segment to be recognized based on the specified type text characters and a pre-established dictionary tree to obtain a recognition result, taking the recognition result as an entity word belonging to the specified entity word type, wherein child nodes of a root node in the dictionary tree are used for storing the type text characters.
Further, the recognizing the text segment to be recognized based on the text character of the specified type and a pre-established dictionary tree to obtain a recognition result, including:
determining, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character, as the target node;
and identifying the text segment to be identified from the target node to obtain an identification result.
Further, the recognizing the text segment to be recognized starting from the target node to obtain a recognition result, including:
determining, among the text characters contained in the text segment to be recognized, the text characters contained in the child nodes of the target node, as target text characters;
if the child nodes of the node to which a target text character belongs include an entity word identification node, determining the text character string corresponding to that node as a recognition result, wherein the entity word identification node indicates that the text character contained in its parent node is the terminal text character of an entity word;
and if, in the text segment to be recognized, the text character following the target text character is contained in a child node of the node to which the target text character belongs, taking that following text character as the target text character and returning to the step of determining the text character string corresponding to the node to which the target text character belongs as a recognition result if the child nodes of that node include an entity word identification node.
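The recognition steps above can be sketched as a scan that starts below the child of the root holding the specified type character. This is a hedged illustration: the nested-dict trie, the `END` key, the function names and the English sample query are assumptions, and the sketch reports every match it passes rather than only the longest one, since the text does not state a preference.

```python
END = "$end"  # hypothetical key for the identification node; longer than any single character

def insert(trie, type_char, word):
    node = trie
    for ch in type_char + word:
        node = node.setdefault(ch, {})
    node[END] = True

def recognize(trie, type_char, text):
    """Return entity words of the given type found in text, trying every start position."""
    target = trie.get(type_char)   # child of the root holding the specified type character
    if target is None:
        return []
    found = []
    for start in range(len(text)):
        node = target
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break                               # no child for this character
            if END in node:
                found.append(text[start:end + 1])   # identification node reached
    return found

trie = {}
insert(trie, "A", "x street")
insert(trie, "B", "kenx")
insert(trie, "C", "kenx")
insert(trie, "C", "fried chicken")
query = "is the kenx fried chicken on x street good"
brand_words = recognize(trie, "B", query)
```

Only the subtree of the requested type is ever visited, so words of other types are never matched and never need to be filtered out afterwards.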
An embodiment of the present invention further provides an entity word storage device, where the device includes:
the entity word obtaining module is used for obtaining entity word data to be stored, wherein the entity word data comprises entity words to be stored and entity word types to which the entity words to be stored belong, and the entity word types to which the entity words to be stored belong are used as the entity word types to be stored;
the first type text character determining module is used for determining type text characters corresponding to the entity word types to be stored as target type text characters based on the corresponding relationship between the entity word types and the type text characters established in advance;
the adding module is used for adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string which is used as the text character string to be stored;
and the storage module is used for storing the text character string to be stored into a pre-established dictionary tree, and the child nodes of the root node in the dictionary tree are used for storing the type text characters.
Further, the storage module is specifically configured to determine the sub-text character string of the text character string to be stored that is already stored in the dictionary tree, as the stored text character string, determine the sub-text character string of the text character string to be stored that is not yet stored in the dictionary tree, as the unstored text character string, and store the text characters of the unstored text character string one by one, starting from its starting text character, in the order in which they appear in the unstored text character string.
Further, a starting text character of the unstored text character string is used as an unstored starting text character;
the storage module is specifically configured to determine, in the text character string to be stored, the offset parameter of the text character immediately preceding the unstored starting text character, as the target offset parameter, determine the storage position of the unstored starting text character, as the target storage position, based on the target offset parameter, store the unstored starting text character at the target storage position, update the unstored text character string and the unstored starting text character, and return to determining the offset parameter of the text character immediately preceding the unstored starting text character, until every text character in the unstored text character string has been stored.
Further, the storage module is specifically configured to take the predetermined offset parameter as the target offset parameter in the case that the text character preceding the unstored starting text character already has a predetermined offset parameter; or, in the case that the text character preceding the unstored starting text character has no predetermined offset parameter, determine the unused offset parameter with the smallest numerical value as the target offset parameter.
Further, the storage module is specifically configured to calculate a sum of the target offset parameter and the encoding numerical value of the unstored start text character as a preprocessed numerical value, determine, based on the preprocessed numerical value, a storage index of the unstored start text character as a target storage index, and determine, based on a pre-established association relationship between the storage index and the storage location, a storage location associated with the target storage index as the target storage location.
Further, the storage module is further configured to, after the terminal text character of the text character string to be stored has been stored, add an entity word identification node as a child of the node where the terminal text character is located, where the entity word identification node indicates that the text character contained in its parent node is a terminal text character.
Further, each storage position in the dictionary tree is located in a plurality of storage blocks divided in advance.
The embodiment of the invention also provides a device for recognizing the entity words in the text segment, which comprises:
the text segment data acquisition module is used for acquiring text segment data to be recognized, wherein the text segment data comprises a text segment to be recognized and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation;
the second type text character determining module is used for determining, based on the pre-established corresponding relationship between the entity word types and the type text characters, the type text character corresponding to the specified entity word type, as the specified type text character;
and the recognition module is used for recognizing the text segment to be recognized based on the specified type text character and a pre-established dictionary tree to obtain a recognition result, and taking the recognition result as an entity word belonging to the specified entity word type, wherein the child nodes of the root node in the dictionary tree are used for storing the type text characters.
Further, the recognition module is specifically configured to determine, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character, as the target node, and recognize the text segment to be recognized starting from the target node to obtain a recognition result.
Further, the recognition module is specifically configured to: determine, among the text characters contained in the text segment to be recognized, the text characters contained in the child nodes of the target node, as target text characters; if the child nodes of the node to which a target text character belongs include an entity word identification node, determine the text character string corresponding to that node as a recognition result, where the entity word identification node indicates that the text character contained in its parent node is the terminal text character of an entity word; and if, in the text segment to be recognized, the text character following the target text character is contained in a child node of the node to which the target text character belongs, take that following text character as the target text character and return to the step of determining the text character string corresponding to the node to which the target text character belongs as a recognition result if the child nodes of that node include an entity word identification node.
The embodiment of the invention also provides electronic equipment which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for implementing the steps of any of the above entity word storage methods or methods for identifying entity words in a text segment when executing the program stored in the memory.
The embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the steps of any of the above entity word storage methods or methods for identifying entity words in a text segment are implemented.
Embodiments of the present invention further provide a computer program product including instructions which, when run on a computer, cause the computer to execute any of the above entity word storage methods or methods for identifying entity words in a text segment.
In this scheme, entity word data to be stored is obtained, the entity word data comprising an entity word to be stored and the entity word type to which it belongs, the latter serving as the entity word type to be stored. Based on a pre-established correspondence between entity word types and type text characters, the type text character corresponding to the entity word type to be stored is determined as the target type text character. The target type text character is added before the starting text character of the entity word to be stored to obtain the text character string to be stored, which is stored into a pre-established dictionary tree in which the child nodes of the root node are used for storing type text characters. Because every entity word stored in the dictionary tree carries a type text character that distinguishes its entity word type, the dictionary tree provides a storage structure in which entity words of different types can be told apart quickly, which improves the efficiency of identifying entity words of a specified entity word type in a query.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart of an entity word storage method according to an embodiment of the present invention;
FIG. 2 is a diagram of a typical dictionary tree;
FIG. 3 is a diagram of a dictionary tree according to an embodiment of the present invention;
FIG. 4 is a diagram of a dictionary tree according to another embodiment of the present invention;
FIG. 5 is a flowchart of an entity word storage method according to another embodiment of the present invention;
FIG. 6 is a flowchart of an entity word storage method according to another embodiment of the present invention;
FIG. 7 is a diagram of a dictionary tree according to another embodiment of the present invention;
FIG. 8 is a diagram illustrating a storage route of text characters in a dictionary tree according to an embodiment of the present invention;
FIG. 9 is a flowchart of a method for identifying entity words in a text segment according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an entity word storage device according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a device for identifying entity words in a text segment according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to improve the efficiency of identifying entity words of a specified entity word type in a query, embodiments of the present invention provide an entity word storage method, an entity word storage device, and electronic equipment. Embodiments of the present invention are described below with reference to the accompanying drawings. The embodiments in the present application, and the features of those embodiments, may be combined with each other where they do not conflict.
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
In an embodiment of the present invention, there is provided an entity word storage method, as shown in fig. 1, the method including the steps of:
S101: acquiring entity word data to be stored, wherein the entity word data comprises entity words to be stored and entity word types to which the entity words to be stored belong, and the entity word types to which the entity words to be stored belong are used as the entity word types to be stored.
S102: determining type text characters corresponding to the entity word types to be stored as target type text characters based on the pre-established corresponding relationship between the entity word types and the type text characters.
S103: adding target type text characters before the initial text characters of the entity words to be stored to obtain text character strings which are used as the text character strings to be stored.
S104: storing the text character strings to be stored into a pre-established dictionary tree, wherein child nodes of a root node in the dictionary tree are used for storing type text characters.
In the entity word storage method provided in this embodiment of the present invention, entity word data to be stored is obtained, the entity word data including an entity word to be stored and the entity word type to which it belongs, the latter serving as the entity word type to be stored. Based on a pre-established correspondence between entity word types and type text characters, the type text character corresponding to the entity word type to be stored is determined as the target type text character. The target type text character is added before the starting text character of the entity word to be stored to obtain the text character string to be stored, which is stored into a pre-established dictionary tree in which the child nodes of the root node are used for storing type text characters. Because every entity word stored in the dictionary tree carries a type text character that distinguishes its entity word type, the dictionary tree provides a storage structure in which entity words of different types can be told apart quickly, which improves the efficiency of identifying entity words of a specified entity word type in a query.
In the entity word storage method shown in fig. 1 provided in the embodiment of the present invention, for step S101, entity words to be stored included in the entity word data to be stored may be collected in advance, and optionally, the collection of the entity words may be implemented in various ways.
For example, for a given vertical scene, the vertical websites associated with that scene may be obtained first. For an automobile scene, the associated vertical websites may be websites on the internet that introduce automobiles, official websites of automobile manufacturers, or automobile trading websites; for a game scene, they may be game forums, websites providing game download services, or question-and-answer websites about games. The text data in the vertical websites is then extracted and processed to obtain the entity words it contains.
To make it more efficient to later look up entity words belonging to a specific vertical scene in a specified text segment, the entity word type of each entity word to be stored needs to be determined before the entity word is stored.
Optionally, the entity word type of an entity word to be stored may be determined from the source from which the entity word was obtained. For example, since the text data in a vertical website associated with a vertical scene is mostly related to that vertical field, the entity word type of an entity word obtained from the vertical website may be set to the entity word type associated with that website. Optionally, in order to determine the entity word types of the collected entity words accurately, manual labeling may be used, or the entity word types may be recognized by a trained neural network model.
Optionally, the entity word data to be stored may be recorded in a table of entity words to be stored, and obtained by reading that table. Further, the entity word data to be stored may be recorded in the format (entity word: entity word type), such as (suit: clothing type), where "suit" is the entity word to be stored and "clothing type" is the entity word type to which it belongs.
With respect to step S102, one type text character represents one entity word type, so the type text characters corresponding to different entity word types differ from each other, and entity word types can therefore be distinguished by the type text characters alone.
The type text character corresponding to each entity word type can be determined according to actual requirements.
For example, if the entity word type is a location type, the type text character corresponding to the entity word type may be represented by a word, such as "bit", a number, such as "1", or a letter, such as "a".
Optionally, the pre-established correspondence between entity word types and type text characters may be recorded in the form of a correspondence table, for example, as shown in Table 1 below:
TABLE 1
Entity word type | Location type | Brand type | Food type | Clothing type
Type text character | A | B | C | D
In Table 1, the first row lists the entity word types and the second row lists the type text characters corresponding to them. A, B, C and D may be any text characters; for storage convenience, a serial number may be used as the type text character of each entity word type, for example, A may be 1, B may be 2, C may be 3, and D may be 4. It should be noted that the type text characters corresponding to different entity word types are different from each other.
For step S103, the starting text character of the entity word to be stored is its first text character and, similarly, the end text character is its last text character. For example, for the entity word "australian X street", the starting text character is "australian" and the end text character is "street" (each word of this rendering stands for one text character of the original entity word).
The target type text character is added before the starting text character. For example, if the entity word to be stored is "australian X street" and the target type text character is A, the text character string to be stored is "A australian X street".
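Steps S102 and S103 can be illustrated with a minimal sketch. The dictionary `TYPE_TEXT_CHARS`, the type names used as keys, and the function `build_string_to_store` are hypothetical names chosen for illustration; only the A/B/C/D assignment follows Table 1, and the entity word "suit" is the example record given earlier.

```python
# Hypothetical correspondence table between entity word types and
# type text characters, following the A/B/C/D example of Table 1.
TYPE_TEXT_CHARS = {
    "location": "A",
    "brand": "B",
    "food": "C",
    "clothing": "D",
}


def build_string_to_store(entity_word: str, entity_word_type: str) -> str:
    """Prefix an entity word with the type text character of its type."""
    type_char = TYPE_TEXT_CHARS[entity_word_type]  # step S102: look up the type text character
    return type_char + entity_word                 # step S103: add it before the starting character


# The (suit: clothing type) record becomes the string "Dsuit".
print(build_string_to_store("suit", "clothing"))
```

Because every entity word type maps to a different type text character, the first character of the resulting string alone identifies the type.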
With respect to step S104, the children nodes of the root node in the pre-established trie are used to store type text characters.
For the sake of understanding, the storage manner of the dictionary tree is briefly introduced first:
Fig. 2 is a schematic diagram of a typical dictionary tree in the prior art. The root node contains no text character, and every node other than the root node contains one text character. For any node in the dictionary tree, its child nodes contain mutually different text characters; the node path of a node is the path from the root node to that node; and the text character string corresponding to a node is the character string formed by the text characters along its node path, in path order.
In the dictionary tree shown in fig. 2, the gray nodes indicate that the text character strings corresponding to the nodes are entity words, and the text characters contained in the gray nodes are the end text characters of the entity words.
Illustratively, in the dictionary tree shown in fig. 2, the path of the node r is root node-a-c-d-r, and the text character string corresponding to the node r is acdr; the path of the node f whose parent node is node c is root node-a-c-f, and the corresponding text character string is acf; the path of the node f whose parent node is node b is root node-b-f, and the corresponding text character string is bf. The dictionary tree shown in fig. 2 stores the entity words ab, acd, acdr, acf, ba, and bf.
In each embodiment provided by the application, the child nodes of the root node in the dictionary tree do not store text characters contained in the entity words to be stored; instead, they store the type text characters corresponding to the entity word types to which the entity words to be stored belong. The nodes storing entity words of the same entity word type therefore all lie below the node containing the same type text character, so that entity words are partitioned by entity word type at storage time.
Fig. 3 is a schematic diagram of the dictionary tree provided in the embodiment of the present invention. The root node in the diagram has three child nodes, which contain the type text characters A, B and C, each representing a different entity word type. The entity word "australian X street" is stored under the node containing the type text character A, the entity word "ken X base" under the node containing the type text character B, and the entity word "fried chicken" under the node containing the type text character C.
When the entity word to be stored is 'gold X gate' and the type text character is B, adding the type text character B before the initial text character 'gold' of the 'gold X gate' to obtain a text character string 'B gold X gate' to be stored, and storing the 'B gold X gate' into the dictionary tree diagram shown in fig. 3 to obtain the dictionary tree diagram shown in fig. 4.
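The partitioning by type text character can be sketched with a plain nested-dictionary trie. This is an illustrative stand-in, not the double-array layout described later; the marker key `END`, the function `insert`, and the short ASCII words used in place of the entity words in the figures are all assumptions.

```python
END = "<mark>"  # hypothetical key marking the end of an entity word


def insert(root: dict, type_char: str, entity_word: str) -> None:
    """Store type_char + entity_word, one text character per node."""
    node = root
    for ch in type_char + entity_word:
        node = node.setdefault(ch, {})  # reuse an existing child or create one
    node[END] = True


root = {}
insert(root, "A", "oak")  # stand-in for a location-type word
insert(root, "B", "kxc")  # stand-in for a brand-type word
insert(root, "C", "fry")  # stand-in for a food-type word
insert(root, "B", "gxg")  # a further brand-type word, as in the fig. 3 to fig. 4 step

print(sorted(root))       # only type text characters sit directly under the root
print(sorted(root["B"]))  # first characters of the stored brand-type words
```

After the last insertion, both brand-type words hang below the single child node "B" of the root, which is what allows a later search to be restricted to one entity word type.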
On the basis of the foregoing embodiment, an embodiment of the present invention further provides another entity word storage method, which is used for implementing step S104 in the entity word storage method shown in fig. 1, and as shown in fig. 5, the method includes:
S501: determining the sub-text character string of the text character string to be stored that is already stored in the dictionary tree as the stored text character string, and determining the sub-text character string of the text character string to be stored that is not stored in the dictionary tree as the unstored text character string.
In this step, each text character string to be stored may be divided into two parts: the first part is the sub-text character string already stored in the dictionary tree, which is taken as the stored text character string, and the second part is the sub-text character string not yet stored in the dictionary tree, which is taken as the unstored text character string.
For example, on the basis of the dictionary tree shown in fig. 4, when the new text character string to be stored is "C fried chicken block", since "C fried chicken" is already stored in the dictionary tree, the stored text character string is "C fried chicken" and the unstored text character string is "block".
It should be noted that, for any text character string to be stored, either the stored text character string or the unstored text character string may be empty. When the stored text character string is empty, none of the text characters contained in the child nodes of the root node is the starting text character of the text character string to be stored; when the unstored text character string is empty, the text character string to be stored has already been stored in the dictionary tree.
S502: starting from the starting text character of the unstored text character string, sequentially storing each text character according to the composition order of the text characters in the unstored text character string.
This step may be performed when the unstored text character string is not empty. The composition order of the text characters in the unstored text character string is the order in which they appear in it. For example, if the text character string to be stored is "C popcorn" and the unstored text character string is "popcorn", whose composition order is "pop-m-flower", then "pop" is stored first, then "m", and finally "flower".
In the entity word storage method shown in fig. 5 provided in the embodiment of the present invention, the sub-text character string of the text character string to be stored that is already stored in the dictionary tree is determined as the stored text character string, the sub-text character string that is not stored in the dictionary tree is determined as the unstored text character string, and the text characters of the unstored text character string are stored sequentially in their composition order, starting from its starting text character.
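A minimal sketch of steps S501 and S502 on a nested-dictionary trie follows. The functions `split_stored` and `store` are hypothetical names, the strings are ASCII stand-ins, and the end-of-word marking discussed later is deliberately left out.

```python
def split_stored(root: dict, text: str):
    """Return (stored part, unstored part, last matched node) for text."""
    node, matched = root, 0
    for ch in text:
        if ch not in node:
            break            # first character that is not yet in the tree
        node = node[ch]
        matched += 1
    return text[:matched], text[matched:], node


def store(root: dict, text: str):
    """Store text and report how it was split."""
    stored, unstored, node = split_stored(root, text)  # step S501
    for ch in unstored:                                # step S502: keep composition order
        node[ch] = {}
        node = node[ch]
    return stored, unstored


root = {}
print(store(root, "Cab"))   # nothing stored yet: the stored part is empty
print(store(root, "Cabc"))  # "Cab" is found, only "c" still has to be stored
print(store(root, "Cab"))   # already present: the unstored part is empty
```

The three calls cover the three cases named in the text: an empty stored part, a non-empty split, and an empty unstored part.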
On the basis of the foregoing embodiment, an embodiment of the present invention further provides another entity word storage method for implementing step S502 in the entity word storage method shown in fig. 5, in which the starting text character of the unstored text character string is taken as the unstored starting text character. As shown in fig. 6, the method includes:
S601: in the text character string to be stored, determining the offset parameter of the text character preceding the unstored starting text character as the target offset parameter.
In this step, the dictionary tree may be a Double-Array Trie, in which each node is represented by two parameters, namely an offset parameter (base) and a check parameter (check). The offset parameter of a node represents the offset applied when the child nodes of the node store text characters, and the check parameter of a node represents the offset applied when the node itself stored its text character, so that the check parameter of a node is equal to the offset parameter of its parent node.
It should be noted that all child nodes of a node have the same check parameter, that is, nodes with the same check parameter have the same parent node.
Specifically, the offset parameter and the check parameter of the root node may be set in advance; for example, the offset parameter of the root node is 1 and the check parameter is 0, as in the worked example given later.
For a node, it may not have an offset parameter when it does not have a child node, and the offset parameter of the node may be determined only when a child node needs to be generated on the basis of the node.
Thus, when the unstored starting text character is to be stored, the offset parameter of its parent node may or may not already exist.
Therefore, optionally, in the case that the text character preceding the unstored starting text character already has a predetermined offset parameter, the predetermined offset parameter is used as the target offset parameter;
the existence of a predetermined offset parameter for the text character preceding the unstored starting text character indicates that the unstored starting text character has sibling nodes, so that the offset parameter used when the text characters contained in the sibling nodes were stored can be used as the target offset parameter.
Optionally, in the case that the text character preceding the unstored starting text character has no predetermined offset parameter, the unused offset parameter with the smallest value is determined as the target offset parameter.
If the text character preceding the unstored starting text character has no predetermined offset parameter, the unstored starting text character has no sibling node. In this case, the unused offset parameter with the smallest value may be selected as the target offset parameter, or an offset parameter may be randomly selected from among the unused offset parameters as the target offset parameter.
S602: based on the target offset parameter, determining the storage location of the unstored starting text character as the target storage location.
In this step, since the offset parameter of a node indicates the offset applied when the child nodes of the node store text characters, once the target offset parameter is determined, the offset required by the unstored starting text character is known, and the target storage location of the unstored starting text character can then be determined.
In one embodiment, the sum of the target offset parameter and the encoding value of the unstored starting text character may be calculated as a preprocessed value, the storage index of the unstored starting text character may be determined based on the preprocessed value as the target storage index, and finally the storage location associated with the target storage index may be determined as the target storage location based on a pre-established association between storage indexes and storage locations.
The encoding value of the unstored starting text character may be the numerical value of its Unicode code; for example, the Unicode code value of the text character "一" ("one") is 19968, and that of the text character "举" ("lift") is 20030.
Because the Unicode code of each text character is unique, adding the encoding value of the unstored starting text character to the target offset parameter ensures that the preprocessed values of all child nodes of the same parent node are different.
Further, the target storage index may be obtained by adding a preset value to the preprocessed value. The preset value is used to avoid a conflict with the root node, and may therefore be equal to the offset parameter of the root node.
Illustratively, suppose the offset parameter of the root node is base[root] = 1, the unstored starting text character is "1", whose Unicode code value, known in advance to be 49, is denoted char[1], and the preset value is 1. The storage index of "1" is then base[root] + char[1] + 1 = 1 + 49 + 1 = 51.
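The index calculation of step S602 can be written as a one-line function. This is a sketch under the assumptions stated in the text (Unicode code value as encoding value, preset value 1); the names `PRESET` and `storage_index` are hypothetical.

```python
PRESET = 1  # preset value, assumed equal to the root node's offset parameter


def storage_index(target_offset: int, ch: str, preset: int = PRESET) -> int:
    """Target offset parameter + Unicode code value of ch + preset value."""
    return target_offset + ord(ch) + preset


# Root offset 1 and the character "1" (code value 49) give index 51.
print(storage_index(1, "1"))
```

Two different characters stored under the same parent share the same target offset but have different code values, so they always receive different indexes.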
Optionally, the association relationship between the storage indexes and the storage locations may be determined according to actual requirements, and each storage index is associated with one storage location.
In one embodiment, the storage locations in the dictionary tree are located in a plurality of pre-divided storage blocks (blocks). When a storage block is full, a new storage block may be generated to store data, and each storage block may be a 256-bit storage block.
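One possible association between storage indexes and block-based storage locations is sketched below. The text does not specify the mapping, so everything here is an assumption: the class `BlockStore`, the reading of a block as 256 slots, the division-based mapping, and the allocation of a block when an index first falls into it.

```python
BLOCK_SIZE = 256  # assumption: each block holds 256 slots


class BlockStore:
    """Maps a storage index to a (block number, slot) storage location."""

    def __init__(self):
        self.blocks = {}  # block number -> list of slots, created on demand

    @staticmethod
    def locate(index: int):
        return divmod(index, BLOCK_SIZE)  # (block number, slot within block)

    def put(self, index: int, value) -> None:
        block_no, slot = self.locate(index)
        block = self.blocks.setdefault(block_no, [None] * BLOCK_SIZE)
        block[slot] = value

    def get(self, index: int):
        block_no, slot = self.locate(index)
        block = self.blocks.get(block_no)
        return None if block is None else block[slot]


store_area = BlockStore()
store_area.put(51, "1")             # index 51 falls into block 0, slot 51
print(store_area.locate(51), store_area.get(51))
```

Creating blocks lazily keeps memory proportional to the blocks actually touched, which matters because storage indexes derived from Unicode code values are sparse.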
S603: storing the unstored starting text character at the target storage location.
In this step, the unstored starting text character may be stored at the target storage location; it can be understood that the dictionary tree then includes a node containing this text character.
In one embodiment, the unstored text character string and the unstored starting text character are then updated, that is, the stored character is deleted from the original unstored text character string, and the text character following it is taken as the new unstored starting text character.
Illustratively, if the unstored text character string is "fried chicken wings", then after the unstored starting text character "fried" is stored, the new unstored text character string is "chicken wings" and the new unstored starting text character is "chicken".
Further, the method may return to S601 until every text character in the unstored text character string has been stored.
In the entity word storage method shown in fig. 6 provided in the embodiment of the present invention, in the text character string to be stored, the offset parameter of the text character preceding the unstored starting text character is determined as the target offset parameter, the storage location of the unstored starting text character is determined as the target storage location based on the target offset parameter, and the unstored starting text character is stored at the target storage location.
In one embodiment, on the basis of the above embodiment, as in the dictionary tree shown in fig. 7, after the end text character of the text character string to be stored has been stored, an entity word identification node (a node containing "mark" in the figure) may be added below the node where the end text character is located. The entity word identification node indicates that the text character contained in its parent node is the end text character of an entity word.
In one embodiment, in order to facilitate subsequent recognition of the end text character of each stored entity word, the offset parameter of the entity word identification node may be set to the largest negative number that is not yet occupied, the encoding value of the entity word identification node may be set to the negative of the preset value, and the offset parameter of the end text character may be set to the difference between the storage index of the end node and its check parameter, plus the preset value.
In one embodiment, the entity words to be stored are "一举", "一举成名" and "二号", which all belong to the same entity word type whose type text character is 1, so that the text character strings to be stored are "1一举", "1一举成名" and "1二号".
The Unicode code of each text character is given first, see Table 2:
TABLE 2
Character | 1 | 一 | 举 | 成 | 名 | 二 | 号
Unicode code | 49 | 19968 | 20030 | 25104 | 21517 | 20108 | 21495
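The values in Table 2 are the decimal Unicode code points of the digit "1" and of the characters 一, 举, 成, 名, 二 and 号, which can be checked with Python's built-in `ord`. The variable names below are illustrative only.

```python
# Characters of the worked example, in the column order of Table 2.
chars = ["1", "一", "举", "成", "名", "二", "号"]

# ord() returns the Unicode code point of a single character as an integer.
codes = {ch: ord(ch) for ch in chars}

for ch, code in codes.items():
    print(ch, code, f"U+{code:04X}")
```

These integers are the values written as char[ ] in the calculations of the worked example.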
In order to follow the changes of the offset parameter and the check parameter of each text character more clearly and intuitively, a record table as shown in Table 3 is established:
TABLE 3
char | x1
i | 0
base | 1
check | 0
The char row lists the text characters to be stored (in the formulas below, char[ ] denotes the Unicode code value of a character), the i row is the storage index, the base row is the offset parameter, and the check row is the check parameter. x1 represents the root node, which is empty, with i[x1] = 0, base[x1] = 1 and check[x1] = 0. The preset value is 1, and the offset parameter of an end text node is calculated as base[ ] = i[ ] - check[ ] + 1.
When storing the text character "1", one can calculate:
i[1] = base[x1] + char[1] + 1 = 1 + 49 + 1 = 51; check[1] = base[x1] = 1.
Table 4 is obtained:
TABLE 4
char | x1 | 1
i | 0 | 51
base | 1 | unset
check | 0 | 1
When the text character "一" ("one") needs to be stored, the offset parameter of the text character "1" does not yet exist, so base[1] must be determined first. Since the value 1 is already occupied by base[x1], the smallest unused value is 2, so base[1] = 2, and then:
i[一] = base[1] + char[一] + 1 = 2 + 19968 + 1 = 19971; check[一] = base[1] = 2.
Table 5 is obtained:
TABLE 5
char | x1 | 1 | 一
i | 0 | 51 | 19971
base | 1 | 2 | unset
check | 0 | 1 | 2
Because the text character "二" ("two") is a sibling of the text character "一" ("one"), it can be stored next. Since the offset parameter base[1] of the text character "1" already exists, the target offset parameter of the text character "二" is base[1], and then:
i[二] = base[1] + char[二] + 1 = 2 + 20108 + 1 = 20111; check[二] = base[1] = 2.
Table 6 is obtained:
TABLE 6
char | x1 | 1 | 一 | 二
i | 0 | 51 | 19971 | 20111
base | 1 | 2 | unset | unset
check | 0 | 1 | 2 | 2
When the text character "举" ("lift") needs to be stored, the offset parameter of the text character "一" ("one") does not yet exist, so base[一] must be determined first. Since the values 1 and 2 are already occupied by base[x1] and base[1], the smallest unused value is 3, so base[一] = 3, and then:
i[举] = base[一] + char[举] + 1 = 3 + 20030 + 1 = 20034; check[举] = base[一] = 3.
Since the text character "举" is the end text character of the entity word "一举", an entity word identification node x2 can be created below the node where "举" is located. At this time, base[举] = i[举] - check[举] + 1 = 20034 - 3 + 1 = 20032, and, with the encoding value char[x2] = -1 (the negative of the preset value):
i[x2] = base[举] + char[x2] + 1 = 20032 - 1 + 1 = 20032; check[x2] = base[举] = 20032.
Further, since -1 is the largest unoccupied negative number, base[x2] is set to -1.
Table 7 is obtained:
TABLE 7
char | x1 | 1 | 一 | 二 | 举 | x2
i | 0 | 51 | 19971 | 20111 | 20034 | 20032
base | 1 | 2 | 3 | unset | 20032 | -1
check | 0 | 1 | 2 | 2 | 3 | 20032
When the text character "成" ("become") needs to be stored, the offset parameter of the text character "举" ("lift") was already determined when i[x2] was determined (base[举] = 20032), so:
i[成] = base[举] + char[成] + 1 = 20032 + 25104 + 1 = 45137; check[成] = base[举] = 20032.
Table 8 is obtained:
TABLE 8
char | x1 | 1 | 一 | 二 | 举 | x2 | 成
i | 0 | 51 | 19971 | 20111 | 20034 | 20032 | 45137
base | 1 | 2 | 3 | unset | 20032 | -1 | unset
check | 0 | 1 | 2 | 2 | 3 | 20032 | 20032
When the text character "名" ("name") needs to be stored, the offset parameter of the text character "成" ("become") does not yet exist, so base[成] must be determined first. Since the values 1, 2, 3 and 20032 are already occupied, the smallest unused value is 4, so base[成] = 4, and then:
i[名] = base[成] + char[名] + 1 = 4 + 21517 + 1 = 21522; check[名] = base[成] = 4.
Since the text character "名" is the end text character of the entity word "一举成名", an entity word identification node x3 can be created below the node where "名" is located. At this time, base[名] = i[名] - check[名] + 1 = 21522 - 4 + 1 = 21519, and then:
i[x3] = base[名] + char[x3] + 1 = 21519 - 1 + 1 = 21519; check[x3] = base[名] = 21519.
Further, since -1 is occupied, -2 is the largest unoccupied negative number, so base[x3] is set to -2.
Table 9 is obtained:
TABLE 9
char | x1 | 1 | 一 | 二 | 举 | x2 | 成 | 名 | x3
i | 0 | 51 | 19971 | 20111 | 20034 | 20032 | 45137 | 21522 | 21519
base | 1 | 2 | 3 | unset | 20032 | -1 | 4 | 21519 | -2
check | 0 | 1 | 2 | 2 | 3 | 20032 | 20032 | 4 | 21519
When the text character "号" ("number") needs to be stored, the offset parameter of the text character "二" ("two") does not yet exist, so base[二] must be determined first. Since the values 1, 2, 3, 4, 21519 and 20032 are already occupied, the smallest unused value is 5, so base[二] = 5, and then:
i[号] = base[二] + char[号] + 1 = 5 + 21495 + 1 = 21501; check[号] = base[二] = 5.
Since the text character "号" is the end text character of the entity word "二号", an entity word identification node x4 can be created below the node where "号" is located. At this time, base[号] = i[号] - check[号] + 1 = 21501 - 5 + 1 = 21497, and then:
i[x4] = base[号] + char[x4] + 1 = 21497 - 1 + 1 = 21497; check[x4] = base[号] = 21497.
Further, since -1 and -2 are occupied, -3 is the largest unoccupied negative number, so base[x4] is set to -3.
Table 10 is obtained:
TABLE 10
char | x1 | 1 | 一 | 二 | 举 | x2 | 成 | 名 | x3 | 号 | x4
i | 0 | 51 | 19971 | 20111 | 20034 | 20032 | 45137 | 21522 | 21519 | 21501 | 21497
base | 1 | 2 | 3 | 5 | 20032 | -1 | 4 | 21519 | -2 | 21497 | -3
check | 0 | 1 | 2 | 2 | 3 | 20032 | 20032 | 4 | 21519 | 5 | 21497
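The bookkeeping of the walkthrough can be replayed with a short simulation. It is a sketch, not a real double-array implementation: the parameters are kept in a dictionary keyed by node path, and the class `Walkthrough`, the marker suffix `#`, and the rule of keeping an existing offset parameter when an end node already has one are assumptions. The three strings stored are "1一举", "1一举成名" and "1二号", as implied by the Unicode values in Table 2.

```python
PRESET = 1  # preset value used throughout the walkthrough


class Walkthrough:
    def __init__(self):
        # node path -> parameters; the empty path is the root node x1
        self.nodes = {"": {"i": 0, "base": 1, "check": 0}}
        self.used = {1}  # offset parameters already occupied

    def _ensure_base(self, path: str) -> int:
        """Return the node's offset parameter, assigning the smallest unused one if unset."""
        node = self.nodes[path]
        if node["base"] is None:
            value = 1
            while value in self.used:
                value += 1
            node["base"] = value
            self.used.add(value)
        return node["base"]

    def _mark_end(self, path: str) -> None:
        """Add the entity word identification node below the end text character."""
        if path + "#" in self.nodes:
            return  # this entity word is already marked
        node = self.nodes[path]
        if node["base"] is None:
            node["base"] = node["i"] - node["check"] + PRESET
            self.used.add(node["base"])
        negative = -1
        while negative in self.used:  # largest negative number not yet occupied
            negative -= 1
        self.used.add(negative)
        self.nodes[path + "#"] = {
            "i": node["base"] + (-PRESET) + PRESET,  # encoding value of the marker is -PRESET
            "base": negative,
            "check": node["base"],
        }

    def store(self, text: str) -> None:
        path = ""
        for ch in text:
            child = path + ch
            if child not in self.nodes:  # unstored character
                base = self._ensure_base(path)
                self.nodes[child] = {"i": base + ord(ch) + PRESET, "base": None, "check": base}
            path = child
        self._mark_end(path)


w = Walkthrough()
for text in ["1一举", "1一举成名", "1二号"]:
    w.store(text)
for path, params in w.nodes.items():
    print(repr(path), params)
```

The simulation stores "二" after "名" rather than directly after "一", but each value depends only on the parent's offset parameter and on the set of occupied values at the time of assignment, so the storage order of siblings does not change the results here.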
Fig. 8 is a schematic diagram illustrating a storage route of text characters in a dictionary tree in the above embodiment, where the direction indicated by an arrow in the diagram indicates a storage order.
In another embodiment of the present invention, as shown in fig. 9, there is further provided a method for recognizing entity words in a text passage, including:
S901: acquiring text segment data to be recognized, wherein the text segment data includes the text segment to be recognized and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation.
S902: determining, based on the pre-established correspondence between entity word types and type text characters, the type text character corresponding to the specified entity word type as the specified type text character.
S903: recognizing the text segment to be recognized based on the specified type text character and a pre-established dictionary tree to obtain a recognition result, and taking the recognition result as an entity word belonging to the specified entity word type, wherein child nodes of the root node in the dictionary tree are used to store type text characters.
In the method for recognizing entity words in a text segment shown in fig. 9, text segment data to be recognized is acquired, where the text segment data includes the text segment to be recognized and a specified entity word type determined based on a received entity word type selection operation. Based on the pre-established correspondence between entity word types and type text characters, the type text character corresponding to the specified entity word type is determined as the specified type text character. The text segment to be recognized is then recognized based on the specified type text character and a pre-established dictionary tree, in which the child nodes of the root node store type text characters, and the recognition result is taken as an entity word belonging to the specified entity word type. Because the specified entity word type is determined based on the received entity word type selection operation, the text segment can be recognized under the specified entity word type only, which improves the efficiency of recognizing entity words of the specified entity word type in the text segment to be recognized.
For the above embodiment, in step S901, the text segment to be recognized may be a query to be recognized. The specified entity word type is determined based on the received entity word type selection operation: before the query is recognized, an entity word type selection interface may be displayed so that the user can select the entity word type to which the entity words to be recognized belong.
For example, the text segment to be recognized may be a query such as "is the fried chicken at ken X base on australian X street good", and the specified entity word type may be the brand type.
Regarding step S902, the present step is the same as or similar to step S102, and is not repeated herein.
For step S903, since the child nodes of the root node in the dictionary tree store the type text characters, the specified type text character determines the part of the dictionary tree to be searched for the text segment to be recognized, which narrows the search range and further improves recognition efficiency.
In one embodiment, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character may be determined as the target node, and the text segment to be recognized is recognized starting from the target node to obtain the recognition result.
For example, in fig. 3, if the specified type text character is B and the text segment to be recognized is "is the fried chicken at ken X base on australian X street good", the entity words in the text segment may be recognized starting from the target node containing the type text character B, giving the recognition result "ken X base".
Optionally, the recognition may proceed as follows. Among the text characters contained in the text segment to be recognized, a text character contained in a child node of the target node is determined as the target text character. If the child nodes of the node to which the target text character belongs include an entity word identification node, the text character string corresponding to that node is determined as a recognition result, where the entity word identification node indicates that the text character contained in its parent node is the end text character of an entity word. If, in the text segment to be recognized, the text character following the target text character is contained in a child node of the node to which the target text character belongs, that following text character is taken as the new target text character, and the check for an entity word identification node is performed again.
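The recognition procedure can be sketched on a nested-dictionary trie. The marker key `END`, the functions `insert` and `recognize`, and the ASCII stand-in words and query are hypothetical; the sketch reports every stored word of the specified type that occurs in the text and reports a word without its type text character, which is one possible reading of the recognition result.

```python
END = "<mark>"  # hypothetical stand-in for the entity word identification node


def insert(root: dict, type_char: str, entity_word: str) -> None:
    node = root
    for ch in type_char + entity_word:
        node = node.setdefault(ch, {})
    node[END] = True


def recognize(root: dict, type_char: str, text: str) -> list:
    """Return the entity words of the specified type that occur in text."""
    target = root.get(type_char)  # child of the root holding the specified type text character
    if target is None:
        return []
    results = []
    for start in range(len(text)):      # candidate target text characters
        node = target
        for end in range(start, len(text)):
            node = node.get(text[end])  # follow the next text character, if present
            if node is None:
                break
            if END in node:             # an identification node marks the end of a word
                results.append(text[start:end + 1])
    return results


root = {}
insert(root, "A", "oak st")
insert(root, "B", "kxc")
insert(root, "C", "fried chicken")

query = "is the kxc fried chicken on oak st good"
print(recognize(root, "B", query))  # only brand-type words are searched
print(recognize(root, "C", query))  # only food-type words are searched
```

Because the search starts below a single child of the root, words of other entity word types are never visited, which is the efficiency gain described for step S903.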
Based on the same inventive concept, according to the entity word storage method provided in the embodiments of the present invention, an embodiment of the present invention further provides an entity word storage device, as shown in fig. 10, the device includes:
an entity word obtaining module 1001, configured to obtain entity word data to be stored, where the entity word data includes an entity word to be stored and an entity word type to which the entity word to be stored belongs, and the entity word type to which the entity word to be stored belongs is used as the entity word type to be stored;
a first type text character determining module 1002, configured to determine, based on a correspondence between a pre-established entity word type and a type text character, a type text character corresponding to the entity word type to be stored, as a target type text character;
the adding module 1003 is configured to add a target type text character before a starting text character of the entity word to be stored to obtain a text character string serving as the text character string to be stored;
the storage module 1004 is configured to store the text character string to be stored in a pre-established trie, where child nodes of a root node in the trie are used to store the type text character.
Further, the storage module 1004 is specifically configured to determine the sub-text character string of the text character string to be stored that is already stored in the dictionary tree as the stored text character string, determine the sub-text character string of the text character string to be stored that is not stored in the dictionary tree as the unstored text character string, and, starting from the starting text character of the unstored text character string, sequentially store each text character according to the composition order of the text characters in the unstored text character string.
Further, the starting text character of the unstored text character string is taken as the unstored starting text character;
the storage module 1004 is specifically configured to determine, in the text character string to be stored, the offset parameter of the text character preceding the unstored starting text character as the target offset parameter, determine the storage location of the unstored starting text character as the target storage location based on the target offset parameter, store the unstored starting text character at the target storage location, update the unstored text character string and the unstored starting text character, and return to the step of determining the offset parameter of the text character preceding the unstored starting text character, until every text character in the unstored text character string has been stored.
Further, the storage module 1004 is specifically configured to, in the case that the text character preceding the unstored starting text character already has a predetermined offset parameter, take the predetermined offset parameter as the target offset parameter; alternatively, in the case that the text character preceding the unstored starting text character has no predetermined offset parameter, determine the unused offset parameter with the smallest value as the target offset parameter.
Further, the storage module 1004 is specifically configured to calculate the sum of the target offset parameter and the encoding value of the unstored starting text character as a preprocessed value, determine the storage index of the unstored starting text character based on the preprocessed value as the target storage index, and determine the storage location associated with the target storage index as the target storage location based on a pre-established association between storage indexes and storage locations.
Further, the storage module 1004 is further configured to add an entity word identification node on the basis of the node where the end text character is located after the end text character in the text character string to be stored is stored, where the entity word identification node is used to indicate that the text character included in the parent node of the entity word identification node is the end text character.
Further, each storage position in the dictionary tree is located in a plurality of storage blocks divided in advance.
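One way to picture storage positions that lie in pre-divided storage blocks is the following Python sketch. The block size, the number of blocks, and the `divmod` mapping from a storage index to a (block, position) pair are hypothetical choices made for illustration; the source only states that the storage positions are located in storage blocks divided in advance.

```python
BLOCK_SIZE = 4   # hypothetical block capacity, kept small for illustration
NUM_BLOCKS = 8   # hypothetical number of blocks divided in advance


class BlockStore:
    """Maps a storage index to a cell inside one of several fixed-size blocks."""

    def __init__(self):
        # All blocks are created up front, before any character is stored.
        self.blocks = [[None] * BLOCK_SIZE for _ in range(NUM_BLOCKS)]

    def locate(self, index):
        # Returns (block number, position inside that block).
        return divmod(index, BLOCK_SIZE)

    def put(self, index, value):
        block_no, pos = self.locate(index)
        self.blocks[block_no][pos] = value

    def get(self, index):
        block_no, pos = self.locate(index)
        return self.blocks[block_no][pos]


store = BlockStore()
store.put(9, "x")   # index 9 lands in block 2, position 1
```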
Based on the same inventive concept as the method for recognizing entity words in a text segment provided by the embodiments of the present invention, an embodiment of the present invention further provides a device for recognizing entity words in a text segment. As shown in fig. 11, the device includes:
a text segment data obtaining module 1101, configured to obtain text segment data to be recognized, where the text segment data includes a text segment to be recognized and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation;
a second type text character determining module 1102, configured to determine, based on a pre-established correspondence between an entity word type and a type text character, a type text character corresponding to the entity word type as a specified type text character;
a recognition module 1103, configured to recognize the text segment to be recognized based on the specified type text character and a pre-established dictionary tree to obtain a recognition result, and use the recognition result as an entity word belonging to the specified entity word type, where child nodes of a root node in the dictionary tree are used to store the type text characters.
Further, the recognition module 1103 is specifically configured to determine, among the child nodes of the root node of the pre-established dictionary tree, the child node that contains the specified type text character as a target node, and to recognize the text segment to be recognized starting from the target node, so as to obtain the recognition result.
Furthermore, the recognition module 1103 is specifically configured to: determine, among the text characters contained in the text segment to be recognized, a text character that is contained in a child node of the target node, as a target text character; if the child nodes of the node to which the target text character belongs include an entity word identification node, determine the text character string corresponding to the node to which the target text character belongs as a recognition result, where the entity word identification node indicates that the text character contained in its parent node is the end text character of an entity word; and if, in the text segment to be recognized, the text character following the target text character is contained in a child node of the node to which the target text character belongs, take that following text character as the target text character and return to the step of checking whether the child nodes of the node to which the target text character belongs include an entity word identification node.
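The type-prefixed storage and the recognition walk described above can be sketched in Python as follows. This is a simplified illustration using nested dictionaries rather than the offset-based layout. The mapping `TYPE_CHARS`, the sentinel `END` (standing in for the entity word identification node), and the function names are assumptions and do not come from the source.

```python
END = object()  # stands in for the entity word identification node

# Hypothetical correspondence between entity word types and type text characters.
TYPE_CHARS = {"city": "\u0001", "person": "\u0002"}


def store(root, word, entity_type):
    """Prepend the type character to the word and store the result in the trie."""
    node = root
    for ch in TYPE_CHARS[entity_type] + word:
        node = node.setdefault(ch, {})
    node[END] = True  # marks the parent character as the end of an entity word


def recognize(root, segment, entity_type):
    """Return every entity word of the given type found in the segment."""
    target = root.get(TYPE_CHARS[entity_type])  # child of the root for this type
    if target is None:
        return []
    found = []
    for start in range(len(segment)):
        node = target
        for end in range(start, len(segment)):
            node = node.get(segment[end])
            if node is None:
                break
            if END in node:
                found.append(segment[start:end + 1])
    return found


root = {}
store(root, "Paris", "city")
store(root, "Paris Hilton", "person")
print(recognize(root, "Paris Hilton visited Paris", "city"))    # ['Paris', 'Paris']
print(recognize(root, "Paris Hilton visited Paris", "person"))  # ['Paris Hilton']
```

Because the walk starts from the child of the root that holds the specified type text character, entity words of other types are never visited, even when they share leading characters with words of the requested type.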
An embodiment of the present invention further provides an electronic device, as shown in fig. 12, including a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 complete mutual communication through the communication bus 1204,
a memory 1203 for storing a computer program;
the processor 1201 is configured to implement the steps of the entity word storage method or of the method for recognizing entity words in a text segment when executing the program stored in the memory 1203.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
In another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the entity word storage methods or of the method for recognizing entity words in a text segment.
In another embodiment of the present invention, there is also provided a computer program product including instructions which, when run on a computer, cause the computer to execute any one of the entity word storage methods or the method for recognizing entity words in a text segment in the above embodiments.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a related manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, since the apparatus, the electronic device, the computer-readable storage medium, and the computer program product are substantially similar to the method embodiments, their description is relatively brief, and for relevant details reference may be made to the corresponding parts of the description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An entity word storage method, comprising:
acquiring entity word data to be stored, wherein the entity word data comprises entity words to be stored and entity word types to which the entity words to be stored belong, and the entity word types to which the entity words to be stored belong are used as the entity word types to be stored;
determining type text characters corresponding to the entity word types to be stored as target type text characters based on a corresponding relation between the entity word types and the type text characters established in advance;
adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string as the text character string to be stored;
and storing the text character string to be stored into a pre-established dictionary tree, wherein child nodes of a root node in the dictionary tree are used for storing type text characters.
2. The method according to claim 1, wherein the storing the text string to be stored into a pre-established dictionary tree comprises:
determining sub-text character strings of the text character strings to be stored, which are stored in the dictionary tree, as stored text character strings, and determining sub-text character strings of the text character strings to be stored, which are not stored in the dictionary tree, as non-stored text character strings;
and sequentially storing the text characters from the initial text character of the text character string which is not stored according to the composition sequence of the text characters in the text character string which is not stored.
3. The method of claim 2, wherein a starting text character of the unstored text string is treated as an unstored starting text character;
the sequentially storing the text characters from the initial text character of the text character string not stored according to the composition sequence of the text characters in the text character string not stored comprises:
determining, in the text character string to be stored, the offset parameter of the stored text character preceding the unstored initial text character as a target offset parameter;
determining, based on the target offset parameter, a storage position for the unstored initial text character as a target storage position;
storing the unstored initial text character at the target storage position;
and updating the unstored text character string and the unstored initial text character, and returning to the step of determining the offset parameter of the stored text character preceding the unstored initial text character, until each text character in the unstored text character string is stored.
4. The method of claim 3, wherein the determining the offset parameter of the stored text character preceding the unstored initial text character as a target offset parameter comprises:
taking a predetermined offset parameter as the target offset parameter in the case that the text character preceding the unstored initial text character already has the predetermined offset parameter; or,
determining the offset parameter with the smallest numerical value that has not been used as the target offset parameter in the case that the text character preceding the unstored initial text character has no predetermined offset parameter.
5. The method according to claim 3 or 4, wherein the determining, based on the target offset parameter, a storage position for the unstored initial text character as a target storage position comprises:
calculating the sum of the target offset parameter and the coding numerical value of the unstored initial text character as a preprocessing numerical value;
determining a storage index of the initial text character which is not stored as a target storage index based on the preprocessing numerical value;
and determining a storage position associated with the target storage index as the target storage position based on the pre-established association relationship between the storage index and the storage position.
6. The method of any of claims 1-4, wherein each storage location in the trie is in a pre-partitioned plurality of storage blocks.
7. A method for recognizing entity words in a text segment is characterized by comprising the following steps:
acquiring text segment data to be recognized, wherein the text segment data comprises a text segment to be recognized and a specified entity word type, and the specified entity word type is determined based on received entity word type selection operation;
determining type text characters corresponding to the entity word types as specified type text characters based on a pre-established corresponding relationship between the entity word types and the type text characters;
and recognizing the text segment to be recognized based on the specified type text characters and a pre-established dictionary tree to obtain a recognition result, taking the recognition result as an entity word belonging to the specified entity word type, wherein child nodes of a root node in the dictionary tree are used for storing the type text characters.
8. An entity word storage device, comprising:
the entity word obtaining module is used for obtaining entity word data to be stored, wherein the entity word data comprises entity words to be stored and entity word types to which the entity words to be stored belong, and the entity word types to which the entity words to be stored belong are used as the entity word types to be stored;
the first type text character determining module is used for determining type text characters corresponding to the entity word types to be stored as target type text characters based on the corresponding relationship between the entity word types and the type text characters established in advance;
the adding module is used for adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string which is used as the text character string to be stored;
and the storage module is used for storing the text character string to be stored into a pre-established dictionary tree, and the child nodes of the root node in the dictionary tree are used for storing the type text characters.
9. An apparatus for recognizing entity words in a text segment, comprising:
the text segment data acquisition module is used for acquiring text segment data to be recognized, wherein the text segment data comprises a text segment to be recognized and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation;
the second type text character determining module is used for determining type text characters corresponding to the entity word types based on the pre-established corresponding relationship between the entity word types and the type text characters as specified type text characters;
and the recognition module is used for recognizing the text segment to be recognized based on the specified type text characters and a pre-established dictionary tree to obtain a recognition result, the recognition result is used as an entity word belonging to the specified entity word type, and child nodes of a root node in the dictionary tree are used for storing the type text characters.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 or 7 when executing a program stored in the memory.
CN202010091208.4A 2020-02-13 2020-02-13 Entity word storage method and device and electronic equipment Active CN111309851B (en)

Publications (2)

Publication Number Publication Date
CN111309851A true CN111309851A (en) 2020-06-19
CN111309851B CN111309851B (en) 2023-09-19

Family

ID=71144977

Country Status (1)

Country Link
CN (1) CN111309851B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant