CN111309851B - Entity word storage method and device and electronic equipment - Google Patents

Publication number
CN111309851B
Authority
CN
China
Prior art keywords
text
stored
text character
entity word
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010091208.4A
Other languages
Chinese (zh)
Other versions
CN111309851A (en
Inventor
许晏铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Internet Security Software Co Ltd
Original Assignee
Beijing Kingsoft Internet Security Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Internet Security Software Co Ltd
Priority to CN202010091208.4A
Publication of CN111309851A
Application granted
Publication of CN111309851B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/322: Trees
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides an entity word storage method and device and electronic equipment. The method comprises: acquiring entity word data to be stored; determining, based on the correspondence between entity word types and type text characters, the type text character corresponding to the entity word type to be stored, as the target type text character; adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string to be stored; and storing the text character string to be stored in a pre-established dictionary tree.

Description

Entity word storage method and device and electronic equipment
Technical Field
The present invention relates to the field of data storage technologies, and in particular, to a method and an apparatus for storing entity words, and an electronic device.
Background
With the popularization of the internet and the development of internet technology, people in daily life and work prefer to search the internet for solutions to problems, or for detailed information and purchase links for an item they like. For example, a user interested in travel may search for "is the fried chicken at Ken X base on Australian X Street good?".
Technically, a query represents the content searched by the user, and identifying the entity words in the query makes it easier for a subsequent NLP (Natural Language Processing) module to understand the user's search intention. The entity words in a query are entities with specific meanings in the text, including names of people, places and organizations, proper nouns and the like, as well as expressions of time, quantity, currency, proportion and the like.
Conventionally, entity words in a query are identified by entity word matching. In brief, an entity word database is built by collecting entity words in advance; when the entity words in a query need to be identified, the words contained in the query are looked up in the entity word database. For example, for the query "is the fried chicken at Ken X base on Australian X Street good?", the words "Australian X Street", "Ken X base" and "fried chicken" are found in the database, so the entity words of the query include "Australian X Street", "Ken X base" and "fried chicken".
The inventors have found that in the process of implementing the present invention, at least the following problems exist in the prior art:
A query may include entity words of several types. In the above query, for example, "Australian X Street" is an entity word of the location type, "Ken X base" is an entity word of the brand type and the food type, and "fried chicken" is an entity word of the food type. Different analysis needs call for different entity word types; for example, when studying users' brand demand, only the entity words of the brand type in the query are needed. The prior art, however, may identify entity words of all entity word types in the query, so the identified entity words must be searched again to obtain those belonging to the specified entity word type, which is inefficient.
Disclosure of Invention
The embodiment of the invention aims to provide an entity word storage method for improving the efficiency of identifying entity words of the entity word type specified in a query. The specific technical scheme is as follows:
the embodiment of the invention provides a method for storing entity words, which comprises the following steps:
acquiring entity word data to be stored, wherein the entity word data comprises an entity word to be stored and an entity word type to which the entity word to be stored belongs, and the entity word type to which the entity word to be stored belongs is used as the entity word type to be stored;
based on a pre-established corresponding relation between the entity word type and the type text characters, determining the type text characters corresponding to the entity word type to be stored as target type text characters;
adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string which is used as the text character string to be stored;
and storing the text character strings to be stored into a pre-established dictionary tree, wherein the child nodes of the root node in the dictionary tree are used for storing the type text characters.
Further, the storing the text character string to be stored in a pre-established dictionary tree includes:
determining the sub-text character string of the text character string to be stored that is already stored in the dictionary tree as the stored text character string, and determining the sub-text character string of the text character string to be stored that is not yet stored in the dictionary tree as the non-stored text character string;
and starting from the initial text character of the non-stored text character string, sequentially storing the text characters according to their order in the non-stored text character string.
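The split into a stored and a non-stored part can be sketched as follows. This is a minimal illustration that assumes a plain nested-dictionary trie rather than the offset-based storage described later; the function names `split_stored` and `insert_remaining` are hypothetical.

```python
def split_stored(trie: dict, text: str) -> tuple:
    """Return (stored prefix, non-stored remainder) of `text` relative to `trie`."""
    node = trie
    i = 0
    # Walk down the trie for as long as the next character already has a node.
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
    return text[:i], text[i:]


def insert_remaining(trie: dict, text: str) -> tuple:
    """Store only the non-stored remainder, character by character, in order."""
    stored, remaining = split_stored(trie, text)
    node = trie
    for ch in stored:          # re-walk the part that is already present
        node = node[ch]
    for ch in remaining:       # append the missing characters in sequence
        node[ch] = {}
        node = node[ch]
    return stored, remaining


trie = {}
first = insert_remaining(trie, "Aab")   # nothing stored yet
second = insert_remaining(trie, "Aac")  # "Aa" is shared with the first string
```

Only the suffix that diverges from existing paths creates new nodes, so strings with a common type character and common leading characters share storage.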
Further, the initial text character of the non-stored text character string is used as the non-stored initial text character;
the sequentially storing, starting from the initial text character of the non-stored text character string, of the text characters according to their order in the non-stored text character string includes:
determining, in the text character string to be stored, the offset parameter of the stored text character immediately preceding the non-stored initial text character, as a target offset parameter;
determining a storage position of the non-stored initial text character as a target storage position based on the target offset parameter;
storing the non-stored initial text character at the target storage position;
updating the non-stored text character string and the non-stored initial text character, and returning to the step of determining the offset parameter of the stored text character immediately preceding the non-stored initial text character, until every text character in the non-stored text character string is stored.
Further, the determining the offset parameter of the previous text character stored in the non-stored initial text character as the target offset parameter includes:
in the case that an offset parameter has already been determined for the stored text character immediately preceding the non-stored initial text character, taking that predetermined offset parameter as the target offset parameter; or,
in the case that no offset parameter has been determined in advance for that preceding text character, taking the smallest offset parameter value not yet in use as the target offset parameter.
Further, the determining, based on the target offset parameter, a storage location of the non-stored starting text character as a target storage location includes:
calculating the sum of the target offset parameter and the coding value of the non-stored initial text character as a preprocessing value;
Determining a storage index of the non-stored initial text character as a target storage index based on the preprocessing value;
and determining a storage position associated with the target storage index as the target storage position based on the association relation between the pre-established storage index and the storage position.
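The offset selection and index calculation described above can be sketched as follows. This is only an illustration under several assumptions that the text does not state: the coding value is taken to be the Unicode code point (`ord`), the storage index is taken to equal the preprocessing value, nodes are identified by their storage index with the root as 0, and the storage position is modelled as a (block number, slot) pair with a hypothetical `BLOCK_SIZE`. Collision handling between different parents is deliberately left out.

```python
BLOCK_SIZE = 256  # hypothetical number of slots per pre-divided storage block


def pick_offset(node_offsets: dict, used: set, parent: int) -> int:
    """Reuse the parent's offset if one exists, else take the smallest unused value."""
    if parent in node_offsets:
        return node_offsets[parent]
    candidate = 0
    while candidate in used:
        candidate += 1
    node_offsets[parent] = candidate
    used.add(candidate)
    return candidate


def storage_location(offset: int, ch: str) -> tuple:
    """Map (offset, character) to a storage index and a (block, slot) position."""
    preprocessing_value = offset + ord(ch)   # offset + coding value (assumed: code point)
    index = preprocessing_value              # assumption: index equals preprocessing value
    return index, divmod(index, BLOCK_SIZE)  # (index, (block number, slot in block))


node_offsets, used = {}, set()
root = 0  # assumption: the root node is identified by index 0

off_root = pick_offset(node_offsets, used, root)          # root has no offset yet
idx_A, pos_A = storage_location(off_root, "A")            # type character "A"
off_A = pick_offset(node_offsets, used, idx_A)            # node "A" has no offset yet
idx_a, pos_a = storage_location(off_A, "a")               # first character after "A"
off_root_again = pick_offset(node_offsets, used, root)    # root now reuses its offset
idx_B, pos_B = storage_location(off_root_again, "B")      # another type character
```

Because every child of a node shares that node's offset, the position of a child follows from one addition, which is the property that makes lookups cheap in this kind of layout.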
Further, the method further comprises:
after the end text character of the text character string to be stored is stored, adding an entity word identification node as a child of the node where the end text character is located, wherein the entity word identification node is used for indicating that the text character contained in its parent node is an end text character.
Further, each storage position in the dictionary tree is located in a plurality of storage blocks divided in advance.
The embodiment of the invention provides a method for identifying entity words in text segments, which comprises the following steps:
acquiring text segment data to be identified, wherein the text segment data comprises a text segment to be identified and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation;
based on a pre-established correspondence between entity word types and type text characters, determining the type text character corresponding to the specified entity word type as the specified type text character;
and identifying the text segment to be identified based on the specified type text character and a pre-established dictionary tree to obtain an identification result, wherein the identification result is taken as the entity words belonging to the specified entity word type, and the child nodes of the root node in the dictionary tree are used for storing the type text characters.
Further, the identifying the text segment to be identified based on the text character of the specified type and a pre-established dictionary tree to obtain an identification result includes:
determining, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character as a target node;
and starting from the target node, identifying the text segment to be identified to obtain an identification result.
Further, the step of identifying the text segment to be identified from the target node to obtain an identification result includes:
determining, from the text characters contained in the text segment to be identified, a text character contained in a child node of the target node as a target text character;
if the child nodes of the node to which the target text character belongs include an entity word identification node, determining the text character string corresponding to the node to which the target text character belongs as a recognition result, wherein the entity word identification node is used for indicating that the text character contained in its parent node is the end text character of an entity word;
and if the child nodes of the node to which the target text character belongs contain the text character that follows the target text character in the text segment to be identified, taking that next text character as the target text character and returning to the step of determining, if the child nodes of the node to which the target text character belongs include an entity word identification node, that the text character string corresponding to that node is a recognition result.
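The recognition steps above can be sketched as follows. This is a simplified illustration on a nested-dictionary trie; the `END` key standing in for the entity word identification node, the type characters "A", "B" and "C", and the sample words are all assumptions made for the example.

```python
END = "<end>"  # hypothetical key playing the role of the entity word identification node


def build_trie(words_by_type_char: dict) -> dict:
    """Store every word under the child of the root that holds its type character."""
    trie = {}
    for type_char, words in words_by_type_char.items():
        for word in words:
            node = trie.setdefault(type_char, {})
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True
    return trie


def recognize(trie: dict, text: str, type_char: str) -> list:
    """Return the entity words of one type found in `text`."""
    target = trie.get(type_char)          # target node: child of the root for this type
    if target is None:
        return []
    results = []
    for start in range(len(text)):        # try every position as a possible word start
        node = target
        for end in range(start, len(text)):
            ch = text[end]
            if ch not in node:
                break
            node = node[ch]
            if END in node:                # identification node present: a word ends here
                results.append(text[start:end + 1])
    return results


trie = build_trie({"A": ["main street"], "B": ["kfc"], "C": ["fried chicken"]})
query = "kfc fried chicken on main street"
brands = recognize(trie, query, "B")
foods = recognize(trie, query, "C")
places = recognize(trie, query, "A")
```

Because the search starts below the node for the specified type character, entity words of other types are never visited, so no filtering pass over the results is needed.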
The embodiment of the invention also provides a device for storing entity words, which comprises:
the entity word acquisition module is used for acquiring entity word data to be stored, wherein the entity word data comprises entity words to be stored and entity word types to which the entity words to be stored belong, and the entity word types to which the entity words to be stored belong are used as the entity word types to be stored;
the first type text character determining module is used for determining type text characters corresponding to the entity word types to be stored as target type text characters based on the corresponding relation between the pre-established entity word types and the type text characters;
the adding module is used for adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string which is used as the text character string to be stored;
And the storage module is used for storing the text character strings to be stored into a pre-established dictionary tree, and the child nodes of the root node in the dictionary tree are used for storing the type text characters.
Further, the storage module is specifically configured to determine the sub-text character string of the text character string to be stored that is already stored in the dictionary tree as the stored text character string, determine the sub-text character string of the text character string to be stored that is not yet stored in the dictionary tree as the non-stored text character string, and, starting from the initial text character of the non-stored text character string, sequentially store the text characters according to their order in the non-stored text character string.
Further, the initial text character of the non-stored text character string is used as the non-stored initial text character;
the storage module is specifically configured to determine, in the text character string to be stored, the offset parameter of the stored text character immediately preceding the non-stored initial text character as a target offset parameter, determine a storage position of the non-stored initial text character as a target storage position based on the target offset parameter, store the non-stored initial text character at the target storage position, update the non-stored text character string and the non-stored initial text character, and return to determining the offset parameter of the stored text character immediately preceding the non-stored initial text character, until every text character in the non-stored text character string is stored.
Further, the storage module is specifically configured to take the predetermined offset parameter as the target offset parameter when an offset parameter has already been determined for the stored text character immediately preceding the non-stored initial text character; or to take the smallest offset parameter value not yet in use as the target offset parameter when no offset parameter has been determined in advance for that preceding text character.
Further, the storage module is specifically configured to calculate a sum of the target offset parameter and the encoding value of the non-stored starting text character as a preprocessing value, determine a storage index of the non-stored starting text character as a target storage index based on the preprocessing value, and determine a storage location associated with the target storage index as the target storage location based on a pre-established association relationship between the storage index and the storage location.
Further, the storage module is further configured to add an entity word identification node on the basis of the node where the terminal text character is located after the terminal text character in the text character string to be stored is stored, where the entity word identification node is used for indicating that the text character included in the parent node is the terminal text character.
Further, each storage position in the dictionary tree is located in a plurality of storage blocks divided in advance.
The embodiment of the invention also provides a device for identifying the entity words in the text segment, which comprises the following steps:
the text segment data acquisition module is used for acquiring text segment data to be identified, wherein the text segment data comprises text segments to be identified and specified entity word types, and the specified entity word types are determined based on the received entity word type selection operation;
the second type text character determining module is used for determining the type text characters corresponding to the entity word types based on the corresponding relation between the pre-established entity word types and the type text characters, and taking the type text characters as the appointed type text characters;
the recognition module is used for recognizing the text segment to be recognized based on the specified type text character and a pre-established dictionary tree to obtain a recognition result, wherein the recognition result is taken as the entity words belonging to the specified entity word type, and the child nodes of the root node in the dictionary tree are used for storing the type text characters.
Further, the recognition module is specifically configured to determine, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character as a target node, and, starting from the target node, recognize the text segment to be recognized to obtain a recognition result.
Further, the recognition module is specifically configured to determine, from the text characters contained in the text segment to be recognized, a text character contained in a child node of the target node as a target text character; if the child nodes of the node to which the target text character belongs include an entity word identification node, determine the text character string corresponding to the node to which the target text character belongs as a recognition result, wherein the entity word identification node is used for indicating that the text character contained in its parent node is the end text character of an entity word; and if the child nodes of the node to which the target text character belongs contain the text character that follows the target text character in the text segment to be recognized, take that next text character as the target text character and return to determining, if the child nodes of the node to which the target text character belongs include an entity word identification node, that the text character string corresponding to that node is a recognition result.
The embodiment of the invention also provides electronic equipment, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
and the processor is used for implementing, when executing the program stored in the memory, the steps of any of the above entity word storage methods or methods for identifying entity words in a text segment.
The embodiment of the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the above entity word storage methods or methods for identifying entity words in a text segment.
The embodiment of the invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to execute any of the above entity word storage methods or methods for identifying entity words in a text segment.
With the entity word storage method and device and the electronic equipment provided by the embodiments of the invention, entity word data to be stored is acquired, the entity word data comprising an entity word to be stored and the entity word type to which it belongs, which is taken as the entity word type to be stored. Based on the pre-established correspondence between entity word types and type text characters, the type text character corresponding to the entity word type to be stored is determined as the target type text character. The target type text character is added before the initial text character of the entity word to be stored to obtain the text character string to be stored, which is stored in a pre-established dictionary tree in which the child nodes of the root node are used for storing the type text characters. Because every entity word stored in the dictionary tree carries a type text character that distinguishes its entity word type, the dictionary tree is a storage structure in which entity words of a specified entity word type can be located quickly, which improves the efficiency of identifying entity words of the specified entity word type in a query.
Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a method for storing entity words according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a typical dictionary tree;
FIG. 3 is a schematic diagram of a dictionary tree provided by one embodiment of the present invention;
FIG. 4 is a schematic diagram of a dictionary tree provided in another embodiment of the present invention;
FIG. 5 is a flowchart of a method for storing entity words according to another embodiment of the present invention;
FIG. 6 is a flowchart of a method for storing entity words according to another embodiment of the present invention;
FIG. 7 is a schematic diagram of a dictionary tree provided in accordance with another embodiment of the present invention;
FIG. 8 is a schematic diagram of a stored route for text characters in a dictionary tree provided by one embodiment of the present invention;
FIG. 9 is a flowchart of a method for recognizing entity words in text paragraphs according to one embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an entity word storage device according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a device for recognizing entity words in text segments according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to provide an implementation scheme for improving the efficiency of identifying entity words of a specified entity word type in a query, an embodiment of the application provides an entity word storage method, an entity word storage device and electronic equipment, and the embodiment of the application is described below with reference to the accompanying drawings of the specification. And embodiments of the application and features of the embodiments may be combined with each other without conflict.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
In one embodiment of the present application, there is provided a method for storing entity words, as shown in fig. 1, the method comprising the steps of:
S101: Acquiring entity word data to be stored, wherein the entity word data comprises an entity word to be stored and the entity word type to which the entity word to be stored belongs, and the entity word type to which the entity word to be stored belongs is taken as the entity word type to be stored.
S102: Determining the type text character corresponding to the entity word type to be stored as the target type text character, based on the pre-established correspondence between entity word types and type text characters.
S103: Adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string as the text character string to be stored.
S104: Storing the text character string to be stored in a pre-established dictionary tree, wherein the child nodes of the root node in the dictionary tree are used for storing the type text characters.
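Steps S101 to S104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the nested-dictionary trie, the `TYPE_CHARS` mapping, the `END` marker and the sample words are assumptions chosen for the example.

```python
# Hypothetical correspondence between entity word types and type text characters.
TYPE_CHARS = {"location": "A", "brand": "B", "food": "C", "garment": "D"}
END = "<end>"  # hypothetical marker meaning "an entity word ends at this node"


def store_entity_word(trie: dict, word: str, word_type: str) -> str:
    """Prefix `word` with its type character and store the result in `trie`."""
    type_char = TYPE_CHARS[word_type]   # S102: look up the target type text character
    to_store = type_char + word         # S103: put it before the initial text character
    node = trie                         # the root node itself holds no character
    for ch in to_store:                 # S104: store character by character
        node = node.setdefault(ch, {})
    node[END] = True
    return to_store


trie = {}
stored = store_entity_word(trie, "paris", "location")  # S101: (entity word, type) pair
store_entity_word(trie, "pizza", "food")
```

After both calls the root has exactly two children, "A" and "C", so every first-level node corresponds to one entity word type.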
In the above method for storing entity words provided in the embodiment of the present invention, entity word data to be stored is acquired, the entity word data comprising an entity word to be stored and the entity word type to which it belongs, which is taken as the entity word type to be stored. Based on the pre-established correspondence between entity word types and type text characters, the type text character corresponding to the entity word type to be stored is determined as the target type text character. The target type text character is added before the initial text character of the entity word to be stored to obtain the text character string to be stored, which is stored in a pre-established dictionary tree in which the child nodes of the root node are used for storing the type text characters. Because every entity word stored in the dictionary tree carries a type text character that distinguishes its entity word type, the dictionary tree is a storage structure in which entity words of a specified entity word type can be located quickly, which improves the efficiency of identifying entity words of the specified entity word type in a query.
In the above method for storing entity words as shown in fig. 1 provided by the embodiment of the present invention, for step S101, the entity words to be stored included in the entity word data to be stored may be collected in advance, and optionally, the collection of entity words may be implemented in various manners.
For example, for a vertical scenario, a vertical website associated with that scenario may be acquired first. For an automobile scenario, the associated vertical website may be a website introducing automobiles, the official website of an automobile manufacturer, or an automobile trading website; for a game scenario, it may be a game forum, a website providing a game download service, or a game-related question-and-answer website. The text data in the vertical website is then extracted and processed to obtain the entity words it contains.
To improve the efficiency of subsequently finding, in a specified text segment, the entity words belonging to a specific vertical scenario, the entity word type of each entity word to be stored needs to be determined in advance when the entity word is stored.
Optionally, the entity word type of an entity word to be stored may be determined from the source of the entity word. For example, since the text data in a vertical website associated with a vertical scenario is highly relevant to that vertical domain, the entity word type of an entity word acquired from the vertical website may be set to the entity word type associated with that website. Optionally, to determine the entity word type of a collected entity word accurately, manual labeling may be used, or a neural network model may be trained to identify the entity word type.
Optionally, the entity word data to be stored may be recorded in an entity word table to be stored, and the data to be stored is obtained by obtaining the entity word table to be stored, and further, the entity word data to be stored may be recorded in a format of (entity word: entity word type), for example (western-style clothes: clothing type), where the western-style clothes is the entity word to be stored, and the clothing type is the entity word type to which the western-style clothes belongs.
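Reading a record in the (entity word: entity word type) format can be sketched as follows. The exact textual layout of the entity word table is not specified here, so the parenthesised, colon-separated line format and the function name `parse_record` are assumptions.

```python
def parse_record(line: str) -> tuple:
    """Parse '(entity word: entity word type)' into an (entity word, type) pair."""
    body = line.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError(f"not a parenthesised record: {line!r}")
    # Split on the last colon so that a colon inside the entity word is tolerated.
    word, sep, word_type = body[1:-1].rpartition(":")
    if not sep or not word.strip() or not word_type.strip():
        raise ValueError(f"missing entity word or type: {line!r}")
    return word.strip(), word_type.strip()


record = parse_record("(western-style clothes: clothing type)")
```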
For step S102, each type text character represents exactly one entity word type, so the type text characters corresponding to different entity word types differ from one another, and the entity word types can be distinguished by the type text characters alone.
The type text character corresponding to each entity word type can be chosen according to actual requirements.
For example, if the entity word type is the location type, the corresponding type text character may be expressed as a word, such as "location", as a number, such as "1", or as a letter, such as "A".
Alternatively, the correspondence between the pre-established entity word type and the type text character may be recorded in the form of a correspondence table, which is exemplary as shown in table 1 below:
TABLE 1
Entity word type    | Location type | Brand type | Food type | Garment type
Type text character | A             | B          | C         | D
In Table 1, the first row lists the entity word types and the second row lists the type text characters corresponding to them. A, B, C and D can be any text characters; for convenience of storage, serial numbers can be used as the type text characters, for example A can be 1, B can be 2, C can be 3 and D can be 4. It should be noted that the type text characters corresponding to different entity word types must differ from one another.
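The correspondence can be held in a small lookup table, as in the following sketch. It assumes the serial-number variant (1 to 4) mentioned above; the names `TYPE_TO_CHAR`, `check_unique` and `CHAR_TO_TYPE` are hypothetical.

```python
# Correspondence between entity word types and type text characters (serial numbers).
TYPE_TO_CHAR = {
    "location type": "1",
    "brand type": "2",
    "food type": "3",
    "garment type": "4",
}


def check_unique(mapping: dict) -> None:
    """Raise if two entity word types share the same type text character."""
    chars = list(mapping.values())
    if len(set(chars)) != len(chars):
        raise ValueError("type text characters must differ from one another")


check_unique(TYPE_TO_CHAR)

# The reverse table is well defined only because the characters are unique.
CHAR_TO_TYPE = {char: name for name, char in TYPE_TO_CHAR.items()}
```

The uniqueness check matters because a shared type text character would merge two entity word types under the same child of the root node.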
For step S103, the initial text character of the entity word to be stored is its first text character and, similarly, the end text character of the entity word to be stored is its last text character. For example, for the entity word to be stored "australian X street", the initial text character is "australian" and the end text character is "street".
For example, if the entity word to be stored is "australian X street" and the target type text character is A, the text string to be stored "A australian X street" is obtained.
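Steps S102 and S103 can be illustrated with a minimal Python sketch. The dictionary `TYPE_TEXT_CHARS` and the function `build_string_to_store` are hypothetical names introduced here for illustration only; the letters follow the placeholder values of Table 1.

```python
# Hypothetical correspondence table mirroring Table 1 (the letters are placeholders).
TYPE_TEXT_CHARS = {
    "location": "A",
    "brand": "B",
    "food": "C",
    "garment": "D",
}

# Type text characters must be pairwise distinct so that an entity word type
# can be told apart by its type text character alone.
assert len(set(TYPE_TEXT_CHARS.values())) == len(TYPE_TEXT_CHARS)


def build_string_to_store(entity_word: str, entity_word_type: str) -> str:
    """Add the target type text character before the initial text character."""
    target_type_char = TYPE_TEXT_CHARS[entity_word_type]  # KeyError if the type is unknown
    return target_type_char + entity_word


string_to_store = build_string_to_store("fried chicken", "food")
```

A plain dictionary is used here because the correspondence is a fixed one-to-one table; any other lookup structure would serve equally well.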
For step S104, child nodes of the root node in the pre-established dictionary tree are used to store the type text characters.
For easy understanding, first, a simple description is made of the storage manner of the dictionary tree:
Fig. 2 is a schematic diagram of a typical dictionary tree in the prior art. The root node does not contain a text character, and every node other than the root node contains one text character. For any node in the dictionary tree, the text characters contained in its child nodes are different from each other; the node path of a node is the path from the root node to that node; and the text string corresponding to a node is the string formed, in path order, by the text characters passed along its node path.
In the dictionary tree shown in fig. 2, gray nodes represent that text strings corresponding to the nodes are entity words, and text characters contained in the gray nodes are end text characters of the entity words.
For example, in the dictionary tree shown in fig. 2, the node path of node r is root node-a-c-d-r, and the text string corresponding to node r is acdr; the node path of the node f whose parent node is node c is root node-a-c-f, and its corresponding text string is acf; the node path of the node f whose parent node is node b is root node-b-f, and its corresponding text string is bf. The dictionary tree shown in fig. 2 stores the entity words ab, acd, acdr, acf, ba and bf.
In each embodiment provided by the application, the child nodes of the root node in the dictionary tree are not used to store text characters contained in the entity words to be stored, but to store the type text characters corresponding to the entity word types to which the entity words to be stored belong. The nodes storing entity words of the same entity word type are therefore located under the node containing the same type text character, so that the entity words are partitioned by entity word type at storage time.
As shown in fig. 3, a schematic diagram of a dictionary tree provided in an embodiment of the present application is shown, in which a root node includes three child nodes, which include type text characters A, B and C, respectively, wherein the type text characters A, B and C represent different entity word types, and under the node to which the type text character a belongs, there is an entity word "australian X street", under the node to which the type text character B belongs, there is an entity word "ken X chicken", under the node to which the type text character C belongs, there is an entity word "fried chicken".
When the entity word to be stored is "golden X gate" and its type text character is B, the type text character B is added before the initial text character "golden" of "golden X gate" to obtain the text string to be stored "B golden X gate"; storing "B golden X gate" into the dictionary tree shown in fig. 3 yields the dictionary tree shown in fig. 4.
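The effect of reserving the first level of the dictionary tree for type text characters can be sketched as follows. This is a plain pointer-based tree, not the double-array form described later; `Node`, `insert` and the romanised sample words are assumptions made for illustration, with each letter standing in for one text character.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # text character -> child Node
    is_end: bool = False  # True if the path from the root spells a stored string


def insert(root: Node, string_to_store: str) -> None:
    """Store a type-prefixed string, creating nodes only where they are missing."""
    node = root
    for ch in string_to_store:
        node = node.children.setdefault(ch, Node())
    node.is_end = True


root = Node()
# The first character of each string is the type text character (A, B or C).
for s in ("Aaustralian X street", "Bken X chicken", "Cfried chicken"):
    insert(root, s)
insert(root, "Bgolden X gate")  # lands under the existing B node
```

After these calls the root has exactly the children A, B and C, and both brand words hang below the single B node, which is the partitioning by entity word type described above.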
On the basis of the foregoing embodiment, the embodiment of the present invention further provides another entity word storage method, which is used to implement step S104 in the entity word storage method shown in fig. 1, as shown in fig. 5, including:
S501: Determine the sub-text string of the text string to be stored that is already stored in the dictionary tree as the stored text string, and determine the sub-text string of the text string to be stored that is not stored in the dictionary tree as the non-stored text string.
In this step, for each text string to be stored, the text string to be stored may be divided into two parts, the first part being a sub-text string already stored in the dictionary tree as a stored text string, and the second part being a sub-text string not stored in the dictionary tree as an un-stored text string.
For example, on the basis of the dictionary tree shown in fig. 4, when the new text string to be stored is "C fried chicken block", since "C fried chicken" is already stored in the dictionary tree, the stored text string is "C fried chicken" and the non-stored text string is "block".
It should be noted that, for any text string to be stored, either the stored text string or the non-stored text string may be empty. An empty stored text string indicates that none of the child nodes of the root node in the dictionary tree contains the initial text character of the text string to be stored; an empty non-stored text string indicates that the text string to be stored has already been stored in the dictionary tree.
S502: starting from the initial text character of the non-stored text character string, each text character is stored in turn in the order of composition of each text character in the non-stored text character string.
This step may be performed when the non-stored text string is not empty. The composition order of the text characters in the non-stored text string is the order in which they appear in that string. For example, if the text string to be stored is "C popcorn" and the non-stored text string is "popcorn", the text characters of "popcorn" are stored one at a time in the order in which they appear: the first text character is stored first, and the last text character is stored last.
In the entity word storage method shown in fig. 5, the sub-text string of the text string to be stored that is already stored in the dictionary tree is determined as the stored text string, the sub-text string that is not stored in the dictionary tree is determined as the non-stored text string, and, starting from the initial text character of the non-stored text string, the text characters are stored in turn in their composition order. Only the part of the text string that is not yet stored in the dictionary tree needs to be stored, so the storage efficiency of entity words can be improved.
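Steps S501 and S502 can be sketched on a nested-dictionary tree as follows. The stored text string is read here as the longest prefix already present in the tree, which is an interpretation rather than something the text states explicitly; the function names are hypothetical.

```python
def split_stored(root: dict, string_to_store: str) -> tuple:
    """S501: return (stored text string, non-stored text string)."""
    node = root
    n = 0
    for ch in string_to_store:
        if ch not in node:
            break
        node = node[ch]
        n += 1
    return string_to_store[:n], string_to_store[n:]


def store(root: dict, string_to_store: str) -> tuple:
    """S502: store only the non-stored part, one text character at a time."""
    stored, not_stored = split_stored(root, string_to_store)
    node = root
    for ch in stored:       # walk to the node of the last stored character
        node = node[ch]
    for ch in not_stored:   # append the missing characters in composition order
        node[ch] = {}
        node = node[ch]
    return stored, not_stored


tree = {}
first = store(tree, "Cab")    # nothing is stored yet
second = store(tree, "Cabc")  # only "c" is new
third = store(tree, "Cab")    # already fully stored
```

When the non-stored part is empty, the second loop does nothing, which matches the remark that S502 only needs to run for a non-empty non-stored text string.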
On the basis of the foregoing embodiment, the embodiment of the present invention further provides a method for storing entity words, which is used to implement step S502 in the method for storing entity words shown in fig. 5, where a starting text character of an unstored text character string is used as an unstored starting text character, as shown in fig. 6, and includes:
S601: In the text string to be stored, determine the offset parameter of the stored text character immediately preceding the non-stored initial text character as the target offset parameter.
In this step, the dictionary tree may be a double-array dictionary tree (Double Array Trie), where each node included in the dictionary tree is represented by two parameters, namely, an offset parameter (base) and a check parameter (check), where the offset parameter of one node is used to represent the offset size when the child node of the node stores a text character, and the check parameter of one node is used to represent the offset size when the node stores a text character, so that the check parameter of one node is equal to the offset parameter of the parent node of the node.
The check parameters of all child nodes of one node are therefore the same; that is, nodes with the same check parameter have the same parent node.
In particular, the offset parameter and the check parameter of the root node may be specified in advance, for example, the offset parameter of the root node is 0 and the check parameter is 1.
For a node, when there is no child node, it may not have an offset parameter, and the offset parameter of the node may be determined when it is required to generate a child node on the basis of the node.
Thus, when the non-stored initial text character is to be stored, the offset parameter of its parent node may or may not already exist.
Thus, optionally, in the case that the stored text character preceding the non-stored initial text character already has a predetermined offset parameter, the predetermined offset parameter is taken as the target offset parameter.
A predetermined offset parameter on the stored text character preceding the non-stored initial text character indicates that the non-stored initial text character has a sibling node, so the offset parameter used when storing the text character contained in the sibling node can be taken as the target offset parameter.
Optionally, in the case that the stored text character preceding the non-stored initial text character does not have a predetermined offset parameter, the unused offset parameter with the smallest value is determined as the target offset parameter.
The absence of a predetermined offset parameter on the stored text character preceding the non-stored initial text character indicates that the non-stored initial text character has no sibling node. In this case, the unused offset parameter with the smallest value may be selected as the target offset parameter, or an offset parameter may be selected at random from the unused offset parameters as the target offset parameter.
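The two cases of S601 can be written as one small function. This is a sketch under the assumption that candidate offset parameters are the positive integers starting at 1; `choose_target_offset` and its parameter names are hypothetical.

```python
from typing import Optional


def choose_target_offset(parent_offset: Optional[int], used_offsets: set) -> int:
    """Return the target offset parameter for the non-stored initial text character.

    parent_offset is the offset parameter of the preceding stored text
    character, or None if that node has no offset parameter yet.
    """
    if parent_offset is not None:
        # A sibling already exists: reuse the offset its parent was given.
        return parent_offset
    # No sibling: take the smallest unused value (assumed to start at 1).
    candidate = 1
    while candidate in used_offsets:
        candidate += 1
    return candidate
```

The caller is expected to record the returned value as the parent's offset parameter and add it to `used_offsets` so that later nodes do not reuse it.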
S602: Based on the target offset parameter, determine the storage location of the non-stored initial text character as the target storage location.
In this step, since the offset parameter of a node represents the offset applied when the child nodes of that node store text characters, once the target offset parameter is determined, the offset to be applied to the non-stored initial text character is known, and the target storage location of the non-stored initial text character can be determined.
In one embodiment, the sum of the target offset parameter and the encoding value of the non-stored initial text character may be calculated as a pre-processing value, then the storage index of the non-stored initial text character is determined based on the pre-processing value as a target storage index, and finally the storage position associated with the target storage index is determined based on the association relationship between the pre-established storage index and the storage position as a target storage position.
The encoded value of the non-stored initial text character may be a value corresponding to a Unicode code of the non-stored initial text character, for example, a value corresponding to a Unicode code of text character "one" is 19968, and a value corresponding to a Unicode code of text character "lifting" is 20030.
Since Unicode codes of each text character have uniqueness, the preprocessing value obtained by adding the target offset parameter and the encoding value of the non-stored initial text character can ensure that the preprocessing values of the child nodes of the same father node are different.
Further, the storage index of the non-stored initial text character may be determined from the pre-processing value by adding a preset value to the pre-processing value. The preset value is used to avoid collision with the root node, and its size may therefore be the same as the offset parameter of the root node.
For example, when the offset parameter of the root node is base[root] = 1, the non-stored initial text character is "1", the Unicode code of "1" is known in advance to be 49, and the preset value is 1, the storage index of the non-stored initial text character "1" is index = base[root] + char[1] + 1 = 1 + 49 + 1 = 51.
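The index calculation above amounts to two additions. In the sketch below, `storage_index` and `PRESET_VALUE` are hypothetical names, and Python's built-in `ord` is used to obtain the Unicode code of a text character.

```python
PRESET_VALUE = 1  # assumed equal to the root node's offset parameter, as in the example


def storage_index(target_offset: int, ch: str, preset: int = PRESET_VALUE) -> int:
    """Storage index = (target offset parameter + Unicode code) + preset value."""
    preprocessing_value = target_offset + ord(ch)
    return preprocessing_value + preset


index_of_1 = storage_index(1, "1")  # base[root] = 1, Unicode code of "1" is 49
```

Because two children of the same parent share the target offset parameter and have different Unicode codes, their storage indexes necessarily differ.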
Optionally, the association relationship between the storage indexes and the storage positions may be determined according to actual requirements, where each storage index is associated with a storage position.
In one embodiment, the storage locations in the dictionary tree are located in a plurality of pre-divided storage blocks (blocks). When one storage block is full, a new storage block may be generated to store data. Optionally, each storage block may be a 256-bit storage block.
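One possible association between storage indexes and storage locations in blocks is sketched below. The text does not specify this mapping, so everything here is an assumption: a block is taken to hold 256 slots, the block number and slot are derived by integer division, and blocks are allocated only when first written. `BlockStore` is a hypothetical name.

```python
BLOCK_SIZE = 256  # assumption: 256 slots per storage block


class BlockStore:
    """Sparse store that maps a storage index to a (block number, slot) pair."""

    def __init__(self):
        self.blocks = {}  # block number -> list of BLOCK_SIZE slots

    @staticmethod
    def locate(index: int) -> tuple:
        return divmod(index, BLOCK_SIZE)

    def put(self, index: int, value) -> None:
        block_no, slot = self.locate(index)
        # A new block is generated the first time an index falls into it.
        block = self.blocks.setdefault(block_no, [None] * BLOCK_SIZE)
        block[slot] = value

    def get(self, index: int):
        block_no, slot = self.locate(index)
        block = self.blocks.get(block_no)
        return None if block is None else block[slot]


store_blocks = BlockStore()
store_blocks.put(51, "1")
store_blocks.put(19971, "one")
```

Allocating blocks lazily keeps memory use proportional to the number of occupied regions, which matters because storage indexes derived from Unicode codes are widely scattered.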
S603: according to the target storage location, the non-stored starting text characters are stored.
In this step, the non-stored initial text character may be stored in the target storage location; it will be appreciated that a node containing the non-stored initial text character then exists in the dictionary tree.
In one embodiment, the non-stored text string and the non-stored initial text character are then updated: the non-stored initial text character is deleted from the original non-stored text string, and the text character following it is taken as the new non-stored initial text character.
Illustratively, if the non-stored text string is "fried chicken wings", then after the non-stored initial text character "fried" is stored, the new non-stored text string is "chicken wings" and the new non-stored initial text character is "chicken".
Further, the method may return to S601 until every text character in the non-stored text string has been stored.
In the entity word storage method shown in fig. 6 provided by the embodiment of the invention, in the text string to be stored, the offset parameter of the stored text character preceding the non-stored initial text character is determined as the target offset parameter, the storage location of the non-stored initial text character is determined as the target storage location based on the target offset parameter, and the non-stored initial text character is stored according to the target storage location. The target storage location of the non-stored initial text character can thus be determined simply and quickly, and the storage of the non-stored text string is thereby realized.
In one embodiment, based on the above embodiment, as in the further dictionary tree diagram shown in fig. 7, after the end text character of the text string to be stored is stored, an entity word identification node (the node containing "label" in the figure) may be added under the node where the end text character is located, where the entity word identification node is used to indicate that the text character contained in its parent node is an end text character.
In one embodiment, to facilitate subsequent recognition of the end text character of each stored entity word, the offset parameter of the entity word identification node may be set to the unoccupied negative number with the largest value, the encoding value corresponding to the entity word identification node may be set to the negative of the preset value, and the offset parameter of the end text character may be set to the difference between the storage index and the check parameter of the node containing the end text character, plus the preset value.
In one embodiment, the entity words that need to be stored are "one lift" (一举), "one lift adult name" (一举成名) and "two number" (二号), written here character by character. The entity word type of all three is the typical type, and the corresponding type text character is 1, so the text strings to be stored are "1 one lift", "1 one lift adult name" and "1 two number".
The Unicode codes of the text characters are given first, see Table 2:
TABLE 2
Character    | 1  | one (一) | lift (举) | adult (成) | name (名) | two (二) | number (号)
Unicode code | 49 | 19968    | 20030     | 25104      | 21517     | 20108    | 21495
In order to follow the changes of the offset parameter and the check parameter of each text character more clearly and intuitively, the record table shown in Table 3 is established:
TABLE 3
char  | x1
i     | 0
base  | 1
check | 0
The char row lists the text characters to be stored (in the formulas below, char[X] denotes the Unicode code of text character X), the i row is the storage index row, the base row is the offset parameter row, and the check row is the check parameter row. x1 represents the root node, which is empty, with i[x1] = 0, base[x1] = 1 and check[x1] = 0. The preset value is 1, and the offset parameter of a node containing an end text character is calculated as base[X] = i[X] - check[X] + 1.
When storing the text character "1", it can be calculated that:
i[1] = base[x1] + char[1] + 1 = 1 + 49 + 1 = 51; check[1] = base[x1] = 1.
Table 4 is obtained:
TABLE 4
char  | x1 | 1
i     | 0  | 51
base  | 1  |
check | 0  | 1
When the text character "one" needs to be stored, since the offset parameter of the text character "1" does not yet exist, base[1] must be determined first. Since the value 1 is already occupied by base[x1], the smallest unused value is 2, so base[1] = 2, and then:
i[one] = base[1] + char[one] + 1 = 2 + 19968 + 1 = 19971; check[one] = base[1] = 2.
Table 5 is obtained:
TABLE 5
char  | x1 | 1  | one
i     | 0  | 51 | 19971
base  | 1  | 2  |
check | 0  | 1  | 2
Since the text character "two" is a sibling of the text character "one", the text character "two" can be stored next. Since the offset parameter base[1] of the text character "1" already exists, the target offset parameter of the text character "two" is base[1], and then:
i[two] = base[1] + char[two] + 1 = 2 + 20108 + 1 = 20111; check[two] = base[1] = 2.
Table 6 is obtained:
TABLE 6
char  | x1 | 1  | one   | two
i     | 0  | 51 | 19971 | 20111
base  | 1  | 2  |       |
check | 0  | 1  | 2     | 2
When the text character "lift" needs to be stored, since the offset parameter of the text character "one" does not yet exist, base[one] must be determined first. Since the values 1 and 2 are already occupied by base[x1] and base[1], the smallest unused value is 3, so base[one] = 3, and then:
i[lift] = base[one] + char[lift] + 1 = 3 + 20030 + 1 = 20034; check[lift] = base[one] = 3.
Since the text character "lift" is the end text character of the entity word "one lift", an entity word identification node x2 can be newly created under the node where the text character "lift" is located. At this time it can be determined that base[lift] = i[lift] - check[lift] + 1 = 20034 - 3 + 1 = 20032, and then, with char[x2] = -1:
i[x2] = base[lift] + char[x2] + 1 = 20032 + (-1) + 1 = 20032; check[x2] = base[lift] = 20032.
Further, since -1 is the unoccupied negative number with the largest value, base[x2] = -1 is set.
Table 7 is obtained:
TABLE 7
char  | x1 | 1  | one   | two   | lift  | x2
i     | 0  | 51 | 19971 | 20111 | 20034 | 20032
base  | 1  | 2  | 3     |       | 20032 | -1
check | 0  | 1  | 2     | 2     | 3     | 20032
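When the text character "adult" needs to be stored, since the offset parameter of the text character "lift" was already determined when determining i[x2] (base[lift] = 20032), then:
i[adult] = base[lift] + char[adult] + 1 = 20032 + 25104 + 1 = 45137; check[adult] = base[lift] = 20032.
Table 8 is obtained:
TABLE 8
char  | x1 | 1  | one   | two   | lift  | x2    | adult
i     | 0  | 51 | 19971 | 20111 | 20034 | 20032 | 45137
base  | 1  | 2  | 3     |       | 20032 | -1    |
check | 0  | 1  | 2     | 2     | 3     | 20032 | 20032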
When the text character "name" needs to be stored, since the offset parameter of the text character "adult" does not yet exist, base[adult] must be determined first. Since the values 1, 2, 3 and 20032 are already occupied, the smallest unused value is 4, so base[adult] = 4, and then:
i[name] = base[adult] + char[name] + 1 = 4 + 21517 + 1 = 21522; check[name] = base[adult] = 4.
Since the text character "name" is the end text character of the entity word "one lift adult name", an entity word identification node x3 can be newly created under the node where the text character "name" is located. At this time it can be determined that base[name] = i[name] - check[name] + 1 = 21522 - 4 + 1 = 21519, and then, with char[x3] = -1:
i[x3] = base[name] + char[x3] + 1 = 21519 + (-1) + 1 = 21519; check[x3] = base[name] = 21519.
Further, since -1 is occupied, -2 is the unoccupied negative number with the largest value, so base[x3] = -2 is set.
Table 9 is obtained:
TABLE 9
char  | x1 | 1  | one   | two   | lift  | x2    | adult | name  | x3
i     | 0  | 51 | 19971 | 20111 | 20034 | 20032 | 45137 | 21522 | 21519
base  | 1  | 2  | 3     |       | 20032 | -1    | 4     | 21519 | -2
check | 0  | 1  | 2     | 2     | 3     | 20032 | 20032 | 4     | 21519
When the text character "number" needs to be stored, since the offset parameter of the text character "two" does not yet exist, base[two] must be determined first. Since the values 1, 2, 3, 4, 21519 and 20032 are already occupied, the smallest unused value is 5, so base[two] = 5, and then:
i[number] = base[two] + char[number] + 1 = 5 + 21495 + 1 = 21501; check[number] = base[two] = 5.
Since the text character "number" is the end text character of the entity word "two number", an entity word identification node x4 can be newly created under the node where the text character "number" is located. At this time it can be determined that base[number] = i[number] - check[number] + 1 = 21501 - 5 + 1 = 21497, and then, with char[x4] = -1:
i[x4] = base[number] + char[x4] + 1 = 21497 + (-1) + 1 = 21497; check[x4] = base[number] = 21497.
Further, since -1 and -2 are occupied, -3 is the unoccupied negative number with the largest value, so base[x4] = -3 is set.
Table 10 is obtained:
TABLE 10
char  | x1 | 1  | one   | two   | lift  | x2    | adult | name  | x3    | number | x4
i     | 0  | 51 | 19971 | 20111 | 20034 | 20032 | 45137 | 21522 | 21519 | 21501  | 21497
base  | 1  | 2  | 3     | 5     | 20032 | -1    | 4     | 21519 | -2    | 21497  | -3
check | 0  | 1  | 2     | 2     | 3     | 20032 | 20032 | 4     | 21519 | 5      | 21497
As shown in fig. 8, a schematic diagram of a storage route of text characters in the dictionary tree in the above embodiment is shown, and the direction indicated by an arrow in the figure indicates the storage order.
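The walkthrough above can be replayed with a short Python sketch that applies the same three rules: storage index = base[parent] + char + preset value, check[child] = base[parent], and base[X] = i[X] - check[X] + 1 for a node containing an end text character. The class `DoubleArraySketch` and all of its member names are hypothetical. The sketch keeps base and check in dictionaries keyed by storage index and does not relocate nodes when a slot is already occupied, which a production double-array dictionary tree would need to handle. The text characters are built from the Unicode codes of Table 2.

```python
ROOT = 0         # storage index of the root node x1
PRESET = 1       # preset value, equal to base[x1] in the worked example
LABEL_CODE = -1  # encoding value of an entity word identification node (-PRESET)


class DoubleArraySketch:
    def __init__(self):
        self.base = {ROOT: 1}   # storage index -> offset parameter
        self.check = {ROOT: 0}  # storage index -> check parameter
        self.labels = 0         # identification nodes created so far

    def _smallest_unused_offset(self) -> int:
        used = set(self.base.values())
        candidate = 1
        while candidate in used:
            candidate += 1
        return candidate

    def _child(self, parent: int, code: int) -> int:
        """Return the storage index of the child with this code, creating it if absent."""
        if parent not in self.base:
            self.base[parent] = self._smallest_unused_offset()
        index = self.base[parent] + code + PRESET
        if index not in self.check:
            self.check[index] = self.base[parent]
        elif self.check[index] != self.base[parent]:
            raise ValueError(f"slot {index} is occupied; relocation is not sketched")
        return index

    def insert(self, string_to_store: str) -> None:
        node = ROOT
        for ch in string_to_store:
            node = self._child(node, ord(ch))
        if node not in self.base:
            # Offset parameter of a node containing an end text character.
            self.base[node] = node - self.check[node] + PRESET
        label = self._child(node, LABEL_CODE)
        if label not in self.base:
            self.labels += 1
            self.base[label] = -self.labels  # -1, -2, -3, ...


ONE, LIFT, ADULT, NAME, TWO, NUMBER = (
    chr(c) for c in (19968, 20030, 25104, 21517, 20108, 21495)
)
trie = DoubleArraySketch()
for s in ("1" + ONE + LIFT, "1" + ONE + LIFT + ADULT + NAME, "1" + TWO + NUMBER):
    trie.insert(s)
```

The sketch stores the three strings one after another rather than in the sibling-first order of fig. 8; for this particular data both orders assign the same offset parameters, because each offset is fixed at the moment a node first needs a child or is marked as an end node.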
In another embodiment of the present invention, as shown in fig. 9, there is further provided a method for identifying an entity word in a text segment, including:
S901: Acquire text segment data to be identified, where the text segment data includes the text segment to be identified and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation.
S902: Based on the pre-established correspondence between entity word types and type text characters, determine the type text character corresponding to the specified entity word type as the specified type text character.
S903: Identify the text segment to be identified based on the specified type text character and a pre-established dictionary tree to obtain an identification result, and take the identification result as an entity word belonging to the specified entity word type, where the child nodes of the root node in the dictionary tree are used to store type text characters.
In the method for identifying an entity word in a text segment as shown in fig. 9, the text segment data to be identified may be obtained, where the text segment data includes the text segment to be identified and a specified entity word type, the specified entity word type is determined based on a received entity word type selection operation, and based on a corresponding relationship between a pre-established entity word type and a type text character, the type text character corresponding to the specified entity word type is determined as the specified type text character, and the text segment to be identified is identified based on the specified type text character and a pre-established dictionary tree, so as to obtain an identification result, and the identification result is used as an entity word belonging to the specified entity word type.
For the above embodiment, in step S901, the text segment to be identified may be a query to be identified. Since the specified entity word type is determined based on the received entity word type selection operation, an entity word type selection interface may be displayed before the query is identified, so that the user can select the entity word type to which the entity words to be identified belong.
By way of example, the text passage to be identified may be a query, such as "KenX-based fried chicken of Australian X street" and the specified entity word type may be a brand type.
For step S902, this step is the same as or similar to step S102, and will not be described here.
For step S903, since the child nodes of the root node in the dictionary tree are used to store type text characters, the specified type text character determines the part of the dictionary tree against which the text segment to be recognized is matched. The range that needs to be searched is thereby reduced, and recognition efficiency is improved.
In one embodiment, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character is determined as the target node, and the text segment to be identified is identified starting from the target node to obtain the identification result.
For example, when the specified type text character is B and the text segment to be recognized is the query about the fried chicken of "ken X chicken" on "australian X street", then, for the dictionary tree shown in fig. 3, the entity words in the text segment are recognized starting from the target node containing the specified type text character B, and the recognition result "ken X chicken" is obtained.
Optionally, among the text characters contained in the text segment to be recognized, a text character contained in a child node of the target node may be determined as the target text character. If the child nodes of the node to which the target text character belongs include an entity word identification node, the text string corresponding to the node to which the target text character belongs is determined as a recognition result, where the entity word identification node indicates that the text character contained in its parent node is the end text character of an entity word. If the text character that follows the target text character in the text segment to be recognized is contained in a child node of the node to which the target text character belongs, that next text character is taken as the new target text character, and the step of checking whether the child nodes include an entity word identification node is executed again.
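The recognition procedure of steps S902 and S903 can be sketched on a nested-dictionary tree as follows. This is an illustration under stated assumptions rather than the double-array form: `LABEL` is a hypothetical key standing in for an entity word identification node, the sample words are romanised placeholders in which each letter stands in for one text character, and the result is returned without the type text character.

```python
LABEL = "<label>"  # hypothetical stand-in for an entity word identification node


def insert(root: dict, type_char: str, entity_word: str) -> None:
    node = root.setdefault(type_char, {})
    for ch in entity_word:
        node = node.setdefault(ch, {})
    node[LABEL] = {}


def recognize(root: dict, type_char: str, text: str) -> list:
    """Return the entity words of the specified type found in the text segment."""
    target = root.get(type_char)  # target node: child of the root for this type
    results = []
    if target is None:
        return results
    for start in range(len(text)):
        node = target
        for end in range(start, len(text)):
            node = node.get(text[end])  # follow the next text character, if present
            if node is None:
                break
            if LABEL in node:           # an identification node marks an end character
                results.append(text[start:end + 1])
    return results


tree = {}
insert(tree, "A", "aoXstreet")
insert(tree, "B", "kenXchicken")
insert(tree, "C", "fried chicken")
query = "the fried chicken of kenXchicken on aoXstreet"
brands = recognize(tree, "B", query)
```

Only the subtree below the target node is ever visited, so entity words of other types stored in the same dictionary tree cannot produce matches.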
Based on the same inventive concept, according to the entity word storage method provided by the embodiment of the present invention, the embodiment of the present invention further provides an entity word storage device, as shown in fig. 10, where the device includes:
the entity word obtaining module 1001 is configured to obtain entity word data to be stored, where the entity word data includes an entity word to be stored and an entity word type to which the entity word to be stored belongs, and the entity word type to which the entity word to be stored belongs is used as the entity word type to be stored;
a first type text character determining module 1002, configured to determine, based on a pre-established correspondence between the entity word type and the type text character, a type text character corresponding to the entity word type to be stored as a target type text character;
an adding module 1003, configured to add a target type text character before a start text character of the entity word to be stored, to obtain a text string, as the text string to be stored;
the storage module 1004 is configured to store the text character string to be stored in a pre-established dictionary tree, where child nodes of a root node in the dictionary tree are used to store the type text characters.
Further, the storage module 1004 is specifically configured to determine the sub-text string of the text string to be stored that is already stored in the dictionary tree as the stored text string, determine the sub-text string of the text string to be stored that is not stored in the dictionary tree as the non-stored text string, and, starting from the initial text character of the non-stored text string, store the text characters in turn according to their composition order in the non-stored text string.
Further, the initial text character of the non-stored text character string is used as the non-stored initial text character;
the storage module 1004 is specifically configured to: determine, in the text string to be stored, the offset parameter of the stored text character preceding the non-stored initial text character as the target offset parameter; determine, based on the target offset parameter, the storage location of the non-stored initial text character as the target storage location; store the non-stored initial text character according to the target storage location; update the non-stored text string and the non-stored initial text character; and return to the step of determining the offset parameter of the stored text character preceding the non-stored initial text character, until every text character in the non-stored text string is stored.
Further, the storage module 1004 is specifically configured to take the predetermined offset parameter as the target offset parameter in the case that the stored text character preceding the non-stored initial text character has a predetermined offset parameter; or, in the case that the stored text character preceding the non-stored initial text character has no predetermined offset parameter, determine the unused offset parameter with the smallest value as the target offset parameter.
Further, the storage module 1004 is specifically configured to calculate the sum of the target offset parameter and the encoding value of the non-stored initial text character as a pre-processing value, determine the storage index of the non-stored initial text character based on the pre-processing value as the target storage index, and determine the storage location associated with the target storage index as the target storage location based on the pre-established association between storage indexes and storage locations.
Further, the storage module 1004 is further configured to add, after the end text character in the text character string to be stored is stored, an entity word identification node based on the node where the end text character is located, where the entity word identification node is configured to indicate that the text character included in the parent node is the end text character.
Further, each storage position in the dictionary tree is located in a plurality of storage blocks divided in advance.
Based on the same inventive concept, according to the method for identifying entity words in text segments provided by the embodiment of the present invention, the embodiment of the present invention further provides an apparatus for identifying entity words in text segments, as shown in fig. 11, where the apparatus includes:
a text segment data obtaining module 1101, configured to obtain text segment data to be identified, where the text segment data includes a text segment to be identified and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation;
a second type text character determining module 1102, configured to determine, based on a pre-established correspondence between entity word types and type text characters, the type text character corresponding to the specified entity word type as a specified type text character;
a recognition module 1103, configured to recognize the text segment to be recognized based on the specified type text character and a pre-established dictionary tree to obtain a recognition result, where the recognition result is an entity word belonging to the specified entity word type, and the child nodes of the root node in the dictionary tree are used for storing type text characters.
Further, the recognition module 1103 is specifically configured to determine, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character as the target node, and to recognize the text segment to be recognized starting from the target node, so as to obtain the recognition result.
Further, the recognition module 1103 is specifically configured to: determine, among the text characters included in the text segment to be recognized, a text character contained in a child node of the target node as a target text character; if the child nodes of the node to which the target text character belongs include an entity word identification node, determine the text character string corresponding to the node to which the target text character belongs as a recognition result, where the entity word identification node indicates that the text character contained in its parent node is the end text character of an entity word; and if, in the text segment to be recognized, the next text character after the target text character is contained in a child node of the node to which the target text character belongs, take that next text character as the target text character and return to the step of determining, if the child nodes of the node to which the target text character belongs include an entity word identification node, the text character string corresponding to that node as a recognition result.
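The recognition flow above can be approximated by the following sketch, which again uses a nested-dict dictionary tree. The names `END`, `store`, and `recognize`, the "@" and "#" type characters, and the choice to try every position of the text segment as a possible start are assumptions made for illustration; the embodiment does not prescribe this exact traversal.

```python
END = "<END>"  # hypothetical key standing in for the entity word identification node


def store(trie, type_char, word):
    """Helper that builds the tree: type character first, then the word."""
    node = trie
    for ch in type_char + word:
        node = node.setdefault(ch, {})
    node[END] = {}


def recognize(trie, type_char, text):
    """Return the entity words of the given type found in the text."""
    target = trie.get(type_char)  # child of the root holding the type character
    if target is None:
        return []
    results = []
    for start in range(len(text)):
        node = target
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # the next character is not a child node: stop this walk
            if END in node:
                # an identification node marks text[start:end + 1] as an entity word
                results.append(text[start:end + 1])
    return results


trie = {}
store(trie, "@", "paris")  # "@" assumed to denote a place type
store(trie, "#", "ann")    # "#" assumed to denote a person type
places = recognize(trie, "@", "ann in paris")
people = recognize(trie, "#", "ann in paris")
```

Because the walk starts from the child node for the specified type character, words stored under other types are never visited, which is the filtering effect the type text character is meant to provide.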
An embodiment of the present invention further provides an electronic device, as shown in fig. 12, which comprises a processor 1201, a communication interface 1202, a memory 1203 and a communication bus 1204, wherein the processor 1201, the communication interface 1202 and the memory 1203 communicate with each other through the communication bus 1204;
the memory 1203 is configured to store a computer program;
the processor 1201 is configured to implement the above-mentioned entity word storage method or the above-mentioned method for identifying entity words in a text segment when executing the program stored in the memory 1203.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented in the figure by only one bold line, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any of the above-mentioned entity word storage methods or methods for identifying entity words in a text segment.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the entity word storage methods or methods for identifying entity words in a text segment of the embodiments described above.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments. In particular, the embodiments of the apparatus, the electronic device, the computer-readable storage medium, and the computer program product are described relatively briefly because they are substantially similar to the method embodiments; for the relevant points, refer to the corresponding description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (19)

1. A method for storing entity words, comprising:
acquiring entity word data to be stored, wherein the entity word data comprises an entity word to be stored and an entity word type to which the entity word to be stored belongs, and the entity word type to which the entity word to be stored belongs is used as the entity word type to be stored;
based on a pre-established corresponding relation between the entity word type and the type text characters, determining the type text characters corresponding to the entity word type to be stored as target type text characters;
adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string which is used as the text character string to be stored;
storing the text character strings to be stored into a pre-established dictionary tree, wherein child nodes of a root node in the dictionary tree are used for storing type text characters;
the storing the text character strings to be stored in a pre-established dictionary tree comprises the following steps:
determining the sub-string of the text character string to be stored that is already stored in the dictionary tree as the stored text character string, and determining the sub-string of the text character string to be stored that is not yet stored in the dictionary tree as the non-stored text character string;
and starting from the initial text character of the non-stored text character string, sequentially storing the text characters according to the composition sequence of the text characters in the non-stored text character string.
2. The method of claim 1, wherein the initial text character of the non-stored text character string is taken as a non-stored initial text character;
the sequentially storing the text characters, starting from the initial text character of the non-stored text character string and according to the composition order of the text characters in the non-stored text character string, comprises:
determining, in the text character string to be stored, the offset parameter of the stored text character immediately preceding the non-stored initial text character as a target offset parameter;
determining a storage position of the non-stored initial text character as a target storage position based on the target offset parameter;
storing the non-stored initial text character at the target storage position;
updating the non-stored text character string and the non-stored initial text character, and returning to the step of determining the offset parameter of the stored text character immediately preceding the non-stored initial text character, until every text character in the non-stored text character string has been stored.
3. The method of claim 2, wherein the determining, as the target offset parameter, the offset parameter of the stored text character immediately preceding the non-stored initial text character comprises:
in the case that a predetermined offset parameter already exists for the stored text character immediately preceding the non-stored initial text character, taking the predetermined offset parameter as the target offset parameter; or,
in the case that no predetermined offset parameter exists for that stored text character, determining the smallest unused offset parameter as the target offset parameter.
4. A method according to claim 2 or 3, wherein said determining a storage location of said non-stored initial text character as a target storage location based on said target offset parameter comprises:
calculating the sum of the target offset parameter and the encoding value of the non-stored initial text character as a preprocessing value;
determining a storage index of the non-stored initial text character as a target storage index based on the preprocessing value;
and determining a storage position associated with the target storage index as the target storage position based on the association relation between the pre-established storage index and the storage position.
5. A method according to any one of claims 1-3, wherein the method further comprises:
after the end text character of the text character string to be stored has been stored, adding an entity word identification node as a child of the node where the end text character is located, wherein the entity word identification node is used for indicating that the text character contained in its parent node is the end text character.
6. A method according to any one of claims 1-3, wherein each storage location in the dictionary tree is located in a plurality of pre-partitioned storage blocks.
7. A method for identifying entity words in a text segment, comprising:
acquiring text segment data to be identified, wherein the text segment data comprises a text segment to be identified and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation;
based on a pre-established correspondence between entity word types and type text characters, determining the type text character corresponding to the specified entity word type as a specified type text character;
identifying the text segment to be identified based on the specified type text character and a pre-established dictionary tree to obtain an identification result, wherein the identification result is an entity word belonging to the specified entity word type, and the child nodes of the root node in the dictionary tree are used for storing type text characters; the dictionary tree stores text character strings, each text character string being obtained by adding, before the initial text character of an entity word, the type text character corresponding to the entity word type to which the entity word belongs; and in the dictionary tree, the text characters of each text character string are stored sequentially in their composition order.
8. The method of claim 7, wherein the identifying the text segment to be identified based on the specified type text character and a pre-established dictionary tree to obtain an identification result comprises:
determining, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the specified type text character as a target node;
identifying the text segment to be identified starting from the target node to obtain the identification result.
9. The method according to claim 8, wherein the identifying the text segment to be identified starting from the target node to obtain the identification result comprises:
determining, among the text characters contained in the text segment to be identified, a text character contained in a child node of the target node as a target text character;
if the child nodes of the node to which the target text character belongs comprise an entity word identification node, determining the text character string corresponding to the node to which the target text character belongs as an identification result, wherein the entity word identification node is used for indicating that the text character contained in its parent node is the end text character of an entity word;
if, in the text segment to be identified, the next text character after the target text character is contained in a child node of the node to which the target text character belongs, taking the next text character as the target text character and returning to the step of determining, if the child nodes of the node to which the target text character belongs comprise an entity word identification node, the text character string corresponding to the node to which the target text character belongs as an identification result.
10. An entity word storage device, comprising:
the entity word acquisition module is used for acquiring entity word data to be stored, wherein the entity word data comprises entity words to be stored and entity word types to which the entity words to be stored belong, and the entity word types to which the entity words to be stored belong are used as the entity word types to be stored;
the first type text character determining module is used for determining type text characters corresponding to the entity word types to be stored as target type text characters based on the corresponding relation between the pre-established entity word types and the type text characters;
the adding module is used for adding the target type text character before the initial text character of the entity word to be stored to obtain a text character string which is used as the text character string to be stored;
the storage module is used for storing the text character strings to be stored into a pre-established dictionary tree, and the child nodes of the root node in the dictionary tree are used for storing the type text characters;
the storage module is specifically configured to determine the sub-string of the text character string to be stored that is already stored in the dictionary tree as the stored text character string, determine the sub-string of the text character string to be stored that is not yet stored in the dictionary tree as the non-stored text character string, and, starting from the initial text character of the non-stored text character string, sequentially store the text characters according to their composition order in the non-stored text character string.
11. The apparatus of claim 10, wherein the initial text character of the non-stored text character string is taken as a non-stored initial text character;
the storage module sequentially storing the text characters, starting from the initial text character of the non-stored text character string and according to their composition order in the non-stored text character string, comprises: determining, in the text character string to be stored, the offset parameter of the stored text character immediately preceding the non-stored initial text character as a target offset parameter; determining a storage position of the non-stored initial text character as a target storage position based on the target offset parameter; storing the non-stored initial text character at the target storage position; and updating the non-stored text character string and the non-stored initial text character, and returning to the determining of the offset parameter of the stored text character immediately preceding the non-stored initial text character, until every text character in the non-stored text character string has been stored.
12. The apparatus according to claim 11, wherein the storage module is specifically configured to take a predetermined offset parameter as the target offset parameter if the predetermined offset parameter already exists for the stored text character immediately preceding the non-stored initial text character; or to determine the smallest unused offset parameter as the target offset parameter if no predetermined offset parameter exists for that stored text character.
13. The apparatus according to claim 11 or 12, wherein the storage module is specifically configured to calculate a sum of the target offset parameter and the encoding value of the non-stored starting text character as a preprocessing value, and based on the preprocessing value, determine a storage index of the non-stored starting text character as a target storage index, and determine a storage location associated with the target storage index as the target storage location based on a pre-established association between the storage index and the storage location.
14. The apparatus according to any one of claims 10-12, wherein the storage module is further configured to add, after the end text character of the text character string to be stored has been stored, an entity word identification node as a child of the node where the end text character is located, where the entity word identification node is configured to indicate that the text character contained in its parent node is the end text character.
15. The apparatus of any of claims 10-12, wherein each storage location in the dictionary tree is located in a plurality of pre-partitioned storage blocks.
16. An apparatus for recognizing an entity word in a text segment, comprising:
the text segment data acquisition module is used for acquiring text segment data to be identified, wherein the text segment data comprises a text segment to be identified and a specified entity word type, and the specified entity word type is determined based on a received entity word type selection operation;
the second type text character determining module is used for determining, based on a pre-established correspondence between entity word types and type text characters, the type text character corresponding to the specified entity word type as a specified type text character;
the identification module is used for identifying the text segment to be identified based on the specified type text character and a pre-established dictionary tree to obtain an identification result, wherein the identification result is an entity word belonging to the specified entity word type, and the child nodes of the root node in the dictionary tree are used for storing type text characters; the dictionary tree stores text character strings, each text character string being obtained by adding, before the initial text character of an entity word, the type text character corresponding to the entity word type to which the entity word belongs; and in the dictionary tree, the text characters of each text character string are stored sequentially in their composition order.
17. The apparatus according to claim 16, wherein the identifying module is specifically configured to determine, among the child nodes of the root node of the pre-established dictionary tree, the child node containing the text character of the specified type as a target node, and identify the text segment to be identified from the target node, so as to obtain an identification result.
18. The apparatus according to claim 17, wherein the identifying module is specifically configured to: determine, among the text characters contained in the text segment to be identified, a text character contained in a child node of the target node as a target text character; if the child nodes of the node to which the target text character belongs include an entity word identification node, determine the text character string corresponding to the node to which the target text character belongs as an identification result, wherein the entity word identification node is used for indicating that the text character contained in its parent node is the end text character of an entity word; and if, in the text segment to be identified, the next text character after the target text character is contained in a child node of the node to which the target text character belongs, take the next text character as the target text character and return to the step of determining, if the child nodes of the node to which the target text character belongs include an entity word identification node, the text character string corresponding to the node to which the target text character belongs as an identification result.
19. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for carrying out the method steps of any one of claims 1-6 or 7-9 when executing the program stored in the memory.
CN202010091208.4A 2020-02-13 2020-02-13 Entity word storage method and device and electronic equipment Active CN111309851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010091208.4A CN111309851B (en) 2020-02-13 2020-02-13 Entity word storage method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111309851A CN111309851A (en) 2020-06-19
CN111309851B true CN111309851B (en) 2023-09-19

Family

ID=71144977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010091208.4A Active CN111309851B (en) 2020-02-13 2020-02-13 Entity word storage method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111309851B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000151419A (en) * 1998-11-05 2000-05-30 Asahi Chem Ind Co Ltd Data compression method and data compression unit
CN1987848A (en) * 2005-12-22 2007-06-27 国际商业机器公司 Character string processing method and apparatus
CN102227725A (en) * 2008-12-02 2011-10-26 艾利森电话股份有限公司 System and method for matching entities
KR20140123884A (en) * 2013-07-19 2014-10-23 주식회사 큐키 Type error correction method
CN105512118A (en) * 2014-09-22 2016-04-20 珠海金山办公软件有限公司 User demand feedback method and device
EP3306823A1 (en) * 2016-10-06 2018-04-11 Fujitsu Limited Encoding program, encoding apparatus and encoding method
CN108334491A (en) * 2017-09-08 2018-07-27 腾讯科技(深圳)有限公司 Text analyzing method, apparatus, computing device and storage medium
CN108874774A (en) * 2018-06-05 2018-11-23 浪潮软件股份有限公司 A kind of service calling method and system based on intention understanding
CN109213844A (en) * 2018-08-13 2019-01-15 腾讯科技(深圳)有限公司 A kind of text handling method, device and relevant device
CN110262674A (en) * 2019-06-27 2019-09-20 北京金山安全软件有限公司 Chinese character input method and device based on pinyin input and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0117721D0 (en) * 2001-07-20 2001-09-12 Surfcontrol Plc Database and method of generating same
US9684648B2 (en) * 2012-05-31 2017-06-20 International Business Machines Corporation Disambiguating words within a text segment

Also Published As

Publication number Publication date
CN111309851A (en) 2020-06-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant