CN115186077A - Slot position information extraction method and device, electronic equipment and storage medium

Info

Publication number: CN115186077A
Application number: CN202210806256.6A
Authority: CN
Inventors: 宁时贤
Original assignee: Beijing Longzhi Digital Technology Service Co Ltd
Current assignee: Beijing Longzhi Digital Technology Service Co Ltd
Priority date: 2022-07-08
Filing date: 2022-07-08
Publication date: 2022-10-14

Abstract

The disclosure relates to the technical field of artificial intelligence, and provides a slot position information extraction method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring content to be identified, and determining a slot position scene category corresponding to the content to be identified; determining a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one or a combination of an accurate matching mode or a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of the sequence labeling mode and the fuzzy matching mode, or a combination of the sequence labeling mode, the accurate matching mode and the fuzzy matching mode; and based on the slot position information extraction strategy, extracting slot position information corresponding to the slot position scene category from the content to be identified, and outputting the slot position information. The method and the device can shorten the response time of the input of the user, improve the user experience, and effectively solve the problems of selection and output of the nested entity.

Description

Slot position information extraction method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a slot position information extraction method and apparatus, an electronic device, and a storage medium.

Background

When a conversation robot is deployed, key information in the query content input by the user generally needs to be identified by means of slot information extraction, so that it can be used to control the jumps of the conversation logic, to identify the keywords of the conversation, and the like.

Currently, the two most commonly used methods for extracting slot information are as follows. The first is a rule- and dictionary-based method, in which a corresponding regular expression or other rule is designed and the character strings that conform to the pattern are matched as the slot extraction result. Its disadvantages are that the keywords input by the user must be completely consistent with the pre-stored word list, and that when the word list is large, the processing time of the regular expression increases almost linearly, so that the server's response time to the user's input easily becomes too long and the user experience is very poor. The second is a sequence labeling method, which assigns a label to each word in the sentence according to a required scheme; the label format generally follows one of two standards, IOB2 or IOBES. This method cannot solve the problems of selection and output of nested entities.

Disclosure of Invention

In view of this, embodiments of the present disclosure provide a slot information extraction method, an apparatus, an electronic device, and a computer-readable storage medium, so as to solve the problems that a response time of a current slot information extraction method to an input of a user is too long, a user experience is poor, and selection and output of a nested entity cannot be solved.

In a first aspect of the embodiments of the present disclosure, a slot information extraction method is provided, including:

acquiring content to be identified, and determining a slot position scene category corresponding to the content to be identified;

determining a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one of, or a combination of, an accurate matching mode and a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of a sequence labeling mode and a fuzzy matching mode, or a combination of a sequence labeling mode, an accurate matching mode and a fuzzy matching mode;

and based on the slot position information extraction strategy, extracting slot position information corresponding to the slot position scene category from the content to be identified, and outputting the slot position information.

In a second aspect of the embodiments of the present disclosure, there is provided a slot position information extracting apparatus, including:

the content acquisition module is configured to acquire content to be identified and determine slot scene categories corresponding to the content to be identified;

the strategy determining module is configured to determine a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one or a combination of an accurate matching mode or a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of a sequence labeling mode and a fuzzy matching mode, or a combination of a sequence labeling mode, an accurate matching mode and a fuzzy matching mode;

and the information extraction module is configured to extract slot position information corresponding to the slot position scene category from the content to be identified based on the slot position information extraction strategy and output the slot position information.

In a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.

In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.

Compared with the prior art, the beneficial effects of the embodiment of the disclosure at least comprise: determining a slot position scene category corresponding to the content to be identified by acquiring the content to be identified; determining a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one or a combination of an accurate matching mode or a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of the sequence labeling mode and the fuzzy matching mode, or a combination of the sequence labeling mode, the accurate matching mode and the fuzzy matching mode; based on the slot position information extraction strategy, the slot position information corresponding to the slot position scene category is extracted from the content to be identified, and the slot position information is output, so that the response time of the user input can be shortened, the user experience is improved, and the problems of selection and output of the nested entity can be effectively solved.

Drawings

To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive efforts.

Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;

Fig. 2 is a schematic flowchart of a slot information extraction method provided in an embodiment of the present disclosure;

Fig. 3 is a schematic diagram of a method for constructing an interval tree in a slot information extraction method provided by the embodiment of the present disclosure;

Fig. 4 is a schematic diagram of a construction method of a multi-way tree in the slot information extraction method provided by the embodiment of the present disclosure;

Fig. 5 is a schematic structural diagram of a slot information extraction apparatus according to an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.

A slot position information extraction method and apparatus according to an embodiment of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a scene schematic diagram of an application scenario according to an embodiment of the present disclosure. The application scenario may include terminal device 101, server 102, and network 103.

The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices that have a microphone and a speaker and support human-machine conversation, including but not limited to a smartphone, a tablet computer, a laptop computer, a desktop computer, and the like; when the terminal device 101 is software, it can be installed in an electronic device as described above. The terminal device 101 may be implemented as a plurality of pieces of software or software modules, or may be implemented as a single piece of software or software module, which is not limited in this disclosure. Further, various applications, such as a data processing application, an instant messaging tool, social platform software, a search-type application, and the like, may be installed on the terminal device 101.

The server 102 may be a server that provides various services, for example, a man-machine conversation server that answers questions and the like input through a terminal device with which it has established a communication connection. The man-machine conversation server can receive and analyze the request sent by the terminal device and generate a processing result. The server 102 may be a single server, a server cluster composed of several servers, or a cloud computing service center, which is not limited in this disclosure.

The server 102 may be hardware or software. When the server 102 is hardware, it may be various electronic devices that provide various services to the terminal device 101. When the server 102 is software, it may be a plurality of software or software modules providing various services for the terminal device 101, or may be a single software or software module providing various services for the terminal device 101, which is not limited by the embodiment of the present disclosure.

The network 103 may be a wired network connected by a coaxial cable, a twisted pair cable, or an optical fiber, or may be a wireless network that interconnects various communication devices without wiring, for example, Bluetooth, Near Field Communication (NFC), infrared, and the like, which is not limited in the embodiment of the present disclosure.

In the embodiment of the present disclosure, the server 102 may establish a communication connection with the terminal device 101 via the network 103 to receive the content to be identified input through the terminal device 101. The server then determines a slot position scene category corresponding to the content to be identified. Next, it determines a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one of, or a combination of, an accurate matching mode and a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of a sequence labeling mode and a fuzzy matching mode, or a combination of a sequence labeling mode, an accurate matching mode and a fuzzy matching mode. Finally, based on the slot position information extraction strategy, it extracts the slot position information corresponding to the slot position scene category from the content to be identified and outputs the slot position information. In this way, the response time to the user's input can be shortened, the user experience is improved, and the problems of selection and output of nested entities can be effectively solved.

It should be noted that specific types, numbers, and combinations of the terminal device 101, the server 102, and the network 103 may be adjusted according to actual requirements of an application scenario, and the embodiment of the present disclosure does not limit this.

Fig. 2 is a schematic flowchart of a slot information extraction method according to an embodiment of the disclosure. The slot information extraction method of fig. 2 may be performed by the server 102 of fig. 1. As shown in fig. 2, the slot information extraction method includes:

Step S201, acquiring content to be identified, and determining a slot scene category corresponding to the content to be identified.

The content to be recognized generally refers to chat question and answer content or query content entered by a user when a conversation robot carries out a man-machine conversation. The content to be recognized may be in the form of voice or text, etc.

The slot scene category generally refers to the application scene category corresponding to the intention of the content to be identified that is input by the user. The intention refers to the kind of information the user is asking about. For example, if the content to be identified input by the user is "order an air ticket flying from Shanghai to Hainan tomorrow afternoon", the server can identify that the intention of the user is "order an air ticket", and the application scene category corresponding to the intention "order an air ticket" is the "order an air ticket scene". For another example, if the content to be identified input by the user is "recommend me tourist attractions in city A and city B", the server can identify that the intention of the user is "attraction recommendation", and the application scene category corresponding to the intention "attraction recommendation" is the "attraction recommendation scene".

As an example, the correspondence of intent to application scenario category may be predefined.

The user can input a question that the user wants to ask or query contents (i.e., contents to be recognized) through a terminal device (e.g., a smartphone), and send the contents to be recognized to a server (e.g., a man-machine conversation server) via the terminal device. When the server receives the content to be identified, the intention corresponding to the content to be identified can be identified, and then the application scene category corresponding to the intention is determined, namely the slot position scene category corresponding to the content to be identified can be determined.
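The predefined correspondence between intention and application scene category described above can be sketched as a simple lookup table. This is only a minimal illustration: the table contents, the name `INTENT_TO_SCENE` and the function `slot_scene_category` are hypothetical, and the intention recognition itself is assumed to happen upstream.

```python
from typing import Optional

# Hypothetical predefined table; the text only states that such a
# correspondence may be predefined, not how it is stored.
INTENT_TO_SCENE = {
    "order an air ticket": "order an air ticket scene",
    "attraction recommendation": "attraction recommendation scene",
}


def slot_scene_category(intent: str) -> Optional[str]:
    """Return the slot scene category predefined for a recognized intention."""
    return INTENT_TO_SCENE.get(intent)
```

An intention that has no predefined entry yields `None` in this sketch; how such a case is handled is not specified in the text.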

Step S202, determining a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one of, or a combination of, an accurate matching mode and a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of a sequence labeling mode and a fuzzy matching mode, or a combination of a sequence labeling mode, an accurate matching mode and a fuzzy matching mode.

As an example, the semantic slots corresponding to each slot scene category and the slot information extraction mode corresponding to each semantic slot may be predefined. Taking the slot position scene category "air ticket booking scene" as an example, the semantic slots corresponding to the "air ticket booking scene" may include "city name: ()", "airport address: ()", "time: ()", "number: ()", "seat position: ()", and the like. Then, the semantic slots corresponding to each slot scene category can be classified to obtain a classification result, and the slot information extraction strategy corresponding to the slot scene category is determined according to the classification result.

The slot position information extraction modes mainly include an accurate extraction mode, a fuzzy extraction mode, and a sequence labeling extraction mode. For example, for a general semantic slot with no fixed pattern, such as name or location, the corresponding slot position information extraction mode is the sequence labeling extraction mode.

Step S203, based on the slot position information extraction strategy, extracting the slot position information corresponding to the slot position scene category from the content to be identified, and outputting the slot position information.

The slot position information refers to the content that needs to be filled into the parentheses of each semantic slot. For example, the slot position information corresponding to the semantic slot "city name: ()" may be a city name such as "Shanghai", "Hainan", or "Sichuan".

According to the technical scheme provided by the embodiment of the disclosure, the slot position scene category corresponding to the content to be identified is determined by acquiring the content to be identified; determining a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one or a combination of an accurate matching mode or a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of the sequence labeling mode and the fuzzy matching mode, or a combination of the sequence labeling mode, the accurate matching mode and the fuzzy matching mode; based on the slot position information extraction strategy, the slot position information corresponding to the slot position scene category is extracted from the content to be identified, and the slot position information is output, so that the response speed of the user input can be improved, the response time is shortened, the user experience is improved, and the problems of selection and output of the nested entity can be effectively solved.

In some embodiments, determining a slot information extraction policy corresponding to a slot scene category comprises:

acquiring at least one semantic slot corresponding to the slot scene category;

determining the slot position type of each semantic slot, wherein the slot position type is any one of an accurate slot position, a fuzzy slot position or a general slot position;

if the slot position type of each semantic slot position corresponding to the slot position scene category is an accurate slot position, the slot position information extraction strategy corresponding to the slot position scene category is an accurate matching mode;

if the slot type of each semantic slot corresponding to the slot scene category is a fuzzy slot, the slot information extraction strategy corresponding to the slot scene category is a fuzzy matching mode;

if the slot position type of the semantic slot position corresponding to the slot position scene type comprises an accurate slot position and a fuzzy slot position, the slot position information extraction strategy corresponding to the slot position scene type is a combination of an accurate matching mode and a fuzzy matching mode;

if the slot position types of the semantic slots corresponding to the slot position scene category comprise a general slot position and an accurate slot position, the slot position information extraction strategy corresponding to the slot position scene category is a combination of a sequence labeling mode and an accurate matching mode;

if the slot position types of the semantic slots corresponding to the slot position scene category comprise a general slot position and a fuzzy slot position, the slot position information extraction strategy corresponding to the slot position scene category is a combination of a sequence labeling mode and a fuzzy matching mode;

and if the slot position types of the semantic slots corresponding to the slot position scene category comprise a general slot position, an accurate slot position and a fuzzy slot position, the slot position information extraction strategy corresponding to the slot position scene category is a combination of a sequence labeling mode, an accurate matching mode and a fuzzy matching mode.
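The six cases above amount to collecting the distinct slot position types of a scene category and taking the matching mode that corresponds to each type. A minimal sketch follows; the names `SlotType`, `MODE_FOR_TYPE` and `extraction_strategy` are hypothetical. Note that a category containing only general slot positions is not one of the enumerated cases, so the single-mode result this sketch returns for it is an assumption.

```python
from enum import Enum


class SlotType(Enum):
    ACCURATE = "accurate"
    FUZZY = "fuzzy"
    GENERAL = "general"


# One extraction mode per slot position type, as paired in the cases above.
MODE_FOR_TYPE = {
    SlotType.GENERAL: "sequence labeling",
    SlotType.ACCURATE: "accurate matching",
    SlotType.FUZZY: "fuzzy matching",
}


def extraction_strategy(slot_types):
    """Return the extraction modes needed for the given slot position types."""
    present = set(slot_types)
    order = (SlotType.GENERAL, SlotType.ACCURATE, SlotType.FUZZY)
    return [MODE_FOR_TYPE[t] for t in order if t in present]
```

For instance, a category whose slots are two general and three fuzzy slot positions yields the combination of the sequence labeling mode and the fuzzy matching mode.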

In some embodiments, determining the slot type for each semantic slot comprises:

acquiring a response performance parameter value, an application field corresponding to the semantic slot position and a mode matching requirement corresponding to the semantic slot position;

and determining the slot position type corresponding to each semantic slot position according to the response performance parameter value, the application field and the pattern matching requirement.

In general, each type of slot scene category typically corresponds to one or more semantic slots.

The response performance parameter value generally refers to a response speed and a response duration of the dialog server to the content to be recognized input by the user.

The application fields include but are not limited to the industrial manufacturing field, the service industry field, the agriculture field, the aerospace field, the intelligent technology field and the like. Each of these fields may be further divided into sub-fields. For example, the service industry field can be further divided into fields closely related to people's daily life, such as clothing, food, housing and transportation, as well as the financial service field, the electronic commerce field and the like.

The pattern matching requirement generally refers to the matching pattern that must be adopted to extract the slot position information corresponding to a semantic slot. The matching patterns typically include exact matching, fuzzy matching, and matching without a fixed pattern.

In an embodiment, the target fields in which slot position information needs to be extracted by using the accurate matching mode may be predefined. In general, a target field is a highly specialized field whose terms of art are relatively uncommon, such as the real estate field or the financial field.

As an example, after determining an application field corresponding to a semantic slot, determining whether the application field belongs to a target field, and if the application field belongs to the target field and a response performance parameter value (e.g., response duration) meets a preset first response performance condition (e.g., response duration is within 10 ms), determining that the slot type of the semantic slot is an accurate slot.

As another example, if the response performance parameter value (e.g., the response duration) of a certain semantic slot meets a preset second response performance condition (e.g., the response duration may be relaxed to between 10 ms and 50 ms), and its pattern matching requirement is that the character string matching the semantic slot be searched for approximately (rather than exactly) in the content to be recognized, the slot position type of the semantic slot may be determined to be a fuzzy slot position.

As yet another example, if a semantic slot is general and its pattern matching requirement is that there is no fixed pattern, the semantic slot may be determined to be a general slot position. For example, semantic slots such as name and location are general slot positions.

In an exemplary embodiment, suppose a slot scene category corresponds to 5 semantic slots, numbered semantic slots A, B, C, D, and E. If, according to the above steps, semantic slots A, B, C, D, and E are all determined to be accurate slot positions, the slot position information extraction strategy corresponding to the slot scene category is the accurate matching mode. If semantic slots A, B, C, D, and E are all determined to be fuzzy slot positions, the strategy is the fuzzy matching mode. If semantic slots A, B, and C are determined to be accurate slot positions and semantic slots D and E fuzzy slot positions, the strategy is a combination of the accurate matching mode and the fuzzy matching mode. If semantic slots A and B are determined to be general slot positions and semantic slots C, D, and E accurate slot positions, the strategy is a combination of the sequence labeling mode and the accurate matching mode. If semantic slots A and B are determined to be general slot positions and semantic slots C, D, and E fuzzy slot positions, the strategy is a combination of the sequence labeling mode and the fuzzy matching mode. If semantic slot A is determined to be a general slot position, semantic slots B and C accurate slot positions, and semantic slots D and E fuzzy slot positions, the strategy is a combination of the sequence labeling mode, the accurate matching mode, and the fuzzy matching mode.
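The three examples above can be read as a small rule cascade over the response performance parameter value, the application field, and the pattern matching requirement. The sketch below is one possible reading, not the definitive rule set: the class `SemanticSlot`, the field and requirement labels, the boundary handling at 10 ms, and the order in which the rules are tried are all assumptions; only the 10 ms and 50 ms figures and the example target fields come from the text.

```python
from dataclasses import dataclass

# Example target fields and thresholds taken from the examples in the text.
TARGET_FIELDS = {"real estate", "finance"}
FIRST_LIMIT_MS = 10
SECOND_LIMIT_MS = 50


@dataclass
class SemanticSlot:
    name: str
    response_ms: float          # required response duration in milliseconds
    application_field: str
    pattern_requirement: str    # "exact", "approximate" or "no fixed pattern"
    is_general: bool = False


def slot_type(slot):
    """Classify a semantic slot as accurate, fuzzy or general (None if no rule fits)."""
    if slot.application_field in TARGET_FIELDS and slot.response_ms <= FIRST_LIMIT_MS:
        return "accurate"
    if (FIRST_LIMIT_MS <= slot.response_ms <= SECOND_LIMIT_MS
            and slot.pattern_requirement == "approximate"):
        return "fuzzy"
    if slot.is_general and slot.pattern_requirement == "no fixed pattern":
        return "general"
    return None
```

A slot in the real estate field with an 8 ms budget would be classified as accurate here, while a "person name" slot flagged as general with no fixed pattern would be classified as general.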

In some embodiments, the slot position information extraction strategy is the accurate matching mode.

Extracting, based on the slot position information extraction strategy, slot position information corresponding to the slot position scene category from the content to be identified, and outputting the slot position information, comprises:

performing multi-pattern string matching on the content to be identified based on an AC automaton algorithm and a preset prefix tree, to obtain a successfully matched first character sequence;

and determining first slot position information corresponding to each accurate slot position of the slot position scene category according to the first character sequence, and outputting the first slot position information.

Accurate matching mainly addresses queries for the head information in the content to be identified input by the user.

The AC (Aho-Corasick) automaton algorithm is a multi-pattern matching algorithm based on a finite automaton. In a preprocessing stage, the pattern set is converted into a pattern matching machine, hence the name AC automaton.

As an example, a prefix tree can be built in advance from a series of keywords pre-stored in a word list; by exploiting the common prefixes of the character strings, unnecessary string comparisons are reduced as far as possible, which improves query efficiency. Then the content to be identified input by the user is acquired, its characters are traversed one by one, and multi-pattern string matching is performed on the content to be identified over the constructed prefix tree based on the AC automaton algorithm. When the EOT special character (namely, the end-of-word symbol) is reached, the match succeeds and a first character sequence is obtained. The standard keyword corresponding to the first character sequence (namely, the first slot position information) is then looked up, and the first slot position information is output.
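A compact sketch of multi-pattern matching over a prefix tree with Aho-Corasick failure links is given below for illustration. The class name `AhoCorasick` and its layout are hypothetical; instead of an explicit EOT character, each node here keeps a list of the keywords that end at it, which plays the same end-of-word role. The lookup from a matched sequence to its standard keyword is left out.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, keywords):
        # Parallel arrays, one entry per trie node; node 0 is the root.
        self.children = [{}]
        self.fail = [0]
        self.output = [[]]   # keywords ending at this node (end-of-word marker)
        for word in keywords:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.output.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.output[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit keywords that end at the failure target.
                self.output[child] = self.output[child] + self.output[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start, end, keyword) for every keyword occurrence in text."""
        node, matches = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for word in self.output[node]:
                matches.append((i - len(word) + 1, i + 1, word))
        return matches
```

The content is scanned once, character by character, regardless of how many keywords the word list holds, which is the property that keeps the response time from growing with the size of the word list.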

In some embodiments, determining first slot position information corresponding to the precise slot position according to the first character sequence, and outputting the first slot position information includes:

judging whether the slot position intervals among the first slot position information of the accurate slot positions are overlapped;

and if the slot position intervals between the first slot position information are overlapped, constructing an interval tree, filtering the first slot position information based on the interval tree to obtain second slot position information, and outputting the second slot position information.

Referring to fig. 3, as an example, when slot position information is extracted by using the accurate matching mode, the slot position intervals of the extracted slot position information may overlap. For example, suppose the content to be identified input by the user is "order a ticket departing from Shanghai South with destination Hainan". When the first slot position information corresponding to the semantic slot "city name" is extracted by using the accurate matching mode, the extracted "Shanghai" and "Hainan" are found to intersect within "Shanghai South" (in the original Chinese, 上海南, where 上海 "Shanghai" and 海南 "Hainan" share the character 海), that is, their slot position intervals overlap. The first slot position information can then be filtered through the constructed interval tree: matching continues further along the content until the correct result, the "Hainan" of the destination, is matched, so that the second slot position information "Shanghai" and "Hainan" is obtained.

Compared with the traditional exhaustion method, the slot information with slot interval overlapping is filtered by constructing the interval tree, so that the complexity of interval overlapping judgment can be reduced, the correct slot information can be output, the response speed of question answering can be improved, and the response time can be shortened.
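The overlap filtering can be sketched with a small interval tree, here a binary search tree keyed by interval start in which every node also records the largest interval end in its subtree. The filtering policy used below, namely scanning matches from left to right and discarding any match that overlaps one already accepted, is an assumption made for illustration; the text does not spell out the exact rule. The names `IntervalTree` and `filter_overlapping` are hypothetical, and intervals are half-open `[start, end)`.

```python
class IntervalNode:
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.max_end = end            # largest end in this subtree
        self.left = self.right = None


class IntervalTree:
    def __init__(self):
        self.root = None

    def overlaps(self, start, end):
        """Return True if [start, end) overlaps any stored interval."""
        node = self.root
        while node:
            if start < node.end and node.start < end:
                return True
            # Descend left only if some interval there ends after `start`.
            if node.left and node.left.max_end > start:
                node = node.left
            else:
                node = node.right
        return False

    def insert(self, start, end):
        new = IntervalNode(start, end)
        if self.root is None:
            self.root = new
            return
        node = self.root
        while True:
            node.max_end = max(node.max_end, end)
            branch = "left" if start < node.start else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, new)
                return
            node = child


def filter_overlapping(matches):
    """Keep (start, end, word) matches that do not overlap an earlier kept match."""
    tree, kept = IntervalTree(), []
    for start, end, word in sorted(matches):
        if not tree.overlaps(start, end):
            tree.insert(start, end)
            kept.append((start, end, word))
    return kept
```

With the matches `(0, 2, "上海")`, `(1, 3, "海南")` and `(6, 8, "海南")`, the second is dropped because it shares a character with the first, leaving "上海" and the later "海南". The tree in this sketch is not self-balancing, so a practical version would need balancing to keep each overlap query logarithmic.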

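In some embodiments, the slot position information extraction strategy is the fuzzy matching mode.

Extracting, based on the slot position information extraction strategy, slot position information corresponding to the slot position scene category from the content to be identified, and outputting the slot position information, comprises: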

finding out the longest continuous matching character string which is continuously matched with the keywords stored in the preset dictionary tree in the content to be recognized;

searching a non-continuous matching character string which is non-continuously matched with the keywords stored in the dictionary tree in the content to be recognized;

calculating a first similarity between the longest continuous matching character string and the content to be recognized;

calculating a second similarity between the non-continuous matching character string and the content to be identified;

and determining third slot position information corresponding to each fuzzy slot position according to the first similarity and the second similarity.

The longest continuous matching character string is the longest continuous matching sub-string which does not contain 'garbage' elements in the content to be recognized and is continuously matched with the keywords in the dictionary tree. Where a "garbage" element generally refers to an element that is not valuable in some sense, e.g., blank lines, whitespace, etc.

The non-continuous matching character string is a character string consisting of a group of non-continuous matching sub-strings.
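The notion of a longest continuous match that skips "garbage" elements closely resembles what Python's standard `difflib.SequenceMatcher` provides, so it can serve as a rough stand-in for illustration. This is only a sketch: the source describes a dictionary tree and dynamic programming, not `difflib`, and the helper names `is_garbage` and `longest_continuous_match` as well as the choice of whitespace as the only garbage element are assumptions.

```python
from difflib import SequenceMatcher


def is_garbage(ch):
    """Treat whitespace as a 'garbage' element (an assumption for this sketch)."""
    return ch.isspace()


def longest_continuous_match(keyword, content):
    """Return the longest substring of content that continuously matches keyword."""
    matcher = SequenceMatcher(is_garbage, keyword, content, autojunk=False)
    m = matcher.find_longest_match(0, len(keyword), 0, len(content))
    return content[m.b:m.b + m.size]
```

For the keyword "上海南站" and the content "订一张从上海的南站出发的机票", the two candidate runs "上海" and "南站" have equal length, and `find_longest_match` returns the one that starts earliest in the keyword, "上海".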

The fuzzy matching mode adopts a dynamic programming strategy, and the specific process is as follows:

As an example, assume that the keyword "Shanghai South Station" is pre-stored in the dictionary tree, and the content to be recognized input by the user is "order a ticket departing from the South Station of Shanghai with the destination Hainan". When the slot position information corresponding to the semantic slot "city name" is extracted from the content in the fuzzy matching mode, traversing each character of the content to be recognized finds the longest continuous matching character string "Shanghai", whose characters continuously match the keyword "Shanghai South Station". At the same time, the non-continuous matching character string "South Station of Shanghai", which non-continuously matches the keyword "Shanghai South Station", can be found. Then a first similarity between the longest continuous matching character string and the content to be recognized is calculated, namely a first similarity between "Shanghai" and "Shanghai South Station"; and a second similarity between the non-continuous matching character string and the content to be recognized is calculated, namely a second similarity between "South Station of Shanghai" and "Shanghai South Station". The first similarity and the second similarity may be cosine similarities. Illustratively, if the calculated first similarity is 85% and the second similarity is 90%, the second similarity is the higher of the two, and the third slot position information corresponding to the semantic slot "city name" is "South Station of Shanghai".
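The similarity comparison in this example can be sketched as follows. The disclosure only says that the similarities may be cosine similarities, so the vectorization is an assumption: each string is turned into a vector of character counts. Following the worked example, each candidate is compared with the keyword, and the candidate with the higher similarity is chosen. The function names and the Chinese strings (an assumed rendering of "Shanghai South Station", "Shanghai" and "South Station of Shanghai") are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of character-count vectors (assumed vectorization)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(count * vb[ch] for ch, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_candidate(keyword: str, longest: str, non_contiguous: str) -> str:
    """Return the candidate string that is more similar to the keyword."""
    first = cosine_similarity(longest, keyword)          # first similarity
    second = cosine_similarity(non_contiguous, keyword)  # second similarity
    return non_contiguous if second > first else longest


keyword = "上海南站"              # "Shanghai South Station" (assumed original)
longest = "上海"                  # "Shanghai"
non_contiguous = "上海的南站"      # "South Station of Shanghai"
third_slot_info = pick_candidate(keyword, longest, non_contiguous)
```

In this sketch a tie falls back to the longest continuous match; the source does not state a tie-breaking rule, so that choice is arbitrary.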

The non-continuous matching character string is composed of a plurality of non-continuous matching sub-strings, and the maximum character interval among the plurality of non-continuous matching sub-strings is 1.

The non-continuous matching character string "South Station of Shanghai" in the above example is composed of the group of non-continuous matching sub-strings "Shanghai", "of" and "South Station". "Shanghai" continuously matches "Shanghai" in the keyword "Shanghai South Station", and "South Station" continuously matches "South Station" in the keyword "Shanghai South Station". In "South Station of Shanghai", "Shanghai" and "South Station" are separated only by "of", that is, by a character interval of 1.
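The search for the longest continuous match and for a non-continuous match with a maximum gap of 1 can be sketched with Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen because its junk-free longest matching block resembles the description above; the disclosure does not name a library, and `difflib` is not the dynamic programming procedure the text refers to. The helper names, the `max_gap` parameter and the Chinese sample strings are assumptions for illustration.

```python
from difflib import SequenceMatcher


def longest_contiguous(text: str, keyword: str) -> str:
    """Longest block of `text` that matches `keyword` without interruption."""
    sm = SequenceMatcher(None, text, keyword, autojunk=False)
    m = sm.find_longest_match(0, len(text), 0, len(keyword))
    return text[m.a:m.a + m.size]


def non_contiguous(text: str, keyword: str, max_gap: int = 1) -> str:
    """Span of `text` covering matching blocks separated by at most `max_gap` characters."""
    sm = SequenceMatcher(None, text, keyword, autojunk=False)
    # The final block returned by get_matching_blocks() is a size-0 sentinel.
    blocks = [b for b in sm.get_matching_blocks() if b.size > 0]
    if not blocks:
        return ""
    start = blocks[0].a
    end = blocks[0].a + blocks[0].size
    for b in blocks[1:]:
        if b.a - end > max_gap:  # gap too wide: stop extending the span
            break
        end = b.a + b.size
    return text[start:end]


text = "从上海的南站出发"   # "departing from the South Station of Shanghai" (assumed)
keyword = "上海南站"        # "Shanghai South Station" (assumed)
longest = longest_contiguous(text, keyword)
gapped = non_contiguous(text, keyword)
```

Here the two matching blocks are separated by a single character in the text, so they are merged into one candidate span; a wider gap would leave only the first block.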

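In some embodiments, the slot position information extraction strategy is a combination of the sequence labeling mode, the exact matching mode and the fuzzy matching mode. Extracting the slot position information corresponding to the slot position scene category from the content to be identified based on the slot position information extraction strategy, and outputting the slot position information, includes: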

based on the accurate matching mode, extracting accurate slot position information corresponding to the accurate slot position from the content to be identified;

based on a fuzzy matching mode, extracting fuzzy slot position information corresponding to the fuzzy slot position from the content to be identified;

extracting general slot position information corresponding to the general slot positions from the content to be identified based on a sequence marking mode;

and outputting the precise slot position and the precise slot position information corresponding to the precise slot position, the fuzzy slot position information corresponding to the fuzzy slot position, and the general slot position information corresponding to the general slot position.

For the sequence labeling mode, reference may be made to the two existing tagging schemes IOB2 and IOBES, which are not described in detail here.
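For readers unfamiliar with IOB2, the following minimal sketch shows how a tag sequence produced by a sequence labeling model can be decoded into slot spans. In IOB2, `B-X` marks the first token of a slot of type X, `I-X` marks a continuation, and `O` marks tokens outside any slot. The function name, the token list and the `city` label are hypothetical; the model that produces the tags is outside the scope of the sketch.

```python
from typing import List, Tuple


def decode_iob2(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Turn IOB2 tags into (label, start index, end index) spans, ends inclusive."""
    spans: List[Tuple[str, int, int]] = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label as the open span simply extends it.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i - 1))
            label, start = None, None
        # B- opens a new span; a stray I- is leniently treated as an opening tag.
        if tag.startswith("B-") or tag.startswith("I-"):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags) - 1))
    return spans


tokens = ["ticket", "to", "New", "York", "from", "Boston"]
tags = ["O", "O", "B-city", "I-city", "O", "B-city"]
slots = [(label, " ".join(tokens[s:e + 1])) for label, s, e in decode_iob2(tags)]
```

IOBES adds explicit end (`E-`) and single-token (`S-`) tags; decoding it follows the same pattern with two more cases.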

As an example, assume that a slot scene category corresponds to 5 semantic slots, denoted semantic slots A, B, C, D and E, where semantic slot A is a general slot, semantic slots B and C are precise slots, and semantic slots D and E are fuzzy slots. Then the general slot position information a corresponding to semantic slot A can be extracted from the content to be identified based on the sequence labeling mode; the precise slot position information b and c respectively corresponding to semantic slots B and C can be extracted from the content to be identified based on the exact matching mode; and the fuzzy slot position information d and e respectively corresponding to semantic slots D and E can be extracted from the content to be identified based on the fuzzy matching mode. The output is then: semantic slot A: general slot position information a; semantic slot B: precise slot position information b; semantic slot C: precise slot position information c; semantic slot D: fuzzy slot position information d; semantic slot E: fuzzy slot position information e.

In some embodiments, outputting the precise slot position and its corresponding precise slot position information, the fuzzy slot position information corresponding to the fuzzy slot position, and the general slot position and its corresponding general slot position information includes:

judging whether the slot position intervals among the accurate slot position information, the fuzzy slot position information and the general slot position information are overlapped or not;

and if the slot position intervals among the accurate slot position information, the fuzzy slot position information and the general slot position information are overlapped, constructing a multi-branch tree, filtering the accurate slot position information, the fuzzy slot position information and the general slot position information based on the multi-branch tree to obtain final slot position information corresponding to the slot position scene type, and outputting the final slot position information.

In some embodiments, constructing a multi-branch tree and filtering the accurate slot information, the fuzzy slot information and the general slot information based on the multi-branch tree to obtain final slot information corresponding to the slot scene category includes:

dividing the accurate slot position information, the fuzzy slot position information and the general slot position information into a target slot position information and a plurality of other slot position information;

constructing a root node of the multi-branch tree, and adding target slot position information into the multi-branch tree;

sequentially reading other slot position information, and comparing the slot position interval of the currently read other slot position information with the node interval of the leaf node of the multi-branch tree to obtain a comparison result;

if the comparison result is that the intersection exists between the slot position interval and the node interval, a child node is added to a father node of the leaf node to widen the current slot position interval of the multi-branch tree;

if the comparison result is that no intersection exists between the slot position interval and the node interval, a child node is added at the leaf node to deepen the current slot position interval of the multi-branch tree;

and returning the traversal paths from the root node to all leaf nodes of the multi-branch tree, and outputting final slot position combination information corresponding to the traversal paths.

As an example, assuming that slot intervals overlap among the general slot position information a, the precise slot position information b, the precise slot position information c, the fuzzy slot position information d and the fuzzy slot position information e, the precise, fuzzy and general slot position information may be divided into one piece of target slot position information and a plurality of pieces of other slot position information. Illustratively, the precise slot position information b is taken as the target slot position information, and the general slot position information a, the precise slot position information c, the fuzzy slot position information d and the fuzzy slot position information e are taken as the other slot position information. A root node (ROOT) is constructed, and the target slot position information is added to the child0 leaf node of the multi-branch tree. Then the other slot position information is read one piece at a time in the order general slot position information a → precise slot position information c → fuzzy slot position information d → fuzzy slot position information e, and the slot interval of the currently read piece is compared with the node intervals of all leaf nodes of the multi-branch tree to obtain a comparison result. If the comparison result is that the slot interval of the currently read slot position information intersects the node interval of a leaf node of the multi-branch tree, a child node is added to the parent node of that leaf node to widen the current slot position interval of the multi-branch tree; if the comparison result is that no intersection exists between the slot interval and the node interval, a child node is added at the leaf node to deepen the current slot position interval of the multi-branch tree. Finally, the traversal paths from the root node to all leaf nodes are returned, and each path represents one set of final slot position combination information.

All the entities of the general slot position information a, the precise slot position information b, the precise slot position information c, the fuzzy slot position information d and the fuzzy slot position information e can be sorted (for example, by char_offset). The slot position information with the maximum total entity length is output preferentially: it is determined as the target slot position information and is added to the multi-branch tree.

With reference to fig. 4, a root node (ROOT) is first constructed. If "Shanghai" is ranked first according to the entity sorting result, "Shanghai" is added to the child0 leaf node of the multi-branch tree, and "Shanghai" and its interval [1,2] are stored in that node. Next, the entity "Hainan" is read; its interval is [2,3], which overlaps the interval of "Shanghai", so a child node child1 is newly added to the parent node, and "Hainan" and its interval are stored in the child1 leaf node. Next, the entity "Beijing" is read; its interval is [4,5], which overlaps neither the interval of "Shanghai" nor that of "Hainan", so a child node child2 is added under each of the two leaf nodes child0 and child1, and "Beijing" and its interval are stored in the child2 leaf nodes. Next, the entity "Sichuan" is read; its interval is [6,7], which does not overlap the intervals of "Shanghai", "Hainan" or "Beijing", so a child node child3 is newly added under each of the two current leaf nodes (the child2 nodes below child0 and child1). All the entities are read in sequence in this manner. Finally, the traversal paths from the root node to all the leaf nodes are returned, and each path represents one set of final slot position information. For example, the path child0 → child2 → child3 represents the slot position information combination "Shanghai", "Beijing", "Sichuan"; the path child1 → child2 → child3 represents the slot position information combination "Hainan", "Beijing", "Sichuan".

If the slot result has an overlapping interval, widening the current slot interval of the multi-branch tree, namely adding a new child node (such as adding child 1) to the father node; if the slot result does not have the overlapping interval, deepening the current slot interval of the multi-branch tree, namely, adding a new child node (such as adding child 2) to the leaf node.
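The widen/deepen rule can be sketched as a small tree builder that reproduces the four-entity example of fig. 4. It applies the rule exactly as stated (compare the new entity with each current leaf only), so it does not check whether the new entity also overlaps an ancestor of that leaf; a production version would likely need that check. The class and function names, and the closed-interval overlap test, are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # closed interval [start, end]


class Node:
    def __init__(self, text: Optional[str], span: Optional[Interval],
                 parent: Optional["Node"] = None) -> None:
        self.text, self.span, self.parent = text, span, parent
        self.children: List["Node"] = []


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def build_tree(entities: List[Tuple[str, Interval]]) -> Node:
    """Insert entities (already sorted by offset) into a multi-branch tree."""
    root = Node(None, None)
    for text, span in entities:
        targets: List[Node] = []
        for leaf in leaves(root):  # snapshot of the leaves before insertion
            if leaf is root:
                target = root              # empty tree: attach to the root
            elif overlaps(span, leaf.span):
                target = leaf.parent       # widen: sibling of the leaf
            else:
                target = leaf              # deepen: child of the leaf
            if not any(target is t for t in targets):
                targets.append(target)     # add at most once per target
        for target in targets:
            target.children.append(Node(text, span, parent=target))
    return root


def paths(root: Node) -> List[List[str]]:
    """Return the entity texts along every root-to-leaf path."""
    result = []
    for leaf in leaves(root):
        path, node = [], leaf
        while node.parent is not None:
            path.append(node.text)
            node = node.parent
        result.append(path[::-1])
    return result


entities = [("Shanghai", (1, 2)), ("Hainan", (2, 3)),
            ("Beijing", (4, 5)), ("Sichuan", (6, 7))]
combinations = paths(build_tree(entities))
```

Each returned path is one candidate combination of non-overlapping slots, matching the two paths listed for fig. 4.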

According to the technical solution provided by the embodiments of the present disclosure, a prefix tree structure and the AC (Aho-Corasick) automaton algorithm are adopted in place of traditional regular expressions, the recall rate can be further improved by adding fuzzy matching, and the problems of selecting and outputting nested entities can be effectively solved by constructing the interval tree and the multi-branch tree. Specifically, the interval tree is used to judge whether the slot intervals of different slot position information overlap, which effectively solves the problems of slot loss and of finding the optimal non-overlapping slot combination. The multi-branch tree stores the possible combination results of nested slot position information and efficiently outputs the permutation and combination results of the slots. In a complex scene (such as an intelligent man-machine conversation service scene), the multiple pieces of slot position information extracted by the exact matching mode, the fuzzy matching mode and the sequence labeling mode can be fused through the tree structure, so that an optimal slot combination is obtained and valid slot results are efficiently merged and output.

In practical application, the precise matching mode can be combined with the fuzzy matching mode for use, and the accuracy of information extraction is ensured by setting a higher threshold value.

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.

The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.

Fig. 5 is a schematic diagram of a slot position information extraction apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the slot position information extraction apparatus includes:

a content obtaining module 501, configured to obtain content to be identified, and determine a slot scene category corresponding to the content to be identified;

a policy determining module 502 configured to determine a slot information extraction policy corresponding to a slot scene category, where the slot information extraction policy is one or a combination of an exact matching mode or a fuzzy matching mode, or a combination of a sequence labeling mode and an exact matching mode, or a combination of a sequence labeling mode and a fuzzy matching mode, or a combination of a sequence labeling mode, an exact matching mode and a fuzzy matching mode;

the information extraction module 503 is configured to extract slot information corresponding to the slot scene category from the content to be identified based on the slot information extraction policy, and output the slot information.

In some embodiments, the policy determining module 502 includes:

a slot position obtaining unit configured to obtain at least one semantic slot position corresponding to a slot position scene category;

the type determining unit is configured to determine a slot position type of each semantic slot position, wherein the slot position type is any one of an accurate slot position, a fuzzy slot position or a universal slot position;

the first determining unit is configured to determine that the slot position information extraction strategy corresponding to the slot position scene category is an accurate matching mode if the slot position type of each semantic slot position corresponding to the slot position scene category is an accurate slot position;

the second determining unit is configured to determine that the slot position information extraction strategy corresponding to the slot position scene category is a fuzzy matching mode if the slot position type of each semantic slot position corresponding to the slot position scene category is a fuzzy slot position;

the third determining unit is configured to determine that the slot information extraction strategy corresponding to the slot scene category is a combination of an accurate matching mode and a fuzzy matching mode if the slot type of the semantic slot corresponding to the slot scene category comprises an accurate slot and a fuzzy slot;

the fourth determining unit is configured to determine that the slot information extraction strategy corresponding to the slot scene type is a combination of a sequence labeling mode and an accurate matching mode if the slot type of the semantic slot corresponding to the slot scene type includes a general slot and an accurate slot;

a fifth determining unit, configured to determine that the slot information extraction strategy corresponding to the slot scene category is a combination of a sequence labeling mode and a fuzzy matching mode if the slot type of the semantic slot corresponding to the slot scene category includes a general slot and a fuzzy slot;

a sixth determining unit, configured to determine that the slot information extraction strategy corresponding to the slot scene category is a combination of a sequence marking mode, an accurate matching mode and a fuzzy matching mode if the slot type of the semantic slot corresponding to the slot scene category includes a general slot, an accurate slot and a fuzzy slot.
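The six determining units above amount to a mapping from the set of slot types present in a scene category to a set of extraction modes. A minimal sketch of that mapping follows; the enum and function names are hypothetical. A category containing only general slots is not among the enumerated cases, so the sketch raises an error for it rather than guessing a behavior.

```python
from enum import Enum
from typing import FrozenSet, Iterable


class SlotType(Enum):
    PRECISE = "precise"
    FUZZY = "fuzzy"
    GENERAL = "general"


class Mode(Enum):
    EXACT_MATCH = "exact_match"
    FUZZY_MATCH = "fuzzy_match"
    SEQUENCE_LABELING = "sequence_labeling"


# Each slot type is served by one extraction mode.
MODE_FOR_TYPE = {
    SlotType.PRECISE: Mode.EXACT_MATCH,
    SlotType.FUZZY: Mode.FUZZY_MATCH,
    SlotType.GENERAL: Mode.SEQUENCE_LABELING,
}


def select_strategy(slot_types: Iterable[SlotType]) -> FrozenSet[Mode]:
    """Return the combination of modes for the slot types of a scene category."""
    modes = frozenset(MODE_FOR_TYPE[t] for t in slot_types)
    # Not enumerated in the source: no slots at all, or general slots only.
    if not modes or modes == frozenset({Mode.SEQUENCE_LABELING}):
        raise ValueError("slot type combination is not covered by the listed cases")
    return modes


strategy = select_strategy([SlotType.GENERAL, SlotType.PRECISE, SlotType.FUZZY])
```

Representing the strategy as a set keeps the six cases in one lookup instead of six separate branches, and the order of the slot types in the input does not matter.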

In some embodiments, the type determining unit may be specifically configured to:

and determining the slot type corresponding to each semantic slot according to the response performance parameter value, the application field and the pattern matching requirement.

In some embodiments, the slot information extraction policy is an exact match pattern. The information extraction module 503 includes:

the matching unit is configured to perform multi-pattern string matching on the content to be recognized based on the AC automaton algorithm and a preset prefix tree so as to obtain a first character sequence which is successfully matched;

and the output unit is configured to determine first slot position information corresponding to each accurate slot position of the slot position scene category according to the first character sequence and output the first slot position information.
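As a rough illustration of multi-pattern matching with a prefix tree and the AC (Aho-Corasick) automaton, the sketch below builds the trie, adds failure links by breadth-first traversal, and reports every keyword occurrence with its character interval. It is a textbook construction, not the implementation of the disclosure; the function names, the list-based state layout and the short Chinese sample (keywords "上海" for Shanghai and "海南" for Hainan) are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[List[str]]]


def build_automaton(keywords: List[str]) -> Automaton:
    goto: List[Dict[str, int]] = [{}]  # prefix tree: state -> {char: next state}
    fail: List[int] = [0]              # failure link of each state
    out: List[List[str]] = [[]]        # keywords recognized at each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit suffix matches
    return goto, fail, out


def find_all(text: str, automaton: Automaton) -> List[Tuple[int, int, str]]:
    """Return (start, end, keyword) for every occurrence, offsets inclusive."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, i, word))
    return hits


automaton = build_automaton(["上海", "海南"])
hits = find_all("从上海南到海南", automaton)
```

The text is scanned once regardless of the number of keywords, and the sample shows how the overlapping intervals discussed in the method embodiments arise: the first two hits share one character.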

In some embodiments, the output unit may be specifically configured to:

judging whether slot position intervals among the first slot position information of the accurate slot positions are overlapped or not;

In some embodiments, the slot information extraction strategy is a fuzzy matching pattern. The information extraction module 503 includes:

the first searching unit is configured to search the longest continuous matching character string which is continuously matched with the keywords stored in the preset dictionary tree in the content to be recognized;

the second searching unit is configured to search discontinuous matching character strings which are discontinuously matched with the keywords stored in the dictionary tree in the content to be recognized;

a first calculating unit configured to calculate a first similarity between the longest continuous matching character string and the content to be recognized;

the second calculation unit is configured to calculate a second similarity between the non-continuous matching character strings and the content to be recognized;

and the determining unit is configured to determine third slot information corresponding to each fuzzy slot according to the first similarity and the second similarity.

In some embodiments, the non-consecutive matching strings are composed of a plurality of non-consecutive matching sub-strings, and the maximum character spacing between the plurality of non-consecutive matching sub-strings is 1.

In some embodiments, the slot information extraction strategy is a combination of a sequence annotation pattern, an exact match pattern, and a fuzzy match pattern. The information extraction module 503 includes:

a first extraction unit configured to extract accurate slot position information corresponding to the accurate slot position from the content to be recognized based on the accurate matching pattern;

the second extraction unit is configured to extract fuzzy slot position information corresponding to the fuzzy slot position from the content to be identified based on the fuzzy matching mode;

the third extraction unit is configured to extract the universal slot position information corresponding to the universal slot position from the content to be identified based on the sequence marking mode;

and the information output unit is configured to output the precise slot position and the precise slot position information corresponding to the precise slot position, the fuzzy slot position information corresponding to the fuzzy slot position, and the general slot position information corresponding to the general slot position.

In some embodiments, the information output unit may be specifically configured to:

judging whether slot position intervals among the accurate slot position information, the fuzzy slot position information and the general slot position information are overlapped or not;

if the comparison result is that the slot position interval and the node interval have intersection, a child node is newly added to a parent node of the leaf node so as to widen the current slot position interval of the multi-branch tree;

and returning the traversal paths from the root node to all leaf nodes of the multi-branch tree, and outputting final combination slot position information corresponding to the traversal paths.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present disclosure.

Fig. 6 is a schematic diagram of an electronic device 6 provided by an embodiment of the present disclosure. As shown in fig. 6, the electronic apparatus 6 of this embodiment includes: a processor 601, a memory 602, and a computer program 603 stored in the memory 602 and executable on the processor 601. The steps in the various method embodiments described above are implemented when the processor 601 executes the computer program 603. Alternatively, the processor 601 realizes the functions of each module/unit in each apparatus embodiment described above when executing the computer program 603.

The electronic device 6 may be a desktop computer, a notebook, a palm computer, a cloud server, or another electronic device. The electronic device 6 may include, but is not limited to, a processor 601 and a memory 602. Those skilled in the art will appreciate that fig. 6 is merely an example of the electronic device 6 and does not constitute a limitation of the electronic device 6, which may include more or fewer components than those shown, or different components.

The processor 601 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.

The memory 602 may be an internal storage unit of the electronic device 6, for example, a hard disk or a memory of the electronic device 6. The memory 602 may also be an external storage device of the electronic device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the electronic device 6. The memory 602 may also include both internal and external storage units of the electronic device 6. The memory 602 is used for storing the computer program and other programs and data required by the electronic device.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the above embodiments of the present disclosure may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments may be realized. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.

The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and they should be construed as being included in the scope of the present disclosure.

Claims

1. A slot position information extraction method, characterized by comprising the following steps:

acquiring content to be identified, and determining a slot position scene category corresponding to the content to be identified;

determining a slot position information extraction strategy corresponding to the slot position scene category, wherein the slot position information extraction strategy is one or a combination of an accurate matching mode or a fuzzy matching mode, or a combination of a sequence labeling mode and an accurate matching mode, or a combination of a sequence labeling mode and a fuzzy matching mode, or a combination of a sequence labeling mode, an accurate matching mode and a fuzzy matching mode;

based on the slot position information extraction strategy, slot position information corresponding to the slot position scene category is extracted from the content to be identified, and the slot position information is output.

2. The method of claim 1, wherein determining the slot information extraction policy corresponding to the slot scene category comprises:

acquiring at least one semantic slot corresponding to the slot scene category;

if the slot position type of each semantic slot position corresponding to the slot position scene type is an accurate slot position, the slot position information extraction strategy corresponding to the slot position scene type is an accurate matching mode;

if the slot position type of each semantic slot position corresponding to the slot position scene type is a fuzzy slot position, the slot position information extraction strategy corresponding to the slot position scene type is a fuzzy matching mode;

if the slot type of the semantic slot corresponding to the slot scene category comprises an accurate slot and a fuzzy slot, the slot information extraction strategy corresponding to the slot scene category is a combination of an accurate matching mode and a fuzzy matching mode;

if the slot position type of the semantic slot position corresponding to the slot position scene type comprises a general slot position and a fuzzy slot position, the slot position information extraction strategy corresponding to the slot position scene type is a combination of a sequence marking mode and a fuzzy matching mode;

and if the slot position type of the semantic slot position corresponding to the slot position scene type comprises a general slot position, an accurate slot position and a fuzzy slot position, the slot position information extraction strategy corresponding to the slot position scene type is a combination of a sequence marking mode, an accurate matching mode and a fuzzy matching mode.

3. The method of claim 2, wherein determining the slot type for each of the semantic slots comprises:

4. The method of claim 2, wherein the slot information extraction strategy is an exact match pattern;

based on an AC automaton algorithm and a preset prefix tree, carrying out multi-pattern string matching on the content to be identified so as to obtain a first character sequence which is successfully matched;

5. The method of claim 4, wherein determining first slot information corresponding to the precise slot according to the first sequence of characters and outputting the first slot information comprises:

judging whether slot position intervals among the first slot position information of the accurate slot positions are overlapped;

and if the slot position intervals among the first slot position information are overlapped, constructing an interval tree, filtering the first slot position information based on the interval tree to obtain second slot position information, and outputting the second slot position information.

6. The method of claim 2, wherein the slot information extraction strategy is a fuzzy matching pattern;

finding out the longest continuous matching character string which is continuously matched with the keywords stored in a preset dictionary tree in the content to be identified;

searching discontinuous matching character strings which are discontinuously matched with the keywords stored in the dictionary tree in the content to be recognized;

7. The method of claim 6, wherein the non-consecutive matching strings consist of a plurality of non-consecutive matching sub-strings, and wherein a maximum character spacing between the plurality of non-consecutive matching sub-strings is 1.

8. The method of claim 2, wherein the slot information extraction strategy is a combination of a sequence labeling mode, an exact matching mode, and a fuzzy matching mode;

based on the slot position information extraction strategy, extracting slot position information corresponding to the slot position scene category from the content to be identified, and outputting the slot position information, wherein the slot position information extraction strategy comprises the following steps:

extracting accurate slot position information corresponding to the accurate slot position from the content to be identified based on the accurate matching mode;

based on the fuzzy matching mode, extracting fuzzy slot position information corresponding to the fuzzy slot position from the content to be identified;

based on the sequence marking mode, extracting general slot position information corresponding to the general slot position from the content to be identified;

and outputting the accurate slot position and the accurate slot position information corresponding to the accurate slot position, the fuzzy slot position information corresponding to the fuzzy slot position, and the universal slot position information corresponding to the universal slot position.

9. The method of claim 8, wherein outputting the accurate slot position and its corresponding accurate slot position information, the fuzzy slot position information corresponding to the fuzzy slot position, and the general slot position and its corresponding general slot position information comprises:

determining whether the slot position intervals of the accurate slot position information, the fuzzy slot position information and the general slot position information overlap;

and if the slot position intervals of the accurate slot position information, the fuzzy slot position information and the general slot position information overlap, constructing a multi-branch tree, filtering the accurate slot position information, the fuzzy slot position information and the general slot position information based on the multi-branch tree, obtaining final slot position information corresponding to the slot position scene category, and outputting the final slot position information.
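By way of non-limiting illustration, the overlap determination of claim 9 might be performed with a single pass over the sorted intervals, as in the Python sketch below. The function name `has_overlap` is hypothetical, and the representation of a slot position interval as a half-open pair `(start, end)` of character offsets is an assumption.

```python
def has_overlap(intervals):
    """Return True if any two half-open [start, end) intervals intersect."""
    max_end = None
    for start, end in sorted(intervals):
        # After sorting by start, an overlap exists exactly when an interval
        # begins before the furthest end seen so far.
        if max_end is not None and start < max_end:
            return True
        max_end = end if max_end is None else max(max_end, end)
    return False


print(has_overlap([(0, 3), (2, 5)]))  # True
print(has_overlap([(0, 2), (2, 5)]))  # False: the intervals only touch
```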

10. The method of claim 9, wherein constructing a multi-branch tree and filtering the accurate slot position information, the fuzzy slot position information and the general slot position information based on the multi-branch tree to obtain final slot position information corresponding to the slot position scene category comprises:

dividing the accurate slot position information, the fuzzy slot position information and the general slot position information into target slot position information and a plurality of other slot position information;

constructing a root node of a multi-branch tree, and adding the target slot position information into the multi-branch tree;

sequentially reading one piece of the other slot position information, and comparing the slot position interval of the currently read other slot position information with the node interval of the leaf node of the multi-branch tree to obtain a comparison result;

if the comparison result is that the slot position interval and the node interval intersect, adding a new child node to the parent node of the leaf node so as to widen the current slot position interval of the multi-branch tree;

if the comparison result is that no intersection exists between the slot position interval and the node interval, adding a new child node at the leaf node to deepen the current slot position interval of the multi-branch tree;
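By way of non-limiting illustration, the tree construction of claim 10 might be sketched in Python as follows. The names `Node`, `build_tree` and `paths` are hypothetical. Several points are assumptions rather than statements of the claim: intervals are half-open `(start, end)` pairs, the other slot position information is read in order of increasing start offset, the comparison is applied to every current leaf node, and a duplicate sibling is not added twice. How a final path is selected is not recited in this claim, so the sketch only enumerates the candidate root-to-leaf paths.

```python
class Node:
    """Node of the multi-branch tree; the root carries no interval."""

    def __init__(self, interval=None, parent=None):
        self.interval = interval
        self.parent = parent
        self.children = []

    def add_child(self, interval):
        # Skip the insertion if this parent already holds the same interval.
        if all(child.interval != interval for child in self.children):
            self.children.append(Node(interval, parent=self))


def intersects(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def build_tree(target, others):
    root = Node()
    root.add_child(target)
    for interval in others:
        for leaf in leaves(root):  # leaves as they were before this insertion
            if intersects(interval, leaf.interval):
                leaf.parent.add_child(interval)  # widen: new sibling of the leaf
            else:
                leaf.add_child(interval)         # deepen: new child of the leaf
    return root


def paths(node, prefix=()):
    """List every root-to-leaf path as a list of intervals."""
    here = prefix if node.interval is None else prefix + (node.interval,)
    if not node.children:
        return [list(here)]
    return [p for child in node.children for p in paths(child, here)]


tree = build_tree((0, 4), [(2, 6), (6, 9)])
print(paths(tree))  # [[(0, 4), (6, 9)], [(2, 6), (6, 9)]]
```

In this example the intervals (0, 4) and (2, 6) intersect and therefore become siblings, while (6, 9) intersects neither and is placed beneath both, so each path holds a set of mutually non-overlapping slot position intervals.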

11. A slot position information extraction device, comprising:

a strategy determining module configured to determine a slot information extraction strategy corresponding to the slot scene category, wherein the slot information extraction strategy is one of, or a combination of, an exact matching mode and a fuzzy matching mode, or a combination of a sequence labeling mode and an exact matching mode, or a combination of a sequence labeling mode and a fuzzy matching mode, or a combination of a sequence labeling mode, an exact matching mode and a fuzzy matching mode;

and an information extraction module configured to extract the slot position information corresponding to the slot position scene category from the content to be identified based on the slot position information extraction strategy, and to output the slot position information.

12. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 10 when executing the computer program.

13. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.