CN113704420A - Method and device for identifying role in text, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113704420A
CN113704420A
Authority
CN
China
Prior art keywords: candidate, role, text, entity, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110294576.3A
Other languages
Chinese (zh)
Inventor
李晨曦
荆宁
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110294576.3A priority Critical patent/CN113704420A/en
Publication of CN113704420A publication Critical patent/CN113704420A/en
Pending legal-status Critical Current

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Clustering; Classification
    • G06F40/126 Character encoding
    • G06F40/295 Named entity recognition
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The application provides a method, an apparatus, an electronic device, and a computer-readable storage medium for identifying roles in a text, relating to natural language processing in the field of artificial intelligence. The method includes: extracting a plurality of role candidate words from a text and acquiring at least one matching parameter for each role candidate word; selecting at least one role candidate word from the plurality of role candidate words as a first candidate role entity according to its matching parameters; performing fusion processing on the text and a question corresponding to the text to obtain a fused text; performing entity recognition on the fused text to obtain at least one second candidate role entity; and performing role classification based on the at least one first candidate role entity and the at least one second candidate role entity to obtain the roles in the text. With the method and apparatus, roles can be identified from text accurately and efficiently.

Description

Method and device for identifying role in text, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a method and an apparatus for identifying a role in a text, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. As artificial intelligence has been researched and developed, it has been applied in an increasing number of fields.
Taking character recognition in a text (for example, a novel) as an example, character names in a text generally include various nicknames, titles, and forms of address, distinguishing them from common person names. In the related art, when the characters in a text are identified with a general-purpose name recognition model, the particularity of character-name types causes incomplete character recognition, i.e., low accuracy of the identified characters, and consumes unnecessary computing resources.
Therefore, the related art has no effective solution for accurately and efficiently identifying roles from texts.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying a role in a text, electronic equipment and a computer readable storage medium, which can accurately and efficiently identify the role from the text.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for identifying roles in a text, which comprises the following steps:
extracting a plurality of role candidate words from a text, and acquiring at least one matching parameter corresponding to each role candidate word;
selecting at least one role candidate word from the plurality of role candidate words as a first candidate role entity according to at least one matching parameter corresponding to each role candidate word;
performing fusion processing on the question corresponding to the text and the text to obtain a fusion text;
performing entity identification processing on the fusion text to obtain at least one second candidate role entity;
and carrying out role classification processing based on at least one first candidate role entity and at least one second candidate role entity to obtain roles in the text.
In the above scheme, the extracting entity features and text features from the entity sentence pair includes:
extracting a plurality of word vectors from the entity sentence pair, and determining the mean of the word vectors as the entity feature;
encoding the entity sentence in the direction from the start position to the end position to obtain a forward encoding vector;
encoding the entity sentence in the direction from the end position to the start position to obtain a backward encoding vector;
and performing fusion processing on the forward encoding vector and the backward encoding vector to obtain the text feature.
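As an illustrative sketch (not part of the claims) of the two feature computations above: the entity feature is the element-wise mean of the word vectors, and the text feature fuses the forward and backward encoding vectors. Concatenation is used here as the fusion operator, which is an assumption, since the scheme only specifies "fusion processing".

```python
def entity_feature(word_vectors):
    """Entity feature: the element-wise mean of the word vectors
    extracted from the entity sentence pair."""
    n, dim = len(word_vectors), len(word_vectors[0])
    return [sum(v[k] for v in word_vectors) / n for k in range(dim)]

def text_feature(forward_vec, backward_vec):
    """Text feature: fusion of the forward and backward encoding
    vectors (simple concatenation here -- an assumption)."""
    return list(forward_vec) + list(backward_vec)
```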
In the above scheme, the extracting a plurality of character candidate words from a text includes:
acquiring the text, and executing the following preprocessing on the text: dividing the text into a plurality of sentences according to the symbol list, and filtering out symbols in each sentence;
extracting a plurality of character candidate words from each sentence of the preprocessed text.
The embodiment of the present application provides a role recognition device in a text, including:
the first entity identification module is used for extracting a plurality of role candidate words from a text and acquiring at least one matching parameter corresponding to each role candidate word;
the first entity identification module is further configured to select at least one role candidate word from the plurality of role candidate words as a first candidate role entity according to at least one matching parameter corresponding to each role candidate word;
the second entity identification module is used for fusing the question corresponding to the text and the text to obtain a fused text;
the second entity identification module is further configured to perform entity identification processing on the fusion text to obtain at least one second candidate role entity;
and the classification module is used for performing role classification processing on the basis of at least one first candidate role entity and at least one second candidate role entity to obtain roles in the text.
In the above scheme, the types of the matching parameters include word frequency, degree of solidification and degree of freedom; the first entity identification module is further configured to perform the following processing for each of the role candidate words: determining the word frequency of the character candidate words in the text; dividing the role candidate words into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the role candidate words in the text, wherein the types of the morphemes comprise characters and words; determining the degree of solidification of the role candidate words according to the occurrence probability of each morpheme in the text and the occurrence probability of the role candidate words in the text; and determining the left information entropy and the right information entropy of the role candidate words, and determining the degree of freedom of the role candidate words according to the left information entropy and the right information entropy of the role candidate words.
In the above scheme, the first entity identification module is further configured to determine a plurality of left adjacent words and a plurality of right adjacent words of the character candidate word in the text; determining sub information entropy corresponding to each left adjacent word, and determining sub information entropy corresponding to each right adjacent word; determining the opposite number of the result of the summation of the sub-information entropies corresponding to each left adjacent word as the left information entropy, and determining the opposite number of the result of the summation of the sub-information entropies corresponding to each right adjacent word as the right information entropy; and when the left information entropy is larger than the right information entropy, determining the right information entropy as the degree of freedom, and when the left information entropy is not larger than the right information entropy, determining the left information entropy as the degree of freedom.
In the foregoing solution, the first entity identifying module is further configured to perform the following processing for each left adjacent word: determining the ratio of the occurrence frequency of the left adjacent character in the text to the occurrence frequency of all adjacent characters of the character candidate words in the text as a first ratio; carrying out logarithmic operation processing on the first ratio, and determining the product of a logarithmic operation result and the first ratio as a sub information entropy corresponding to the left adjacent word;
in the foregoing solution, the first entity identifying module is further configured to perform the following processing for each right adjacent word: determining the ratio of the occurrence times of the right adjacent characters in the text to the occurrence times of all adjacent characters of the character candidate words in the text as a second ratio; and carrying out logarithm operation processing on the second ratio, and determining the product of a logarithm operation result and the second ratio as the sub information entropy corresponding to the right adjacent word.
In the foregoing solution, the first entity identifying module is further configured to execute the following processing for each morpheme: determining the ratio of the occurrence frequency of the morphemes in the text to the occurrence frequency of all the role candidate words in the text as the occurrence probability of the morphemes in the text; and determining the ratio of the occurrence times of the role candidate words in the text to the occurrence times of all the role candidate words in the text as the occurrence probability of the role candidate words in the text.
In the above scheme, the first entity identifying module is further configured to multiply the occurrence probability of each morpheme in the text to obtain a multiplication result; determining the ratio of the occurrence probability of the role candidate words in the text to the product result as a third ratio; and carrying out logarithmic operation processing on the third ratio, and determining a logarithmic operation result as the degree of solidification of the role candidate words.
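The degree-of-solidification computation above amounts to a pointwise-mutual-information-style score; a minimal sketch:

```python
from math import log

def solidification(word_prob, morpheme_probs):
    """Degree of solidification: the logarithm of the ratio between the
    candidate word's occurrence probability and the product of its
    morphemes' occurrence probabilities."""
    product = 1.0
    for p in morpheme_probs:
        product *= p
    return log(word_prob / product)
```

A higher value means the morphemes co-occur far more often than chance would predict, i.e. the collocation is more solidified.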
In the foregoing scheme, the first entity identifying module is further configured to select, as the first candidate role entity, a role candidate word that satisfies at least one of the following conditions from among the plurality of role candidate words: the word frequency of the role candidate words in the text exceeds a word frequency threshold, the degree of solidification of the role candidate words exceeds a degree of solidification threshold, and the degree of freedom of the role candidate words exceeds a degree of freedom threshold.
In the above scheme, the second entity identification module is further configured to perform feature extraction processing on the fused text to obtain a feature sequence; mapping the characteristic sequence to obtain at least one position set; performing the following for each of the location sets: combining the characters corresponding to the starting position in the position set, the characters between the starting position and the ending position in the position set, and the characters corresponding to the ending position in the position set, and determining a combination result as the second candidate role entity.
In the foregoing scheme, the second entity identifying module is further configured to divide the feature sequence into a plurality of sub-features, where the plurality of sub-features correspond to a plurality of words in the text one to one; mapping each sub-feature to a start probability belonging to a start position and an end probability belonging to an end position; selecting at least one sub-feature with the starting probability larger than a starting probability threshold as a starting sub-feature, and selecting at least one sub-feature with the ending probability larger than an ending probability threshold as an ending sub-feature; constructing at least one candidate start-stop feature set based on the selected at least one start sub-feature and at least one end sub-feature, wherein the candidate start-stop feature set comprises a start sub-feature and an end sub-feature; determining a target start-stop feature set in the at least one candidate start-stop feature set; determining characters corresponding to the starting sub-features in the target starting and stopping feature set as starting positions in the position set, and determining characters corresponding to the ending sub-features in the target starting and stopping feature set as ending positions in the position set.
In the foregoing solution, the second entity identifying module is further configured to perform the following processing for each candidate start-stop feature set: performing fusion processing on the starting sub-feature and the ending sub-feature in the candidate starting-ending feature set to obtain a first fusion feature, and mapping the first fusion feature to be the probability of belonging to the same entity; and in the at least one candidate start-stop feature set, determining the candidate start-stop feature set with the probability of belonging to the same entity larger than an entity probability threshold as the target start-stop feature set.
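The start/end selection and same-entity pairing steps above can be sketched as follows; all threshold values are illustrative, and in the described scheme the probabilities would come from mapping the sub-features of the fused text.

```python
def extract_spans(start_probs, end_probs, pair_scores,
                  start_th=0.5, end_th=0.5, pair_th=0.5):
    """Select positions whose start/end probabilities exceed their
    thresholds, pair them into candidate start-stop sets, and keep a
    pair as an entity span when its same-entity score also exceeds
    the entity probability threshold."""
    starts = [i for i, p in enumerate(start_probs) if p > start_th]
    ends = [j for j, p in enumerate(end_probs) if p > end_th]
    spans = []
    for i in starts:
        for j in ends:
            if j >= i and pair_scores.get((i, j), 0.0) > pair_th:
                spans.append((i, j))
    return spans
```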
In the foregoing solution, the second entity identifying module is further configured to perform the following processing for each second candidate role entity: dividing the second candidate role entity into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate role entity in the text, wherein the types of the morphemes comprise characters and words; determining the degree of solidification of the second candidate role entity according to the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate role entity in the text; determining the left information entropy and the right information entropy of the second candidate role entity, and determining the degree of freedom of the second candidate role entity according to the left information entropy and the right information entropy of the second candidate role entity; filtering, among the plurality of second candidate role entities, second candidate role entities that satisfy at least one of the following conditions: the word frequency of the second candidate role entity in the text does not exceed a word frequency threshold, the degree of solidification of the second candidate role entity does not exceed a degree of solidification threshold, and the degree of freedom of the second candidate role entity does not exceed a degree of freedom threshold.
In the foregoing solution, the classification module is further configured to filter out duplicate candidate role entities from the at least one first candidate role entity and the at least one second candidate role entity; and executing the following processing aiming at each candidate role entity obtained after filtering: determining the sentence where the candidate role entity is located, and combining the candidate role entity and the sentence to obtain an entity sentence pair; extracting entity features and text features from the entity sentence pair, and fusing the entity features and the text features to obtain second fusion features; mapping the second fusion feature to a probability of belonging to a role entity; and when the probability of the role entity is greater than a role probability threshold value and the word frequency of the candidate role entity in the text is greater than a role word frequency threshold value, determining that the candidate role entity is the role in the text.
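A sketch of the final classification decision above: duplicate candidate role entities from the two extraction routes are filtered first, and a candidate is kept as a role only when both the role probability and word-frequency conditions hold. The threshold values and the input tuple format are illustrative assumptions.

```python
def final_roles(candidates, prob_th=0.5, tf_th=3):
    """candidates: iterable of (name, role_probability, word_frequency).
    Returns the names accepted as roles in the text."""
    seen, roles = set(), []
    for name, prob, tf in candidates:
        if name in seen:          # drop duplicate candidate role entities
            continue
        seen.add(name)
        if prob > prob_th and tf > tf_th:
            roles.append(name)
    return roles
```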
In the above scheme, the classification module is further configured to extract a plurality of word vectors from the entity sentence pair, and determine the mean of the plurality of word vectors as the entity feature; encode the entity sentence in the direction from the start position to the end position to obtain a forward encoding vector; encode the entity sentence in the direction from the end position to the start position to obtain a backward encoding vector; and perform fusion processing on the forward encoding vector and the backward encoding vector to obtain the text feature.
In the foregoing solution, the first entity identifying module is further configured to obtain the text, and perform the following preprocessing on the text: dividing the text into a plurality of sentences according to the symbol list, and filtering out symbols in each sentence; extracting a plurality of character candidate words from each sentence of the preprocessed text.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the role identification method in the text provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer-readable storage medium, which stores computer-executable instructions and is used for realizing the role identification method in the text provided by the embodiment of the application when being executed by a processor.
The embodiment of the present application provides a computer program product, where the computer program product includes computer-executable instructions, and is used for implementing the method for recognizing a role in a text provided by the embodiment of the present application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
candidate role entities are extracted from the text in two different ways, and the roles in the text are determined from the extracted candidate role entities; this ensures the diversity, comprehensiveness, and completeness of the identified roles, thereby improving the efficiency and accuracy of role identification from the text.
Drawings
Fig. 1 is an architecture diagram of a role recognition system 100 in text provided by an embodiment of the present application;
fig. 2 is a schematic structural diagram of a server 200 provided in an embodiment of the present application;
fig. 3 is a flowchart illustrating a method for identifying a role in a text according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for identifying a role in a text according to an embodiment of the present application;
fig. 5 is a flowchart illustrating a method for identifying a role in a text according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a machine learning model provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a role recognition framework provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second" are only to distinguish similar items and do not denote a particular order, but rather the terms "first/second" may, where permissible, be interchanged with a particular order or sequence so that embodiments of the application described herein may be practiced in other than the order shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
2) Term Frequency (TF) refers to the number of occurrences of a given word in the text.
3) The degree of solidification measures whether the characters making up a word form a reasonable collocation. For example, in the phrase "搞笑电影盘点" ("funny movie roundup"), the character pair "电影" ("movie") is a reasonable collocation while "笑电" is not, so the degree of solidification of "电影" is higher than that of "笑电".
4) The degree of freedom measures the richness of a word's left- and right-adjacent characters and can be represented by the word's information entropy; the higher the degree of freedom, the richer the word's left and right neighboring characters.
The character names in a text generally include various nicknames, titles, and forms of address, distinguishing them from common person names. The related art mainly relies on general named entity recognition models, such as sequence-labeling-based models, which use a Long Short-Term Memory network (LSTM), a Transformer model, a Bidirectional Encoder Representations from Transformers (BERT) model, or the like as the basic model framework, stack a Conditional Random Field (CRF) on top, and use the labeling results to recognize entities and entity types.
The applicant finds the following technical problems in the related art in the implementation process:
the universal named entity recognition model is mainly used for text input at paragraph and sentence levels, and has low compatibility for complete text (such as novel); when the role in the text is identified through the universal named entity identification model, the problem of incomplete role identification caused by the particularity of the role type exists.
In view of the above technical problems, embodiments of the present application provide a method for identifying a character in a text, which can accurately and efficiently identify the character from the text. The following describes an exemplary application of the method for recognizing a character in a text provided by the embodiment of the present application, which may be implemented by various electronic devices, for example, may be implemented by a terminal alone, for example, the terminal recognizes a character from a text by using its own computing capability; it may also be implemented by the terminal and the server in cooperation, for example, the terminal recognizes the character from the text by means of the computing power of the server.
Next, an embodiment of the present application is described by taking a server and a terminal as an example, and referring to fig. 1, fig. 1 is a schematic structural diagram of a role recognition system 100 in a text provided by the embodiment of the present application. The system 100 for recognizing characters in text includes: the server 200, the network 300, and the terminal 400 will be separately described.
The server 200 is a background server of the client 410, and is configured to, in response to a text reading request of the client 410, obtain a corresponding text according to a text identifier (e.g., a novel name, a novel number, an article name, and the like) in the text reading request, where the text may be a novel, news, an article, and the like; the method is also used for carrying out role identification on the text to obtain roles in the text; and also for sending the corresponding text and the role in the text to the client 410 for presentation.
The network 300, which is used as a medium for communication between the server 200 and the terminal 400, may be a wide area network or a local area network, or a combination of both.
The terminal 400 is used for running a client 410, where the client 410 is a client with a text reading function, such as a news client, a browser, or a novel client. The client 410 is configured to send a text reading request to the server 200 in response to a user's text reading operation; it is further configured to receive the text and the characters in the text sent by the server 200 and present the characters on the human-computer interaction interface, for example, as a character list in the text introduction, helping the user quickly understand the novel.
As one example, the method can be used in a scenario of presenting a character relationship graph. The server 200 performs character recognition on the text to obtain a plurality of characters, determines the relationships among the characters, automatically constructs a character relationship graph for the text, and sends the graph to the client 410; the client 410 displays the character relationship graph on the human-computer interaction interface, providing a reference for the user's reading and text analysis.
As another example, the method may be used in a text recommendation scenario. The server 200 performs character recognition on the text to obtain a plurality of characters and determines the persona attributes (such as personality, skills, and the like) of each character; the server 200 then acquires user information (for example, a user portrait) sent by the client 410, determines the characters matching the user information according to their persona attributes, and recommends texts containing those characters to the user, thereby improving the efficiency of text recommendation.
The embodiments of the present application may be implemented by means of cloud technology, which refers to a hosting technology that unifies a series of resources, such as hardware, software, and network, in a wide area network or a local area network to implement data computing, storage, processing, and sharing.
The cloud technology is a general term of network technology, information technology, integration technology, management platform technology, application technology and the like applied based on a cloud computing business model, can form a resource pool, is used as required, and is flexible and convenient. Cloud computing technology will become an important support. Background services of the technical network system require a large amount of computing and storage resources.
As an example, the server 200 may be an independent physical server, may be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be various types of user terminals such as a smart phone, a tablet computer, a vehicle-mounted terminal, an intelligent wearable device, a notebook computer, and a desktop computer. The terminal 400 and the server 200 may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited thereto.
Next, the structure of the server 200 in fig. 1 is explained. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 240, and at least one network interface 220. The various components in server 200 are coupled together by a bus system 230. It is understood that the bus system 230 is used to enable connected communication between these components. The bus system 230 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 230 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 240 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 240 described in the embodiments herein is intended to comprise any suitable type of memory. The memory 240 optionally includes one or more storage devices physically located remotely from the processor 210.
In some embodiments, memory 240 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, to support various operations, as exemplified below.
An operating system 241, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks; a network communication module 242 for communicating to other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the character recognition device in the text provided by the embodiments of the present application may be implemented in software, and fig. 2 illustrates the character recognition device 243 in the text stored in the memory 240, which may be software in the form of programs and plug-ins, and includes the following software modules: a first entity identification module 2431, a second entity identification module 2432, and a classification module 2433. These modules may be logical functional modules and thus may be arbitrarily combined or further divided according to the functions implemented. The functions of the respective modules will be explained below.
In the following, a role recognition method in the text provided by the embodiment of the present application is performed by the server 200 in fig. 1 as an example. Referring to fig. 3, fig. 3 is a flowchart illustrating a method for recognizing a character in text according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
In step S101, a plurality of character candidate words are extracted from the text.
In some embodiments, a plurality of words in the text are sequentially intercepted through a plurality of sliding windows with different lengths; and determining the relevance of each word, and determining the words with the relevance exceeding a relevance threshold as character candidate words.
Continuing the example of fig. 6, which is a schematic structural diagram of a machine learning model provided in an embodiment of the present application: a plurality of words are sequentially intercepted from the text by the first candidate role entity recognition model 601, the relevance of each word is determined, and the words whose relevance exceeds the relevance threshold are determined as role candidate words.
As an example, the first candidate role entity recognition model 601 may be a neural network model, which may be of various types, such as a Convolutional Neural Network (CNN) model, a Recurrent Neural Network (RNN) model, or an N-Gram language model.
As an example, the relevance is used to measure whether a word is a reasonable collocation. For example, in the phrase "funny movie roundup", "movie" is a more reasonable collocation than a fragment that crosses the word boundary, such as "movie round", so the relevance of "movie" is higher than that of "movie round". The relevance threshold may be a parameter obtained during the training of the machine learning model, or may be a value set by a user, a client, or a server.
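As a minimal sketch of the sliding-window extraction described above (the function name and thresholds are hypothetical, and raw n-gram frequency stands in for the model-computed relevance score):

```python
from collections import Counter

def extract_candidates(sentence, window_sizes=(2, 3), min_count=2):
    # Slide windows of several lengths over the sentence and count
    # every n-gram; frequency stands in for the relevance score here.
    counts = Counter()
    for n in window_sizes:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    # Keep only n-grams whose score meets the threshold.
    return {w: c for w, c in counts.items() if c >= min_count}

print(extract_candidates("abcabcabd", window_sizes=(2,), min_count=2))
# {'ab': 3, 'bc': 2, 'ca': 2}
```

A real implementation would replace the count threshold with the relevance score produced by the first candidate role entity recognition model.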
In some embodiments, text is obtained and the following pre-processing is performed on the text: dividing the text into a plurality of sentences according to the symbol list, and filtering out symbols in each sentence; a plurality of character candidate words are extracted from each sentence of the preprocessed text.
By way of example, the symbol list includes a plurality of punctuation symbols that may divide the sentence, such as periods, semicolons, commas, question marks, exclamation marks, and the like.
Taking the text being a novel as an example: a novel is an extremely long text containing various symbols, so some preprocessing is needed before the subsequent role candidate word extraction. Taking fig. 7 as an example, the new word discovery model in fig. 7 is an example of the first candidate role entity recognition model 601 in fig. 6, the named entity recognition model in fig. 7 is an example of the second candidate role entity recognition model 602 in fig. 6, and a text preprocessing module may be added before the new word discovery model and the named entity recognition model. The text preprocessing module comprises a symbol processing module and a sentence segmentation module: the sentence segmentation module cuts the novel text into a plurality of sentences according to a pre-built symbol list, and the symbol processing module filters the symbols out of the segmented sentences.
For example, take the text "The weather is really nice today, Xiao Ming wants to ask Xiao Hong to go out. But Xiao Hong has to do homework today, so she turned down Xiao Ming" and a symbol list containing the period as an example: the text is first divided into the two sentences "The weather is really nice today, Xiao Ming wants to ask Xiao Hong to go out" and "But Xiao Hong has to do homework today, so she turned down Xiao Ming", and then the symbols in the two sentences are filtered out, obtaining "The weather is really nice today Xiao Ming wants to ask Xiao Hong to go out" and "But Xiao Hong has to do homework today so she turned down Xiao Ming".
In the embodiment of the present application, preprocessing the text avoids the interference of symbols with the role candidate word extraction process; both paragraph- and sentence-level input and complete texts (such as novels) can be processed, so the compatibility is strong, and the speed and accuracy of extracting the plurality of role candidate words from the text can be improved.
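A minimal sketch of this preprocessing, assuming a hypothetical symbol list; sentences are split on the delimiter symbols and remaining punctuation is filtered out of each sentence:

```python
import re

# Illustrative symbol list: delimiters that end a sentence.
SENTENCE_DELIMITERS = ".;?!\u3002\uff1b\uff1f\uff01"

def preprocess(text):
    # Split the text into sentences on the delimiter symbols.
    pattern = "[" + re.escape(SENTENCE_DELIMITERS) + "]"
    sentences = [s.strip() for s in re.split(pattern, text) if s.strip()]
    # Filter remaining symbols (anything that is not a word
    # character or whitespace) out of each sentence.
    return [re.sub(r"[^\w\s]", "", s) for s in sentences]

print(preprocess("The weather is nice today, Xiao Ming wants to go out. "
                 "But Xiao Hong is busy today!"))
# ['The weather is nice today Xiao Ming wants to go out',
#  'But Xiao Hong is busy today']
```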
In step S102, at least one matching parameter corresponding to each character candidate word is obtained.
Continuing the example of fig. 6, at least one matching parameter corresponding to each role candidate word is obtained through the first candidate role entity recognition model 601, where the types of the matching parameters include word frequency, degree of solidity, and degree of freedom.
In some embodiments, referring to fig. 4, fig. 4 is a flowchart illustrating a role identification method in text provided in an embodiment of the present application, and based on fig. 3, step S102 may include steps S1021 to S1024.
In step S1021, the word frequency of the character candidate word in the text is determined.
In some embodiments, word frequency refers to the number of times a given word appears in the text.
Taking the text "The weather is really nice today, Xiao Ming wants to ask Xiao Hong to go out. But Xiao Hong has to do homework today, so she turned down Xiao Ming" and the role candidate words "Xiao Ming" and "Xiao Hong" as an example, the word frequency of "Xiao Ming" is 2, and the word frequency of "Xiao Hong" is 2.
In step S1022, the character candidate word is divided into a plurality of morphemes, and the occurrence probability of each morpheme in the text and the occurrence probability of the character candidate word in the text are determined.
As an example, morphemes include single characters and words. Taking the role candidate word "Xiao Ming" as an example, "Xiao Ming" can be divided into the two morphemes "Xiao" and "Ming"; taking the role candidate word "football field" as an example, "football field" can be divided into the two morphemes "football" and "field", or into the two morphemes "foot" and "ball field".
In some embodiments, the character candidate words are divided into morphemes, and the following processing is performed for each morpheme: determining the ratio of the occurrence frequency of the morphemes in the text to the occurrence frequency of all the role candidate words in the text as the occurrence probability of the morphemes in the text; and determining the ratio of the occurrence times of the role candidate words in the text to the occurrence times of all the role candidate words in the text as the occurrence probability of the role candidate words in the text.
For example, take the text "The weather is really nice today, Xiao Ming wants to ask Xiao Hong to go out. But Xiao Hong has to do homework today, so she turned down Xiao Ming" and the role candidate words "Xiao Ming" and "Xiao Hong" as an example: "Xiao Ming" may be divided into the two morphemes "Xiao" and "Ming", and "Xiao Hong" into the two morphemes "Xiao" and "Hong". The morpheme "Xiao" occurs 4 times in the text, the morpheme "Ming" occurs 2 times, and the morpheme "Hong" occurs 2 times; the role candidate word "Xiao Ming" occurs 2 times, and all role candidate words together occur 4 times. Thus, the occurrence probability of the morpheme "Xiao" in the text is 4/4 = 1, the occurrence probability of the morpheme "Ming" in the text is 2/4 = 0.5, and the occurrence probability of the role candidate word "Xiao Ming" in the text is 2/4 = 0.5.
In step S1023, the degree of solidity of the character candidate words is determined according to the occurrence probability of each morpheme in the text and the occurrence probability of the character candidate words in the text.
In some embodiments, the occurrence probability of each morpheme in the text is multiplied to obtain a multiplication result; determining the ratio of the occurrence probability of the role candidate words in the text to the product result as a third ratio; and carrying out logarithmic operation processing on the third ratio, and determining the logarithmic operation result as the degree of solidification of the role candidate words.
For example, take the text "The weather is really nice today, Xiao Ming wants to ask Xiao Hong to go out. But Xiao Hong has to do homework today, so she turned down Xiao Ming" and the role candidate words "Xiao Ming" and "Xiao Hong" as an example: the occurrence probability of the morpheme "Xiao" in the text is 4/4 = 1, the occurrence probability of the morpheme "Ming" in the text is 2/4 = 0.5, and the occurrence probability of the role candidate word "Xiao Ming" in the text is 2/4 = 0.5, so the degree of solidity of the role candidate word "Xiao Ming" is 0.5/(1 × 0.5) = 1.
In some embodiments, a high degree of solidity indicates that the probability of the two morphemes co-occurring is much greater than the product of their probabilities when freely concatenated, meaning that the collocation of the two morphemes is more reasonable. A role candidate word may allow multiple splits: for example, the role candidate word "football field" may be composed of the morpheme "football" and the morpheme "field", or of the morpheme "foot" and the morpheme "ball field". In this case, the minimum degree of solidity over all splits of the role candidate word can be taken as its degree of solidity.

For example, when the role candidate word "football field" is divided into the morphemes "football" and "field", its degree of solidity is 0.7; when it is divided into the morphemes "foot" and "ball field", its degree of solidity is 0.6. In this case, the degree of solidity of the role candidate word "football field" may be determined to be 0.6.
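A sketch of the degree-of-solidity computation (function and argument names are hypothetical), following the logarithmic definition in step S1023 and taking the minimum over all two-way splits:

```python
import math

def solidity(word, prob):
    # prob maps a string to its occurrence probability in the text.
    # For each two-way split, compare the word's probability with the
    # product of the probabilities of its two morphemes, take the log,
    # and keep the minimum over all splits.
    scores = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        scores.append(math.log(prob[word] / (prob[left] * prob[right])))
    return min(scores)

# Toy probabilities mirroring the "Xiao Ming" example above
# ("X" for "Xiao", "M" for "Ming"):
probs = {"XM": 0.5, "X": 1.0, "M": 0.5}
print(solidity("XM", probs))  # log(0.5 / (1.0 * 0.5)) = log(1) = 0.0
```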
In step S1024, the left information entropy and the right information entropy of the character candidate word are determined, and the degree of freedom of the character candidate word is determined according to the left information entropy and the right information entropy of the character candidate word.
In some embodiments, a plurality of left-adjacent characters and a plurality of right-adjacent characters of the role candidate word in the text are determined; the sub information entropy corresponding to each left-adjacent character and the sub information entropy corresponding to each right-adjacent character are determined; the negative of the sum of the sub information entropies of the left-adjacent characters is determined as the left information entropy, and the negative of the sum of the sub information entropies of the right-adjacent characters is determined as the right information entropy; and when the left information entropy is larger than the right information entropy, the right information entropy is determined as the degree of freedom, and when the left information entropy is not larger than the right information entropy, the left information entropy is determined as the degree of freedom.
For example, take the text "The weather is really nice today, Xiao Ming wants to ask Xiao Hong to go out. But Xiao Hong has to do homework today, so she turned down Xiao Ming" and the role candidate words "Xiao Ming" and "Xiao Hong" as an example: the left-adjacent words of the role candidate word "Xiao Ming" in the text include "today" and "down", and its right-adjacent word is "wants"; the left-adjacent words of the role candidate word "Xiao Hong" include "ask" and "But", and its right-adjacent words include "to" and "has".
As an example, determining the sub-information entropy corresponding to each left adjacent word may include: the following processing is performed for each left adjacent word: determining the ratio of the occurrence frequency of the left adjacent character in the text to the occurrence frequency of all adjacent characters of the character candidate words in the text as a first ratio; and carrying out logarithmic operation processing on the first ratio, and determining the product of the logarithmic operation result and the first ratio as the sub information entropy corresponding to the left adjacent word.
As an example, determining the sub-information entropy corresponding to each right adjacent word may include: the following processing is performed for each right adjacent word: determining the ratio of the occurrence frequency of the right adjacent character in the text to the occurrence frequency of all adjacent characters of the character candidate words in the text as a second ratio; and carrying out logarithm operation processing on the second ratio, and determining the product of the logarithm operation result and the second ratio as the sub information entropy corresponding to the right adjacent word.
For example, the degrees of freedom of the character candidate words are used to measure the richness of the left-adjacent characters and the right-adjacent characters of one character candidate word, and may be calculated by using formula (1).
free(w_i) = min{ le(w_i), re(w_i) }  (1)

where le(w_i) denotes the left information entropy of the role candidate word w_i and re(w_i) denotes its right information entropy, computed as follows:

le(w_i) = − Σ_x p(x) log p(x)  (2)

re(w_i) = − Σ_y p(y) log p(y)  (3)

where x ranges over the left-adjacent characters of the role candidate word w_i, and p(x) denotes the probability of x among all adjacent characters (i.e., the number of times x appears in the text divided by the number of times all adjacent characters appear in the text); y ranges over the right-adjacent characters of the role candidate word w_i, and p(y) denotes the probability of y among all adjacent characters (i.e., the number of times y appears in the text divided by the number of times all adjacent characters appear in the text).
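Formulas (1) through (3) can be sketched directly (function names are hypothetical; the neighbor lists are the adjacent characters gathered from the text):

```python
import math
from collections import Counter

def entropy(neighbors):
    # Shannon entropy of a list of adjacent characters: the negative
    # sum of p * log(p) over each distinct neighbor, matching the
    # left/right information entropy formulas.
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(left_neighbors, right_neighbors):
    # free(w) = min{le(w), re(w)} as in formula (1).
    return min(entropy(left_neighbors), entropy(right_neighbors))

# A word always followed by the same character has right entropy 0,
# so its degree of freedom is 0 regardless of the left side.
print(degree_of_freedom(["a", "b"], ["c", "c"]))  # 0.0
```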
In step S103, at least one role candidate word is selected from the plurality of role candidate words as a first candidate role entity according to at least one matching parameter corresponding to each role candidate word.
In some embodiments, a role candidate word that satisfies at least one of the following conditions is selected as the first candidate role entity from among the plurality of role candidate words: the word frequency of the character candidate words in the text exceeds a word frequency threshold, the degree of solidification of the character candidate words exceeds a degree of solidification threshold, and the degree of freedom of the character candidate words exceeds a degree of freedom threshold.
Continuing the example of fig. 6, a role candidate word satisfying at least one of the following conditions is selected as a first candidate role entity from the plurality of role candidate words through the first candidate role entity identification model 601: the word frequency of the role candidate word in the text exceeds the word frequency threshold, the degree of solidity of the role candidate word exceeds the solidity threshold, and the degree of freedom of the role candidate word exceeds the degree-of-freedom threshold.
For example, the word frequency threshold, the freezing degree threshold, and the degree of freedom threshold may be parameters obtained during the training process of the machine learning model, or may be values set by a user, a client, or a server.
If the role candidate words are filtered only by word frequency, some role candidate words that are frequent but incomplete are liable to be taken as first candidate role entities. Therefore, selecting role candidate words according to their degree of freedom, degree of solidity, and word frequency ensures the integrity of the identified first candidate role entities, thereby further improving the efficiency and accuracy of identifying roles from the text.
In step S104, a question and a text corresponding to the text are fused to obtain a fused text.
Continuing the example of fig. 6, a question corresponding to the text and the text itself are subjected to fusion processing (for example, concatenation) by the second candidate role entity recognition model 602, so as to obtain a fused text.
For example, the second candidate character entity recognition model 602 may be a neural network model, which may include various types, such as a convolutional neural network model, a recurrent neural network model, and a BERT-MRC model, among others.
For example, first, a question query corresponding to the text is constructed, forming a word sequence Q = q1 q2 … qN, where qi denotes the i-th word in the query and N is the length of the question. Considering that the embodiment of the present application only needs to extract candidate role entities and does not need to classify entity types, the constructed question may be "an entity refers to a concrete object that exists objectively, and generally refers to an actually existing and functional person, organization, institution, etc." The text is formally represented as a word sequence W = w1 w2 … wM, where wi denotes the i-th word in the text and M is the length of the text. The question query and the text are concatenated to form the fused text, formally represented as [CLS] q1 q2 … qN [SEP] w1 w2 … wM [SEP].
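The concatenation step can be sketched as follows (the token lists are illustrative; a real implementation would use the recognition model's own tokenizer):

```python
def build_fused_text(query_tokens, text_tokens):
    # [CLS] q1 ... qN [SEP] w1 ... wM [SEP]
    return ["[CLS]"] + query_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]

fused = build_fused_text(["find", "the", "entities"],
                         ["Xiao", "Ming", "went", "out"])
print(fused)
# ['[CLS]', 'find', 'the', 'entities', '[SEP]',
#  'Xiao', 'Ming', 'went', 'out', '[SEP]']
```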
In the embodiment of the application, because the constructed question does not include the type of the entity, the fusion speed of the question and the text can be increased, and the speed of subsequently extracting the second candidate role entity can also be increased.
In some embodiments, the following pre-processing is performed on the text: dividing the text into a plurality of sentences according to the symbol list, and filtering out symbols in each sentence; and performing fusion processing on the preprocessed text and the question sentence corresponding to the text to obtain a fusion text. Here, the process of performing the preprocessing on the text is similar to the process of performing the preprocessing on the text in step S101, and will not be described again here.
In step S105, entity identification processing is performed on the fused text to obtain at least one second candidate role entity.
Continuing the example of fig. 6, entity recognition processing is performed on the fused text through the second candidate role entity recognition model 602, so as to obtain at least one second candidate role entity.
In some embodiments, referring to fig. 5, fig. 5 is a flowchart illustrating a role identification method in the text provided in the embodiment of the present application, and based on fig. 3, step S105 may include steps S1051 to S1053.
In step S1051, feature extraction processing is performed on the fused text to obtain a feature sequence.
In some embodiments, the fused text is divided into a plurality of morphemes, each morpheme is subjected to feature extraction processing to obtain a plurality of sub-features corresponding to the plurality of morphemes one by one, and the plurality of sub-features are combined to obtain a feature sequence.
As an example, the fused text is subjected to feature extraction processing to obtain a feature sequence, represented as H = h1 h2 … hL, where hi ∈ R^d denotes the output vector of the i-th node (i.e., the morpheme described above) and d denotes the vector dimension.
In step S1052, the feature sequence is mapped to obtain at least one position set.
Here, the position set includes a start position and an end position.
In some embodiments, the feature sequence is divided into a plurality of sub-features, wherein the plurality of sub-features correspond to a plurality of words in the text one to one; mapping each sub-feature to a start probability belonging to a start position and an end probability belonging to an end position; selecting at least one sub-feature with the starting probability larger than the starting probability threshold as a starting sub-feature, and selecting at least one sub-feature with the ending probability larger than the ending probability threshold as an ending sub-feature; constructing at least one candidate start-stop feature set based on the selected at least one start sub-feature and at least one end sub-feature, wherein the candidate start-stop feature set comprises a start sub-feature and an end sub-feature; determining a target start-stop feature set in at least one candidate start-stop feature set; determining characters corresponding to the starting sub-features in the target starting and stopping feature set as starting positions in the position set, and determining characters corresponding to the ending sub-features in the target starting and stopping feature set as ending positions in the position set.
As an example, determining the target start-stop feature set among the at least one candidate start-stop feature set may include: performing the following for each candidate start-stop feature set: performing fusion processing on the initial sub-feature and the ending sub-feature in the candidate start-stop feature set to obtain a first fusion feature, and mapping the first fusion feature to the probability of belonging to the same entity; and in at least one candidate start-stop feature set, determining the candidate start-stop feature set with the probability of belonging to the same entity larger than the entity probability threshold as a target start-stop feature set.
As an example, the start probability threshold, the end probability threshold, and the entity probability threshold may be parameters obtained during training of the machine learning model, or may be values set by a user, a client, or a server.
For example, two tag sequences of length L may be used to determine the start positions and end positions of the second candidate role entities, denoted y^s = (y^s_1, …, y^s_L) and y^e = (y^e_1, …, y^e_L), where y^s_i and y^e_i respectively indicate whether position i is a start position and an end position of an entity. In addition, an L × L matrix Y^span may be used, whose element y^span_{i,j} indicates whether the character string with start position i and end position j is an entity. y^s and y^e are computed as follows:

P^s = Softmax(H · T_b)  (4)

P^e = Softmax(H · T_e)  (5)

y^s_i = argmax(P^s_i)  (6)

y^e_j = argmax(P^e_j)  (7)

where P^s ∈ R^{L×2} denotes the probability that each position is a start position, P^e ∈ R^{L×2} denotes the probability that each position is an end position, and T_b, T_e ∈ R^{d×2} are model parameters. Based on y^s and y^e, y^span_{i,j} is computed as follows:

P^span_{i,j} = Sigmoid([h_i; h_j] · T_span)  (8)

y^span_{i,j} = 1 if y^s_i = 1, y^e_j = 1, and P^span_{i,j} exceeds the entity probability threshold; otherwise y^span_{i,j} = 0  (9)

where [h_i; h_j] denotes the fusion (e.g., concatenation) of the output vectors at positions i and j, and T_span is a model parameter. Finally, the character strings for which y^span_{i,j} = 1 are output as the second candidate role entities.
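The start/end selection and span pairing described in step S1052 can be sketched as follows; `span_prob` stands in for the classifier over the fused start and end sub-features, and all names and thresholds are illustrative:

```python
def decode_spans(start_probs, end_probs, span_prob,
                 start_th=0.5, end_th=0.5, span_th=0.5):
    # Select positions whose start/end probabilities exceed their
    # thresholds, pair every start with every end at or after it,
    # and keep pairs whose span probability exceeds the entity
    # probability threshold.
    starts = [i for i, p in enumerate(start_probs) if p > start_th]
    ends = [j for j, p in enumerate(end_probs) if p > end_th]
    return [(i, j) for i in starts for j in ends
            if i <= j and span_prob(i, j) > span_th]

spans = decode_spans([0.9, 0.1, 0.8], [0.1, 0.9, 0.9],
                     span_prob=lambda i, j: 0.9)
print(spans)  # [(0, 1), (0, 2), (2, 2)]
```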
In step S1053, the following processing is performed for each position set: combining the characters corresponding to the starting position in the position set, the characters between the starting position and the ending position in the position set and the characters corresponding to the ending position in the position set, and determining a combination result as a second candidate role entity.
Taking the text "Xiao A saw Mu BC drinking in the pub" as an example, a start position in the position set is "Mu" and the corresponding end position is "C", so "Mu BC" can be determined to be a second candidate role entity.
Compared with the method for identifying the roles in the text through a universal named entity identification model, the method for identifying the second candidate role entity can solve the problem that the role entity is identified incompletely due to the particularity of the role type, and therefore the integrity of the identified second candidate role entity is improved.
In some embodiments, the following process may also be performed for each second candidate role entity after step S105: dividing the second candidate role entity into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate role entity in the text; determining the degree of solidification of the second candidate role entity according to the probability of occurrence of each morpheme in the text and the probability of occurrence of the second candidate role entity in the text; determining the left information entropy and the right information entropy of the second candidate role entity, and determining the degree of freedom of the second candidate role entity according to the left information entropy and the right information entropy of the second candidate role entity; filtering, among the plurality of second candidate role entities, second candidate role entities that satisfy at least one of the following conditions: the word frequency of the second candidate role entity in the text does not exceed the word frequency threshold, the degree of solidification of the second candidate role entity does not exceed the degree of solidification threshold, and the degree of freedom of the second candidate role entity does not exceed the degree of freedom threshold.
As an example, the process of filtering the second candidate role entities according to the matching parameters here is similar to the process of selecting the first candidate role entities according to the matching parameters in step S103, and will not be described again.
In the embodiment of the present application, considering that the extracted second candidate role entities may be incomplete, the matching parameters obtained in step S102 are used to filter the second candidate role entities, which can further improve the efficiency and accuracy of identifying roles from the text.
In step S106, a role classification process is performed based on at least one first candidate role entity and at least one second candidate role entity, so as to obtain a role in the text.
Continuing the example of fig. 6, the role classification model 603 performs role classification processing based on the at least one first candidate role entity and the at least one second candidate role entity to obtain the roles in the text.
For example, the role classification model 603 may be a neural network model, which may be of various types, such as a convolutional neural network model, a recurrent neural network model, or a bidirectional Long Short-Term Memory network (BiLSTM).
In some embodiments, duplicate candidate role entities are filtered out from the at least one first candidate role entity and the at least one second candidate role entity (for example, by taking the union of the two sets); and the following processing is executed for each candidate role entity obtained after filtering: the sentence in which the candidate role entity is located is determined, and the candidate role entity and the sentence are combined to obtain an entity-sentence pair; entity features and text features are extracted from the entity-sentence pair, and the entity features and the text features are subjected to fusion processing (for example, concatenation) to obtain a second fusion feature; the second fusion feature is mapped to a probability of belonging to a role entity; and when the probability of belonging to a role entity is greater than the role probability threshold and the word frequency of the candidate role entity in the text is greater than the role word frequency threshold, the candidate role entity is determined to be a role in the text.
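The decision logic above can be sketched as follows; the role classifier is abstracted as a probability function, and all names and thresholds are illustrative:

```python
def classify_roles(first_candidates, second_candidates,
                   role_prob, word_freq, prob_th=0.5, freq_th=2):
    # Take the union of the two candidate sets to drop duplicates,
    # then keep a candidate as a role only when both its classifier
    # probability and its word frequency exceed the thresholds.
    roles = []
    for entity in sorted(set(first_candidates) | set(second_candidates)):
        if role_prob(entity) > prob_th and word_freq.get(entity, 0) > freq_th:
            roles.append(entity)
    return roles

roles = classify_roles(
    ["Xiao Ming", "Xiao Hong"], ["Xiao Ming", "pub"],
    role_prob=lambda e: 0.9 if e.startswith("Xiao") else 0.1,
    word_freq={"Xiao Ming": 5, "Xiao Hong": 1, "pub": 4})
print(roles)  # ['Xiao Ming']  (Xiao Hong fails the frequency check)
```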
As an example, extracting entity features and text features from the entity sentence pair may include: extracting a plurality of word vectors from the entity sentence pair, and determining the average value among the word vectors as an entity characteristic; coding the entity statement according to the direction from the starting position to the ending position to obtain a forward coding vector; coding the entity statement according to the direction from the end position to the start position to obtain a backward coding vector; and carrying out fusion processing on the forward coding vector and the backward coding vector to obtain text characteristics.
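The union-and-filter flow and the final decision rule above can be sketched as follows; the function names and threshold values are illustrative assumptions, not part of the described model:

```python
def merge_candidates(first_candidates, second_candidates):
    """Take the union of the two candidate sets, dropping duplicates."""
    return sorted(set(first_candidates) | set(second_candidates))

def decide_role(candidate, text, role_probability,
                prob_threshold=0.5, freq_threshold=2):
    """Accept a candidate as a role only when both its classifier
    probability and its word frequency in the text clear the thresholds."""
    word_freq = text.count(candidate)
    return role_probability > prob_threshold and word_freq > freq_threshold

merged = merge_candidates(["Alice", "Bob"], ["Bob", "Carol"])
# merged == ["Alice", "Bob", "Carol"]
```

The classifier probability here would come from the role classification model; the sketch only shows how the two gates are combined.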
As an example, referring to fig. 7, the following processing is performed for each candidate role entity obtained after filtering: all <entity, text> pairs are constructed by using the candidate role entities and the text, wherein the entity represents a candidate role entity and the text represents the sentence in which the candidate role entity is located. The category of each candidate role entity is then obtained by utilizing the pre-trained novel role entity classification model. The novel role entity classification model takes an <entity, text> pair as input, formally represented as <M, C>, wherein M represents the candidate role entity and C represents the novel sentence in which M is located.
For example, the novel role entity classification model extracts the entity feature and the text feature of each input, formally represented as Hm and Hc. The feature extraction process is as follows: first, the vector matrices of the input are obtained by using pre-trained word vectors, denoted Xm and Xc. The entity feature is obtained by directly averaging all word vectors of the entity; the text feature can be extracted by using a BiLSTM. The specific calculation is as follows:

Hm=(1/Lm)·Σi Xm,i (10)

Hc=BiLSTM(Xc) (11)

wherein Lm is the length of the entity.
For example, the entity feature and the text feature are concatenated as the classification feature H=[Hm;Hc] (i.e., the second fusion feature described above), and a Softmax layer is used to determine whether the entity is a novel role. The classification result is expressed as:

P=Softmax(H·W+b) (12)

ŷ=argmax(P) (13)

wherein P∈R2 represents the probability that the candidate role entity is a novel role, W and b are model parameters, and ŷ is the category of the entity predicted by the model (i.e., whether or not the candidate role entity is a novel role). Thus, the roles in the text can be obtained.
According to the method and the device, the roles in the candidate role entities are identified through the independent role classification model, the integrity of the identified roles can be further ensured, and therefore the efficiency and the accuracy of identifying the roles from the text are improved.
The following describes the role identification method in text provided by the embodiment of the present application, taking a novel text as an example.
A novel generally contains a plurality of different roles, and the roles are interrelated. Role recognition technology can be used to extract the different roles in a novel and display the extracted roles as a character list in the novel introduction, thereby speeding up the user's understanding of the novel. It can also be used to determine the relationships among the extracted roles, thereby automatically constructing a character relationship map of the novel and providing a reference for the user's reading and for text analysis. It can also be used to determine the character attributes of the extracted roles, determine the role matching a user portrait according to those attributes, and recommend novels containing that role to the user, thereby improving recommendation efficiency. It can also be used to establish a mapping relationship between novels and their roles, so that when a user searches for a role through a search engine, the novels having a mapping relationship with the searched role can be directly displayed to the user, improving the hit rate of the search.
The embodiment of the application mainly includes a candidate role entity identification module (including a new word discovery model and a named entity identification model) and a novel role entity classification model, and specifically, for a novel text, a new word discovery model (i.e., the first candidate role entity identification model 601) and a named entity identification model (i.e., the second candidate role entity identification model 602) are used to extract candidate role entities, and a novel role entity classification model (i.e., the role classification model 603) is used to extract the novel role entities from the candidate role entities.
The embodiment of the application uses the new word discovery model and the named entity recognition model simultaneously to solve the problems of missing and incomplete entities extracted in the related technology, and uses an independent novel role entity classification model to effectively improve the recall rate and the accuracy rate of the extracted novel roles.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a role recognition framework provided in the embodiment of the present application, where the role recognition framework includes a text preprocessing module, an entity recognition module, a novel role entity classification model, and an entity frequency statistics module, which will be separately described below.
(I) Text preprocessing module
The novel text is extremely long and contains various symbols, so some preprocessing is needed before the subsequent identification process. The text preprocessing module comprises a symbol processing module and a text sentence cutting module. The text sentence cutting module cuts the novel text into a plurality of sentences according to a pre-built symbol list. The symbol processing module cleans the symbols in the novel text; because the new words found by the new word discovery model do not contain symbols, the various symbols in the text need to be filtered out.
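A minimal sketch of the described preprocessing, i.e., sentence cutting by a symbol list followed by symbol cleaning; the concrete symbol lists below are illustrative assumptions, not the pre-built lists of the embodiment:

```python
import re

# Illustrative symbol lists (assumptions): sentence-ending symbols
# for cutting, and symbols to strip from each sentence afterwards.
SENTENCE_DELIMITERS = "。！？!?."
FILTERED_SYMBOLS = re.compile(r"[,:;“”\"'()（）…]")

def split_sentences(text):
    """Cut the novel text into sentences according to the symbol list."""
    parts = re.split("[" + re.escape(SENTENCE_DELIMITERS) + "]", text)
    return [p.strip() for p in parts if p.strip()]

def clean_symbols(sentence):
    """Filter out the remaining symbols inside a sentence."""
    return FILTERED_SYMBOLS.sub("", sentence)
```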
(II) entity recognition module
The novel body contains a large number of entities of various types, such as personas, organizations, places, etc. In order to guarantee both the number of extracted novel characters (extracting all characters in the novel as far as possible) and their quality (ensuring the accuracy of the extracted characters is as high as possible), all candidate character entities are first extracted from the novel text; specifically, the new word discovery model and the named entity recognition model are used simultaneously to extract the candidate character entities in the novel.
(1) New word discovery model
The new word discovery model aims at exploring some linguistic features through an unsupervised approach to determine which character strings in the novel text are likely entities.
In some embodiments, a novel text D = {s1, s2, …, sN} composed of a set of sentences is given, wherein si denotes the ith sentence and N denotes the number of sentences. First, all role candidate words are acquired by using a Chinese language model (N-Gram), and the word frequency of each role candidate word is counted. N-Gram refers to taking all character strings of a particular length using a sliding window. This embodiment formalizes all character candidate words as W = {(w1, n1), …, (wi, ni), …, (wM, nM)}, wherein (wi, ni) denotes the ith character candidate word and its word frequency, and M represents the number of character candidate words.
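The N-Gram candidate-word step can be sketched as a sliding-window frequency count over the sentences; the function name and the `max_n` window bound are illustrative assumptions:

```python
from collections import Counter

def ngram_candidates(sentences, max_n=4):
    """Slide windows of length 1..max_n over each sentence and count
    every substring, yielding the (w_i, n_i) candidate-word pairs."""
    counts = Counter()
    for s in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts

counts = ngram_candidates(["abab"], max_n=2)
# "ab" occurs at positions 0 and 2, so counts["ab"] == 2
```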
As an example, filtering the character candidate words using word frequency alone easily treats some character candidate words that are frequent but incomplete as candidate character entities. Therefore, the degree of freedom and the degree of solidification of the character candidate words are calculated as well.
In some embodiments, the degree of solidification is used to measure whether the collocation of the components of a character candidate word is reasonable, i.e., whether its two parts x and y co-occur far more often than chance would predict. The degree of solidification is calculated as follows:

solid(x,y)=log(p(x,y)/(p(x)·p(y))) (14)

wherein p(x) represents the probability of x appearing in the text (i.e., the number of times x appears in the text / the number of times all character candidate words appear in the text); p(y) represents the probability of y appearing in the text (defined analogously); and p(x,y) represents the probability of the concatenation xy appearing in the text (i.e., the number of times xy appears in the text / the number of times all character candidate words appear in the text).
A high degree of solidification indicates that the probability of two parts co-occurring is far greater than the product of the probabilities of the parts being freely spliced together, so that their collocation is more reasonable. If a word admits several split combinations (for example, "football field" can be split as "football" + "field" or as "foot" + "ball field"), the minimum degree of solidification over all combinations is taken as the degree of solidification of the whole word. The formalized representation is:

solid(wi)=min_k log(p(wi)/(p(wi,kl)·p(wi,kr))) (15)

wherein wi,kl represents the left part of the kth split combination of character candidate word wi, and wi,kr represents the corresponding right part.
In some embodiments, the degree of freedom is used to measure the richness of the left-adjacent character and the right-adjacent character of a word, and the degree of freedom can be represented by using the information entropy, and the calculation method is as follows:
free(wi)=min{le(wi),re(wi)} (16)
wherein le(wi) represents the left information entropy of character candidate word wi, and re(wi) represents the right information entropy, calculated as follows:

le(wi)=−Σx p(x)·log p(x) (17)

re(wi)=−Σy p(y)·log p(y) (18)

wherein x represents a left adjacent character of character candidate word wi, and p(x) represents the probability of x among all adjacent characters (i.e., the number of times x appears as a left adjacent character in the text / the number of times all adjacent characters appear in the text); y represents a right adjacent character of wi, and p(y) represents the probability of y among all adjacent characters, defined analogously.
In some embodiments, the character candidate words are filtered through a preset word frequency threshold, solidification degree threshold and freedom degree threshold; for example, the character candidate words with the word frequency not exceeding the word frequency threshold, the degree of solidification not exceeding the solidification degree threshold and the degree of freedom not exceeding the freedom degree threshold are filtered out, so that candidate character entities can be screened out from the character candidate words.
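The solidification and freedom scores defined above can be sketched as follows; all function names are illustrative assumptions, `counts` is the candidate-word frequency Counter from the N-Gram step, and the probability estimates follow the counting conventions described above:

```python
import math
from collections import Counter

def solidification(word, counts):
    """Minimum over all left/right splits of log(p(word)/(p(left)p(right)))."""
    total = sum(counts.values())
    p = lambda w: counts[w] / total
    scores = []
    for k in range(1, len(word)):
        left, right = word[:k], word[k:]
        if counts[left] and counts[right]:
            scores.append(math.log(p(word) / (p(left) * p(right))))
    return min(scores) if scores else 0.0

def side_entropy(neighbors):
    """Information entropy of the left- or right-adjacent characters."""
    total = sum(neighbors.values())
    return -sum((n / total) * math.log(n / total) for n in neighbors.values())

def freedom(word, text):
    """min of left and right information entropy of the word's neighbors."""
    left = Counter(text[i - 1] for i in range(1, len(text) - len(word) + 1)
                   if text[i:i + len(word)] == word)
    right = Counter(text[i + len(word)] for i in range(len(text) - len(word))
                    if text[i:i + len(word)] == word)
    if not left or not right:
        return 0.0
    return min(side_entropy(left), side_entropy(right))
```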
In some embodiments, the new word discovery model is not limited to the specific model structure described above, and may be other neural network models, which are not described herein again.
(2) Named entity recognition model
In some embodiments, the named entity recognition model differs from traditional sequence-labeling-based methods: the embodiment of the present application adopts a named entity recognition model based on Machine Reading Comprehension (MRC); specifically, a BERT-MRC model can be used. A question is constructed for the input text, the question and the text are concatenated as the input, features are extracted by the BERT model, and whether a given character string is an entity is judged directly based on the extracted features.
As an example, a question (query) is first constructed, in the form of a word sequence Q = q1q2…qN, wherein qi denotes the ith word in the query and N denotes the length of the question. Considering that the embodiment of the present application only needs to extract candidate role entities and does not need entity type classification, the constructed question may be: "An entity refers to a concrete object that exists objectively, and generally refers to an actually existing and functional person, organization, institution, etc."
As an example, an input text is formalized as a word sequence W = w1w2…wM, wherein wi represents the ith word in the text and M represents the length of the text. The question and the text are concatenated as the BERT model input, formally expressed as [CLS]q1q2…qN[SEP]w1w2…wM[SEP]. Features are extracted by the BERT model and expressed as H = h1h2…hL, wherein hi∈Rd represents the output vector of the ith position and d represents the vector dimension.
As an example, based on the output of the BERT model, two tag sequences of length L may be used to determine the start and end positions of entities, denoted Ys = y1s…yLs and Ye = y1e…yLe, which indicate respectively whether position i is the start position or the end position of an entity. Also, an L × L matrix M may be used, wherein Mi,j indicates whether the character string with start position i and end position j is an entity. The calculation of Ys and Ye is as follows:

Ps=Softmax(H·Tb) (19)

Pe=Softmax(H·Te) (20)

yis=argmax(Pis) (21)

yie=argmax(Pie) (22)

wherein Ps∈RL×2 represents the probability that each position is a start position, Pe∈RL×2 represents the probability that each position is an end position, and Tb, Te∈Rd×2 are the model parameters.

Based on the predicted start positions and end positions, the calculation of Mi,j is as follows:

Pi,j=Sigmoid(Tm·[hi;hj]) (23)

Mi,j=1(Pi,j>τ) (24)

wherein Tm is a model parameter, [hi;hj] is the concatenation of the output vectors at positions i and j, Pi,j is the probability that the corresponding character string is one entity, and τ is the entity probability threshold. Finally, the character strings with Mi,j=1 are output as candidate role entities.
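The start/end prediction and span-matching steps just described can be sketched numerically as follows; the parameters Tb, Te and the fusion vector Tm are random stand-ins for learned weights, and the 0.5 match threshold is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 8
H = rng.normal(size=(L, d))                 # stand-in for BERT output features
T_b, T_e = rng.normal(size=(d, 2)), rng.normal(size=(d, 2))
T_m = rng.normal(size=2 * d)                # span-matching parameter (assumption)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

P_s = softmax(H @ T_b)                      # each position's start probability
P_e = softmax(H @ T_e)                      # each position's end probability
y_s = P_s.argmax(axis=-1)                   # predicted start tags
y_e = P_e.argmax(axis=-1)                   # predicted end tags

# For each predicted (start i, end j) pair with i <= j, fuse the two
# position features and map them to a probability of being one entity.
spans = []
for i in np.flatnonzero(y_s == 1):
    for j in np.flatnonzero(y_e == 1):
        if i <= j:
            p = 1.0 / (1.0 + np.exp(-(T_m @ np.concatenate([H[i], H[j]]))))
            if p > 0.5:
                spans.append((int(i), int(j)))
```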
In the embodiment of the application, considering that the candidate role entities extracted by the named entity recognition model may be incomplete, the candidate role entities generated by the named entity recognition module can be filtered by using the word frequency threshold, the solidification degree threshold and the freedom degree threshold calculated in the new word discovery module; for example, the candidate role entities whose word frequency does not exceed the word frequency threshold, whose degree of solidification does not exceed the solidification degree threshold and whose degree of freedom does not exceed the freedom degree threshold are filtered out.
In some embodiments, the named entity recognition model is not limited to the specific model structure described above, and may be other neural network models, which are not described herein again.
(III) Novel role entity classification model
In some embodiments, all < entity, text > pairs are constructed using the candidate character entities and the novel body, where the entities represent the candidate character entities and the text represents the novel sentences in which the candidate character entities are located. And obtaining the category of the candidate role entity by utilizing the pre-trained novel role entity classification model. The novel character entity classification model takes < entity, text > pair as input, and is formally represented as < M, C >, wherein M represents candidate character entities, and C represents novel sentences in which M is located.
By way of example, the novel role entity classification model extracts the entity feature and the text feature of each input, formally represented as Hm and Hc. The feature extraction process is as follows: first, the vector matrices of the input are obtained by using pre-trained word vectors, denoted Xm and Xc. The entity feature is obtained by directly averaging all word vectors of the entity; the text feature is extracted by using a BiLSTM. The specific calculation is as follows:

Hm=(1/Lm)·Σi Xm,i (25)

Hc=BiLSTM(Xc) (26)

wherein Lm is the length of the entity.
As an example, the entity feature and the text feature are concatenated as the classification feature H=[Hm;Hc] (i.e., the second fusion feature described above), and a Softmax layer is used to determine whether the entity is a novel role. The classification result is expressed as:

P=Softmax(H·W+b) (27)

ŷ=argmax(P) (28)

wherein P∈R2 represents the probability that the candidate role entity is a novel role, W and b are model parameters, and ŷ is the category of the entity predicted by the model (i.e., whether or not the candidate role entity is a novel role).
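The classification head just described (mean-pooled entity feature, concatenation with a text feature, Softmax) can be sketched numerically; here the text feature is a random stand-in for the BiLSTM output, and W, b are random stand-ins for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
X_m = rng.normal(size=(3, d))       # word vectors of a 3-word entity
H_m = X_m.mean(axis=0)              # entity feature: average of word vectors
H_c = rng.normal(size=d)            # stand-in for the BiLSTM text feature

H = np.concatenate([H_m, H_c])      # classification feature [H_m; H_c]
W, b = rng.normal(size=(2 * d, 2)), rng.normal(size=2)

logits = H @ W + b
P = np.exp(logits - logits.max())
P /= P.sum()                        # Softmax over the two classes
y_hat = int(P.argmax())             # predicted category: role / not a role
```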
In some embodiments, the novel role entity classification model is not limited to the above specific model structure, and may be other neural network models, which are not described herein again.
(IV) entity frequency statistic module
In some embodiments, based on the entity classification result, all role entities can be screened out and their frequencies counted, formally represented as (r, p), wherein r and p respectively represent a role entity and its corresponding frequency. The entities meeting the requirement are then screened out as novel roles by filtering with a preset frequency threshold.
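A minimal sketch of the frequency statistics and threshold filtering; the function name and the threshold value are illustrative assumptions:

```python
def filter_by_frequency(role_entities, text, freq_threshold=3):
    """Count each accepted role entity in the text and keep the (r, p)
    pairs whose frequency p meets the preset threshold."""
    stats = [(r, text.count(r)) for r in role_entities]
    return [(r, p) for r, p in stats if p >= freq_threshold]
```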
According to the embodiment of the application, the process of extracting the character role by the traditional named entity recognition model is divided into two parts of candidate role entity extraction and novel role entity classification. The embodiment of the application can be applied to a novel role extraction task and an intention identification service in the novel vertical knowledge graph construction process. In the candidate role entity extraction part, the integrity and the recall rate of the extracted candidate role entities are improved by using a new word discovery model and combining a named entity recognition model based on a reading understanding frame. And the accuracy of the extracted novel role is improved by using the single novel role entity classification model. Through tests, the accuracy rate of the extracted novel roles reaches 96.7%, and the coverage rate of the novel roles on the novel query in the search engine reaches 70%.
An exemplary structure in which the in-text character recognition apparatus provided by the embodiment of the present application is implemented as software modules is described below with reference to fig. 2.
In some embodiments, as shown in fig. 2, the software modules of the in-text character recognition device 243 stored in the memory 240 may include: the first entity identification module 2431 is configured to extract a plurality of role candidate words from a text, and obtain at least one matching parameter corresponding to each role candidate word; the first entity identification module 2431 is further configured to select at least one role candidate word from the plurality of role candidate words as a first candidate role entity according to the at least one matching parameter corresponding to each role candidate word; the second entity identification module 2432 is configured to perform fusion processing on the question corresponding to the text and the text to obtain a fused text; the second entity identification module 2432 is further configured to perform entity identification processing on the fused text to obtain at least one second candidate role entity; the classification module 2433 is configured to perform role classification processing based on the at least one first candidate role entity and the at least one second candidate role entity to obtain a role in the text.
In the scheme, the types of the matching parameters comprise word frequency, solidifying degree and freedom degree; the first entity identifying module 2431 is further configured to perform the following processing for each character candidate word: determining the word frequency of the role candidate words in the text; dividing the role candidate words into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the role candidate words in the text, wherein the types of the morphemes comprise characters and words; determining the degree of consolidation of the role candidate words according to the occurrence probability of each morpheme in the text and the occurrence probability of the role candidate words in the text; and determining the left information entropy and the right information entropy of the role candidate words, and determining the degree of freedom of the role candidate words according to the left information entropy and the right information entropy of the role candidate words.
In the above scheme, the first entity identifying module 2431 is further configured to determine a plurality of left adjacent words and a plurality of right adjacent words of the character candidate word in the text; determining the sub information entropy corresponding to each left adjacent word, and determining the sub information entropy corresponding to each right adjacent word; determining the opposite number of the result of the summation of the sub-information entropies corresponding to each left adjacent word as a left information entropy, and determining the opposite number of the result of the summation of the sub-information entropies corresponding to each right adjacent word as a right information entropy; and when the left information entropy is larger than the right information entropy, determining the right information entropy as the degree of freedom, and when the left information entropy is not larger than the right information entropy, determining the left information entropy as the degree of freedom.
In the above scheme, the first entity identifying module 2431 is further configured to perform the following processing for each left adjacent word: determining the ratio of the occurrence frequency of the left adjacent character in the text to the occurrence frequency of all adjacent characters of the character candidate words in the text as a first ratio; carrying out logarithmic operation processing on the first ratio, and determining the product of the logarithmic operation result and the first ratio as the sub information entropy corresponding to the left adjacent word;
in the above scheme, the first entity identifying module 2431 is further configured to perform the following processing for each right adjacent word: determining the ratio of the occurrence frequency of the right adjacent character in the text to the occurrence frequency of all adjacent characters of the character candidate words in the text as a second ratio; and carrying out logarithm operation processing on the second ratio, and determining the product of the logarithm operation result and the second ratio as the sub information entropy corresponding to the right adjacent word.
In the foregoing solution, the first entity identifying module 2431 is further configured to perform the following processing for each morpheme: determining the ratio of the occurrence frequency of the morphemes in the text to the occurrence frequency of all the role candidate words in the text as the occurrence probability of the morphemes in the text; and determining the ratio of the occurrence times of the role candidate words in the text to the occurrence times of all the role candidate words in the text as the occurrence probability of the role candidate words in the text.
In the above scheme, the first entity identifying module 2431 is further configured to multiply the occurrence probability of each morpheme in the text to obtain a multiplication result; determining the ratio of the occurrence probability of the role candidate words in the text to the product result as a third ratio; and carrying out logarithmic operation processing on the third ratio, and determining the logarithmic operation result as the degree of solidification of the role candidate words.
In the foregoing solution, the first entity identifying module 2431 is further configured to select, as the first candidate role entity, a role candidate word that satisfies at least one of the following conditions from among the plurality of role candidate words: the word frequency of the character candidate words in the text exceeds a word frequency threshold, the degree of solidification of the character candidate words exceeds a degree of solidification threshold, and the degree of freedom of the character candidate words exceeds a degree of freedom threshold.
In the above scheme, the second entity identification module 2432 is further configured to perform feature extraction processing on the fused text to obtain a feature sequence; mapping the characteristic sequence to obtain at least one position set; the following processing is performed for each location set: combining the characters corresponding to the starting position in the position set, the characters between the starting position and the ending position in the position set and the characters corresponding to the ending position in the position set, and determining a combination result as a second candidate role entity.
In the above scheme, the second entity identifying module 2432 is further configured to divide the feature sequence into a plurality of sub-features, where the plurality of sub-features correspond to a plurality of words in the text one to one; mapping each sub-feature to a start probability belonging to a start position and an end probability belonging to an end position; selecting at least one sub-feature with the starting probability larger than the starting probability threshold as a starting sub-feature, and selecting at least one sub-feature with the ending probability larger than the ending probability threshold as an ending sub-feature; constructing at least one candidate start-stop feature set based on the selected at least one start sub-feature and at least one end sub-feature, wherein the candidate start-stop feature set comprises a start sub-feature and an end sub-feature; determining a target start-stop feature set in at least one candidate start-stop feature set; determining characters corresponding to the starting sub-features in the target starting and stopping feature set as starting positions in the position set, and determining characters corresponding to the ending sub-features in the target starting and stopping feature set as ending positions in the position set.
In the foregoing solution, the second entity identifying module 2432 is further configured to perform the following processing for each candidate start-stop feature set: performing fusion processing on the initial sub-feature and the ending sub-feature in the candidate start-stop feature set to obtain a first fusion feature, and mapping the first fusion feature to the probability of belonging to the same entity; and in at least one candidate start-stop feature set, determining the candidate start-stop feature set with the probability of belonging to the same entity larger than the entity probability threshold as a target start-stop feature set.
In the foregoing solution, the second entity identifying module 2432 is further configured to perform the following processing for each second candidate character entity: dividing the second candidate role entity into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate role entity in the text, wherein the types of the morphemes comprise characters and words; determining the degree of solidification of the second candidate role entity according to the probability of occurrence of each morpheme in the text and the probability of occurrence of the second candidate role entity in the text; determining the left information entropy and the right information entropy of the second candidate role entity, and determining the degree of freedom of the second candidate role entity according to the left information entropy and the right information entropy of the second candidate role entity; filtering, among the plurality of second candidate role entities, second candidate role entities that satisfy at least one of the following conditions: the word frequency of the second candidate role entity in the text does not exceed the word frequency threshold, the degree of solidification of the second candidate role entity does not exceed the degree of solidification threshold, and the degree of freedom of the second candidate role entity does not exceed the degree of freedom threshold.
In the foregoing solution, the classifying module 2433 is further configured to filter out duplicate candidate role entities from at least one first candidate role entity and at least one second candidate role entity; and executing the following processing aiming at each candidate role entity obtained after filtering: determining the sentence where the candidate role entity is located, and combining the candidate role entity and the sentence to obtain an entity sentence pair; extracting entity features and text features from the entity sentence pair, and fusing the entity features and the text features to obtain second fusion features; mapping the second fusion feature to a probability of belonging to a role entity; and when the probability of the character entity is greater than the character probability threshold value and the word frequency of the candidate character entity in the text is greater than the character word frequency threshold value, determining that the candidate character entity is the character in the text.
In the above scheme, the classification module 2433 is further configured to extract a plurality of word vectors from the entity sentence pair, and determine a mean value between the word vectors as an entity feature; coding the entity statement according to the direction from the starting position to the ending position to obtain a forward coding vector; coding the entity statement according to the direction from the end position to the start position to obtain a backward coding vector; and carrying out fusion processing on the forward coding vector and the backward coding vector to obtain text characteristics.
In the above scheme, the first entity identifying module 2431 is further configured to obtain a text, and perform the following preprocessing on the text: dividing the text into a plurality of sentences according to the symbol list, and filtering out symbols in each sentence; a plurality of character candidate words are extracted from each sentence of the preprocessed text.
In some embodiments, the logic of the role identification method in the text provided by the embodiments of the present application may be implemented in an intelligent contract, where different nodes determine the roles in the text by invoking the intelligent contracts of the respective nodes, and determine the final role by taking an intersection. According to the method and the device, the accuracy of recognizing the role from the text can be further improved through the cooperative processing among the nodes.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method for identifying roles in text described in the embodiments of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to perform the method for identifying roles in text provided by the embodiments of the present application, for example, the methods shown in FIGS. 3, 4, and 5. Here, the computer includes various computing devices, including intelligent terminals and servers.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM; or may be any device including one of the above memories or any combination thereof.
In some embodiments, the computer-executable instructions may be in the form of programs, software modules, scripts or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and they may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, computer-executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, computer-executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, candidate role entities are extracted from the text in two different ways, and the roles in the text are determined from the extracted candidate role entities, which ensures the diversity and comprehensiveness of the candidate role entities identified from the text and improves the efficiency and accuracy of identifying roles. Moreover, determining the candidate role entities based on the matching parameters and on the fusion text, respectively, ensures the completeness of the identified candidate role entities, further improving the efficiency and accuracy of recognizing roles from the text.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for identifying roles in text, the method comprising:
extracting a plurality of role candidate words from a text, and acquiring at least one matching parameter corresponding to each role candidate word;
selecting at least one role candidate word from the plurality of role candidate words as a first candidate role entity according to at least one matching parameter corresponding to each role candidate word;
performing fusion processing on the question corresponding to the text and the text to obtain a fusion text;
performing entity identification processing on the fusion text to obtain at least one second candidate role entity;
and carrying out role classification processing based on at least one first candidate role entity and at least one second candidate role entity to obtain roles in the text.
2. The method of claim 1, wherein
the types of the matching parameters comprise word frequency, degree of solidification, and degree of freedom;
the obtaining of the at least one matching parameter corresponding to each role candidate word includes:
performing the following processing for each of the character candidate words:
determining the word frequency of the character candidate words in the text;
dividing the role candidate words into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the role candidate words in the text, wherein the types of the morphemes comprise characters and words;
determining the degree of solidification of the role candidate words according to the occurrence probability of each morpheme in the text and the occurrence probability of the role candidate words in the text;
and determining the left information entropy and the right information entropy of the role candidate words, and determining the degree of freedom of the role candidate words according to the left information entropy and the right information entropy of the role candidate words.
3. The method of claim 2, wherein the determining the left information entropy and the right information entropy of the character candidate word and the determining the degree of freedom of the character candidate word according to the left information entropy and the right information entropy of the character candidate word comprise:
determining a plurality of left adjacent characters and a plurality of right adjacent characters of the character candidate words in the text;
determining sub information entropy corresponding to each left adjacent word, and determining sub information entropy corresponding to each right adjacent word;
determining the negative of the sum of the sub information entropies corresponding to the left adjacent words as the left information entropy, and determining the negative of the sum of the sub information entropies corresponding to the right adjacent words as the right information entropy;
and when the left information entropy is larger than the right information entropy, determining the right information entropy as the degree of freedom, and when the left information entropy is not larger than the right information entropy, determining the left information entropy as the degree of freedom.
4. The method of claim 3,
the determining the sub information entropy corresponding to each left adjacent word comprises:
performing the following for each of the left adjacent words:
determining the ratio of the occurrence frequency of the left adjacent character in the text to the occurrence frequency of all adjacent characters of the character candidate words in the text as a first ratio;
carrying out logarithmic operation processing on the first ratio, and determining the product of a logarithmic operation result and the first ratio as a sub information entropy corresponding to the left adjacent word;
the determining the sub information entropy corresponding to each right adjacent word comprises:
performing the following for each of the right adjacent words:
determining the ratio of the occurrence times of the right adjacent characters in the text to the occurrence times of all adjacent characters of the character candidate words in the text as a second ratio;
and carrying out logarithm operation processing on the second ratio, and determining the product of a logarithm operation result and the second ratio as the sub information entropy corresponding to the right adjacent word.
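Claims 3 and 4 together define the left and right information entropies and the degree of freedom (the smaller of the two entropies). A minimal sketch follows; note one assumption: the probabilities here are normalized per side, whereas claim 4 normalizes by the occurrences of all adjacent characters of the candidate word:

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Information entropy of an adjacent-character distribution.

    Per claims 3-4: each sub information entropy is p * log(p), and the
    left/right entropy is the negative of their sum, i.e. H = -sum(p log p).
    """
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(left_neighbors, right_neighbors):
    """Claim 3: the degree of freedom is min(left entropy, right entropy)."""
    return min(neighbor_entropy(left_neighbors), neighbor_entropy(right_neighbors))
```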
5. The method of claim 2, wherein determining the probability of occurrence of each morpheme in the text and the probability of occurrence of the character candidate word in the text comprises:
the following processing is performed for each morpheme: determining the ratio of the occurrence frequency of the morphemes in the text to the occurrence frequency of all the role candidate words in the text as the occurrence probability of the morphemes in the text;
and determining the ratio of the occurrence times of the role candidate words in the text to the occurrence times of all the role candidate words in the text as the occurrence probability of the role candidate words in the text.
6. The method according to claim 2, wherein the determining the degree of solidity of the character candidate word according to the occurrence probability of each morpheme in the text and the occurrence probability of the character candidate word in the text comprises:
multiplying the occurrence probabilities of the morphemes in the text to obtain a product result;
determining the ratio of the occurrence probability of the role candidate words in the text to the product result as a third ratio;
and carrying out logarithmic operation processing on the third ratio, and determining a logarithmic operation result as the degree of solidification of the role candidate words.
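The degree of solidification in claim 6 is a pointwise-mutual-information-style score: the log of the candidate word's probability over the product of its morphemes' probabilities (probabilities being the relative frequencies defined in claim 5). A minimal sketch:

```python
import math

def solidification_degree(word_prob, morpheme_probs):
    """Claim 6: log( P(candidate word) / product of P(morpheme_i) ).

    A high value means the candidate word occurs far more often than its
    morphemes would co-occur by chance, i.e. it is strongly "solidified".
    """
    product = math.prod(morpheme_probs)
    return math.log(word_prob / product)
```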
7. The method of claim 2, wherein the selecting at least one role candidate word from the plurality of role candidate words as a first candidate role entity according to at least one matching parameter corresponding to each role candidate word comprises:
selecting a role candidate word meeting at least one of the following conditions from the plurality of role candidate words as the first candidate role entity:
the word frequency of the role candidate words in the text exceeds a word frequency threshold, the degree of solidification of the role candidate words exceeds a degree of solidification threshold, and the degree of freedom of the role candidate words exceeds a degree of freedom threshold.
8. The method according to claim 1, wherein the performing entity recognition processing on the fused text to obtain at least one second candidate character entity comprises:
carrying out feature extraction processing on the fused text to obtain a feature sequence;
mapping the feature sequence to obtain at least one position set;
performing the following for each of the location sets:
combining the characters corresponding to the starting position in the position set, the characters between the starting position and the ending position in the position set, and the characters corresponding to the ending position in the position set, and determining a combination result as the second candidate role entity.
9. The method according to claim 8, wherein the mapping the feature sequence to obtain at least one position set comprises:
dividing the feature sequence into a plurality of sub-features, wherein the sub-features correspond one-to-one to a plurality of characters in the text;
mapping each sub-feature to a start probability belonging to a start position and an end probability belonging to an end position;
selecting at least one sub-feature with the starting probability larger than a starting probability threshold as a starting sub-feature, and selecting at least one sub-feature with the ending probability larger than an ending probability threshold as an ending sub-feature;
constructing at least one candidate start-stop feature set based on the selected at least one start sub-feature and at least one end sub-feature, wherein the candidate start-stop feature set comprises a start sub-feature and an end sub-feature;
determining a target start-stop feature set in the at least one candidate start-stop feature set;
determining characters corresponding to the starting sub-features in the target starting and stopping feature set as starting positions in the position set, and determining characters corresponding to the ending sub-features in the target starting and stopping feature set as ending positions in the position set.
10. The method of claim 9, wherein determining a target start-stop feature set among the at least one candidate start-stop feature set comprises:
performing the following for each of the candidate start-stop feature sets: performing fusion processing on the starting sub-feature and the ending sub-feature in the candidate starting-ending feature set to obtain a first fusion feature, and mapping the first fusion feature to be the probability of belonging to the same entity;
and in the at least one candidate start-stop feature set, determining the candidate start-stop feature set with the probability of belonging to the same entity larger than an entity probability threshold as the target start-stop feature set.
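Claims 8 to 10 describe span-based entity extraction: select start and end positions from per-character probabilities, pair them into candidate spans, and keep the spans whose same-entity probability passes a threshold. A sketch follows; `pair_score` stands in for the fusion-and-mapping step of claim 10, and all threshold values are illustrative:

```python
def extract_spans(start_probs, end_probs, pair_score,
                  start_th=0.5, end_th=0.5, entity_th=0.5):
    """Build candidate start-stop sets and keep those scored as one entity."""
    starts = [i for i, p in enumerate(start_probs) if p > start_th]
    ends = [j for j, p in enumerate(end_probs) if p > end_th]
    spans = []
    for i in starts:
        for j in ends:
            # Keep only well-ordered spans whose same-entity probability
            # exceeds the entity probability threshold (claim 10).
            if j >= i and pair_score(i, j) > entity_th:
                spans.append((i, j))
    return spans

def spans_to_entities(text, spans):
    """Claim 8: combine the characters from the start to the end position."""
    return [text[i:j + 1] for i, j in spans]
```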
11. The method according to claim 1, wherein after the entity recognition processing is performed on the fused text to obtain at least one second candidate character entity, the method further comprises:
performing the following for each of the second candidate role entities:
dividing the second candidate role entity into a plurality of morphemes, and determining the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate role entity in the text, wherein the types of the morphemes comprise characters and words;
determining the degree of solidification of the second candidate role entity according to the occurrence probability of each morpheme in the text and the occurrence probability of the second candidate role entity in the text;
determining the left information entropy and the right information entropy of the second candidate role entity, and determining the degree of freedom of the second candidate role entity according to the left information entropy and the right information entropy of the second candidate role entity;
filtering, among the plurality of second candidate role entities, second candidate role entities that satisfy at least one of the following conditions:
the word frequency of the second candidate role entity in the text does not exceed a word frequency threshold, the degree of solidification of the second candidate role entity does not exceed a degree of solidification threshold, and the degree of freedom of the second candidate role entity does not exceed a degree of freedom threshold.
12. The method of claim 1, wherein the performing a role classification process based on at least one first candidate role entity and at least one second candidate role entity to obtain the role in the text comprises:
filtering out duplicate candidate role entities among the at least one first candidate role entity and the at least one second candidate role entity;
and performing the following processing for each candidate role entity obtained after filtering:
determining the sentence where the candidate role entity is located, and combining the candidate role entity and the sentence to obtain an entity sentence pair;
extracting entity features and text features from the entity sentence pair, and fusing the entity features and the text features to obtain second fusion features;
mapping the second fusion feature to a probability of belonging to a role entity;
and when the probability of the role entity is greater than a role probability threshold value and the word frequency of the candidate role entity in the text is greater than a role word frequency threshold value, determining that the candidate role entity is the role in the text.
13. An apparatus for identifying roles in text, the apparatus comprising:
the first entity identification module is used for extracting a plurality of role candidate words from a text and acquiring at least one matching parameter corresponding to each role candidate word;
the first entity identification module is further configured to select at least one role candidate word from the plurality of role candidate words as a first candidate role entity according to at least one matching parameter corresponding to each role candidate word;
the second entity identification module is used for fusing the question corresponding to the text and the text to obtain a fused text;
the second entity identification module is further configured to perform entity identification processing on the fusion text to obtain at least one second candidate role entity;
and the classification module is used for performing role classification processing on the basis of at least one first candidate role entity and at least one second candidate role entity to obtain roles in the text.
14. An electronic device, comprising:
a memory for storing computer executable instructions;
a processor for implementing the method for identifying roles in text of any one of claims 1 to 12 when executing the computer-executable instructions stored in the memory.
15. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed, perform the method for identifying roles in text of any one of claims 1 to 12.
CN202110294576.3A 2021-03-19 2021-03-19 Method and device for identifying role in text, electronic equipment and storage medium Pending CN113704420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110294576.3A CN113704420A (en) 2021-03-19 2021-03-19 Method and device for identifying role in text, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113704420A true CN113704420A (en) 2021-11-26

Family

ID=78647835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110294576.3A Pending CN113704420A (en) 2021-03-19 2021-03-19 Method and device for identifying role in text, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113704420A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204642A (en) * 2023-03-06 2023-06-02 上海阅文信息技术有限公司 Intelligent character implicit attribute recognition analysis method, system and application in digital reading
CN116204642B (en) * 2023-03-06 2023-10-27 上海阅文信息技术有限公司 Intelligent character implicit attribute recognition analysis method, system and application in digital reading
CN117057345A (en) * 2023-10-11 2023-11-14 腾讯科技(深圳)有限公司 Role relation acquisition method and related products
CN117057345B (en) * 2023-10-11 2024-01-30 腾讯科技(深圳)有限公司 Role relation acquisition method and related products


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination