CN111382340A - Information identification method, information identification device and electronic equipment

Info

Publication number: CN111382340A
Application number: CN202010200833.8A
Authority: CN
Inventors: 潘禄; 陈玉光; 李法远; 韩翠云; 刘远圳; 黄佳艳
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-03-20
Filing date: 2020-03-20
Publication date: 2020-07-07

Abstract

The application discloses an information identification method, an information identification device and electronic equipment, and relates to the field of knowledge graphs. The specific implementation scheme is as follows: acquiring information to be identified; performing word segmentation processing on the information to be identified to obtain vector information of the information to be identified; and inputting the vector information of the information to be identified into an identification model, so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified. Because the type of the information to be identified is identified through the identification model, the identification accuracy of the information to be identified is improved. In the search field, the improved identification accuracy reduces the interference of hot spot information with the search results and thereby improves the search accuracy.

Description

Information identification method, information identification device and electronic equipment

Technical Field

The present application relates to data processing technologies in the field of computer technologies, and in particular, to an information identification method, an information identification apparatus, and an electronic device.

Background

With the rapid popularization of the internet, the volume of network information has grown explosively, and screening out the information of interest from such a large amount of information takes a long time.

In the related art, when a user queries information through a web browser, query information is input into the browser, a search engine of the browser searches the information according to the query information, and a search result is returned to the browser for display.

Among the results returned by a search engine, there is a type of information that matches the user's query information but whose core content is not highly correlated with the query information; such information is not what the user needs. That is, in the prior art, the low accuracy of information identification leads to low accuracy of the search results.

Disclosure of Invention

The embodiment of the application provides an information identification method, an information identification device and electronic equipment, and aims to solve the problem that in the prior art, the accuracy of a search result is low due to the low accuracy of information identification.

In order to solve the above technical problem, the present application is implemented as follows:

a first aspect of the present application provides an information identification method, including:

acquiring information to be identified;

performing word segmentation processing on the information to be recognized to obtain vector information of the information to be recognized;

and inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified.

Further, the performing word segmentation processing on the information to be recognized to obtain vector information of the information to be recognized includes:

performing word segmentation processing on the information to be recognized to obtain at least one target word;

obtaining vector information of each target word in the at least one target word;

and determining the vector information of the information to be identified according to the vector information of each target word.

Further, the vector information comprises a position vector;

the obtaining vector information of each target word in the at least one target word includes:

if the number of entities included in the information to be identified is greater than or equal to M and the number of verbs included in the information to be identified is greater than or equal to N, acquiring M entities and N verbs in the information to be identified, wherein M and N are positive integers;

for each target word in the at least one target word, respectively calculating M first relative positions of the target word to the M entities and N second relative positions of the target word to the N verbs;

respectively mapping the M first relative positions and the N second relative positions to a normal distribution vector with a preset dimensionality to obtain M first position vectors and N second position vectors;

splicing the M first position vectors according to the sequence of the M entities in the information to be identified to obtain a first spliced vector;

splicing the N second position vectors according to the sequence of the N verbs in the information to be identified to obtain second spliced vectors;

and splicing the first splicing vector and the second splicing vector, and taking a splicing result as a position vector of the target word.

Further, if the number of entities included in the information to be identified is greater than or equal to M and the number of verbs included in the information to be identified is greater than or equal to N, acquiring the M entities and the N verbs in the information to be identified, including:

if the number of entities included in the information to be recognized is greater than M and the number of verbs included in the information to be recognized is greater than or equal to N, or if the number of verbs included in the information to be recognized is greater than N and the number of entities included in the information to be recognized is greater than or equal to M, performing syntactic dependency analysis on the information to be recognized to obtain a plurality of dependency pairs;

selecting entities and verbs in the multiple dependency pairs, wherein the entities and the verbs are included in the same dependency pair, and m entities and n verbs are obtained, and m and n are positive integers;

if m is smaller than M, selecting i entities from the entities of the information to be identified other than the m entities, wherein i is the difference between M and m;

if n is smaller than N, selecting j verbs from the verbs of the information to be identified other than the n verbs, wherein j is the difference between N and n.

Further, the vector information comprises a position vector;

for each target word in the at least one target word, if the number U of entities included in the information to be identified is less than M, obtaining U first relative positions from the target word to the U entities, wherein U and M are positive integers;

padding the U first relative positions with 0 vectors to obtain M first relative positions;

if the number V of verbs included in the information to be identified is smaller than N, acquiring V second relative positions from the target word to the V verbs, wherein V and N are positive integers;

padding the V second relative positions with 0 vectors to obtain N second relative positions;

mapping the M first relative positions and the N second relative positions to the normal distribution vector respectively to obtain M first position vectors and N second position vectors;

Further, the identification result includes: a hotspot-piggybacking type or a non-hotspot-piggybacking type.

A second aspect of the present application provides an information identifying apparatus, comprising:

the first acquisition module is used for acquiring information to be identified;

the second acquisition module is used for performing word segmentation processing on the information to be identified so as to acquire vector information of the information to be identified;

and the third acquisition module is used for inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and acquire the identification result of the information to be identified.

Further, the second obtaining module includes:

the first obtaining sub-module is used for carrying out word segmentation processing on the information to be identified to obtain at least one target word;

the second obtaining submodule is used for obtaining the vector information of each target word in the at least one target word;

and the determining submodule is used for determining the vector information of the information to be identified according to the vector information of each target word.

Further, the vector information comprises a position vector;

the second obtaining sub-module includes:

a first obtaining unit, configured to obtain M entities and N verbs in the information to be identified if the number of entities included in the information to be identified is greater than or equal to M and the number of verbs included in the information to be identified is greater than or equal to N, where M and N are positive integers;

a first calculating unit, configured to calculate, for each of the at least one target word, M first relative positions of the target word to the M entities and N second relative positions of the target word to the N verbs, respectively;

a second obtaining unit, configured to map the M first relative positions and the N second relative positions to a normal distribution vector with a preset dimension, respectively, to obtain M first position vectors and N second position vectors;

a third obtaining unit, configured to splice the M first position vectors according to a sequence of the M entities in the information to be identified, so as to obtain a first spliced vector;

a fourth obtaining unit, configured to splice the N second position vectors according to a sequence of the N verbs in the information to be identified, so as to obtain a second spliced vector;

and the first splicing unit is used for splicing the first splicing vector and the second splicing vector and taking a splicing result as the position vector of the target word.

Further, the first obtaining unit is configured to:

Further, the vector information comprises a position vector;

the second obtaining sub-module includes:

a fifth obtaining unit, configured to, for each target word in the at least one target word, obtain U first relative positions from the target word to U entities if the number U of entities included in the information to be identified is less than M, where U and M are positive integers;

a sixth obtaining unit, configured to pad the U first relative positions with 0 vectors to obtain M first relative positions;

a seventh obtaining unit, configured to obtain, if a number V of verbs included in the information to be recognized is less than N, V second relative positions from the target word to the V verbs, where V and N are positive integers;

an eighth obtaining unit, configured to pad the V second relative positions with 0 vectors to obtain N second relative positions;

a ninth obtaining unit, configured to map the M first relative positions and the N second relative positions onto the normal distribution vector, respectively, to obtain M first position vectors and N second position vectors;

a tenth obtaining unit, configured to splice the M first position vectors according to a sequence of the M entities in the information to be identified, so as to obtain a first spliced vector;

an eleventh obtaining unit, configured to splice the N second position vectors according to a sequence of the N verbs in the information to be identified, so as to obtain a second spliced vector;

and the second splicing unit is used for splicing the first splicing vector and the second splicing vector and taking a splicing result as the position vector of the target word.

A third aspect of the present application provides an electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.

A fourth aspect of the present application provides a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect.

One embodiment in the above application has the following advantages or benefits:

acquiring information to be identified; performing word segmentation processing on the information to be recognized to obtain vector information of the information to be recognized; and inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified. The type of the information to be identified is identified through the identification model, so that the identification result of the information to be identified can be obtained, and the identification accuracy of the information to be identified is improved. In the search field, the information identification accuracy is improved, the interference of hot spot information to the search result can be reduced, and the search accuracy is improved.

Acquiring information to be identified; performing word segmentation processing on the information to be recognized to obtain at least one target word; obtaining vector information of each target word in the at least one target word; determining the vector information of the information to be identified according to the vector information of each target word; and inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified. The type of the information to be identified is identified through the identification model, so that the identification result of the information to be identified can be obtained, and the identification accuracy of the information to be identified is improved. In the search field, the information identification accuracy is improved, the interference of hot spot information to the search result can be reduced, and the search accuracy is improved.

By obtaining the position vector of each target word in the at least one target word, the relative positions between the target word and the entity and verb of the information to be recognized can be represented by the position vector, the type of the information to be recognized can be recognized by utilizing the relationship between the target word and the entity and verb, and the recognition accuracy can be improved.

And if the number of the entities included in the information to be identified is greater than M and the number of the verbs included in the information to be identified is greater than or equal to N, or if the number of the verbs included in the information to be identified is greater than N and the number of the entities included in the information to be identified is greater than or equal to M, performing syntactic dependency analysis on the information to be identified to obtain a plurality of dependency pairs, and preferentially selecting the entities and the verbs included in the same dependency pair in the plurality of dependency pairs to improve the accuracy of identifying the information to be identified.

When the number of entities or the number of verbs included in the information to be recognized is smaller than the corresponding preset value, the U first relative positions or the V second relative positions are padded with 0 vectors so that M first relative positions and N second relative positions are finally obtained, which improves the accuracy with which the recognition model recognizes the type of the information to be recognized.
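The padding described above can be sketched as follows. This is a minimal illustration only: the function name pad_relative_positions, the use of the scalar 0 as the padding value, and the sample values M = 3 and N = 2 with U = 2 and V = 1 are assumptions, not details fixed by the application.

```python
def pad_relative_positions(positions, target_count, pad_value=0):
    """Extend a list of relative positions with zeros up to target_count.

    positions: relative positions actually found (U or V of them).
    target_count: the preset number M or N (assumed >= len(positions)).
    """
    if len(positions) > target_count:
        raise ValueError("more positions than the preset count")
    return list(positions) + [pad_value] * (target_count - len(positions))


M, N = 3, 2                                   # assumed preset values
first = pad_relative_positions([1, -1], M)    # U = 2 entities found, U < M
second = pad_relative_positions([0], N)       # V = 1 verb found, V < N
print(first, second)  # [1, -1, 0] [0, 0]
```

With this choice a padded slot cannot be told apart from a real offset of 0, so an implementation might prefer a dedicated padding marker; the application does not say which is used.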

The recognition result comprises: a hotspot-piggybacking type or a non-hotspot-piggybacking type. Whether the information to be identified is of the hotspot-piggybacking type can therefore be determined with the identification model; in the field of information screening, information judged to be of the hotspot-piggybacking type can be filtered out, which reduces information interference. The information identification method can also be applied to the search field: the information in the search results is judged by the identification model, the information judged to be of the hotspot-piggybacking type is filtered out, and the search accuracy is improved.

Other effects of the above-described alternative will be described below with reference to specific embodiments.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a flowchart of an information identification method provided in an embodiment of the present application;

FIG. 2 is a second flowchart of an information identification method according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a recognition model provided in an embodiment of the present application;

FIG. 4 is a block diagram of an information recognition apparatus provided in an embodiment of the present application;

FIG. 5 is a block diagram of an electronic device for implementing the information identification method according to the embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Referring to fig. 1, fig. 1 is a flowchart of an information identification method provided in an embodiment of the present application, and as shown in fig. 1, the embodiment provides an information identification method applied to an electronic device, including the following steps:

step 101, obtaining information to be identified.

The information to be identified can be information input by a user, and can also be search result information obtained after a search engine carries out query according to query information input by the user. The information to be recognized may be text information, for example, a piece of text.

Step 102, performing word segmentation processing on the information to be recognized to obtain vector information of the information to be recognized.

Word segmentation processing of the information to be recognized yields one or more words, and each word obtained after the word segmentation processing can be a target word. For example, if the information to be recognized is "Xiaoming came to Chinatown", the word segmentation processing yields three words, "Xiaoming", "came to" and "Chinatown", and these three words are the three target words.
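As a rough illustration of word segmentation, the sketch below uses greedy forward maximum matching against a small dictionary. The application does not name a segmentation algorithm, so this technique is a stand-in; the function name segment, the toy vocabulary, and the Chinese sentence (an assumed rendering of the example above) are all hypothetical.

```python
def segment(text, vocabulary):
    """Greedy forward maximum matching.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max(len(word) for word in vocabulary)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary: "Xiaoming", "came to", "Chinatown".
VOCAB = {"小明", "来到", "唐人街"}
target_words = segment("小明来到唐人街", VOCAB)
print(target_words)  # ['小明', '来到', '唐人街']
```

Each element of target_words would then be treated as one target word in the subsequent steps.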

After the word segmentation processing, at least one target word is obtained; vector information of each target word is then obtained, and the vector information of the information to be identified is obtained from the vector information of each target word. The vector information includes a position vector, and the position vector of a target word is determined according to the target word and the entities and verbs in the information to be recognized. Words representing persons, organizations, places, institutions, or the like in the information to be recognized may be regarded as entities.

The vector information may further include a word vector and a part-of-speech vector, and may additionally include a noun vector and a pronoun vector.

The position vector is a vector representation of the relative positions between the current word (i.e., the target word) and the subjects of a potential event (including the potential entities and the triggers of the potential event, i.e., the entities and verbs in the information to be recognized).

Step 103, inputting the vector information of the information to be recognized into a recognition model, so as to recognize the type of the information to be recognized through the recognition model, and obtain a recognition result of the information to be recognized.

The vector information of the information to be identified is input into the identification model to obtain an identification result. The identification result can be understood as the degree of matching between a keyword in the information to be identified and the core content of the information to be identified. The core content of the information to be identified can also be understood as the semantics of the information to be identified.

The training sample of the recognition model can be vector information obtained according to the training corpus, and the neural network model is trained by utilizing the vector information obtained according to the training corpus to obtain the recognition model. The way of obtaining the vector information according to the training corpus is the same as the way of obtaining the vector information according to the information to be identified in the present application, and is not described herein again.

In the embodiment, information to be identified is acquired; performing word segmentation processing on the information to be recognized to obtain vector information of the information to be recognized; and inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified. The type of the information to be identified is identified through the identification model, so that the identification result of the information to be identified can be obtained, and the identification accuracy of the information to be identified is improved. In the search field, the information identification accuracy is improved, the interference of hot spot information to the search result can be reduced, and the search accuracy is improved.

Referring to fig. 2, fig. 2 is a second flowchart of an information identification method provided in the present embodiment, and as shown in fig. 2, the present embodiment provides an information identification method applied to an electronic device, including the following steps:

step 201, obtaining information to be identified.

Step 202, performing word segmentation processing on the information to be recognized to obtain at least one target word.

There may be one or more target words. Word segmentation processing of the information to be recognized yields one or more words, and each word obtained after the word segmentation processing is a target word. For example, if the information to be recognized is "Xiaoming came to Chinatown", the word segmentation processing yields three words, "Xiaoming", "came to" and "Chinatown", and these three words are the three target words.

Step 203, obtaining vector information of each target word in the at least one target word, wherein the vector information includes a position vector.

Vector information is obtained for each target word, one piece of vector information per target word. The position vector of a target word is determined according to the target word and the entities and verbs in the information to be recognized. Words representing persons, organizations, places, institutions, or the like in the information to be recognized may be regarded as entities.

The vector information may also include word vectors and part-of-speech vectors. A word vector is obtained by inputting the target word into an unsupervised model, whose training samples can comprise news titles and body text. A part-of-speech vector (POS embedding) is obtained by mapping the part of speech of a target word to a multi-dimensional vector; identical parts of speech are initialized with the same vector, and during training of the recognition model the values of the part-of-speech vectors can be optimized according to the training corpus and the training target.
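The part-of-speech mapping can be pictured as a lookup table with one vector per tag. The following sketch makes several assumptions that the application does not state: the tag names, the dimension of 4, the Gaussian initialization and the seed are illustrative, and the later optimization of the values during training is not shown.

```python
import random


def build_pos_table(tags, dim, seed=0):
    """Create one initial vector per part-of-speech tag.

    Every word carrying the same tag shares the same vector, so the
    table has exactly one row per tag.
    """
    rng = random.Random(seed)
    return {tag: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tag in tags}


POS_DIM = 4  # assumed dimension
pos_table = build_pos_table(["noun", "verb", "pronoun"], POS_DIM)

# Hypothetical tagged output for the three target words of the example.
tagged = [("Xiaoming", "noun"), ("came to", "verb"), ("Chinatown", "noun")]
pos_vectors = [pos_table[tag] for _, tag in tagged]
```

Here "Xiaoming" and "Chinatown" receive the identical vector because both are tagged as nouns, while "came to" receives the verb vector.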

Further, the vector information may also include a noun vector and a pronoun vector. Nouns are extracted with a language-processing tool and can be entity nouns such as persons, institutions and places; pronouns, such as he, she and it, are obtained by rules. Both are converted into vectors that form part of the input features (i.e., the vector information).

Step 204, determining the vector information of the information to be identified according to the vector information of each target word.

The vector information of the target words is spliced to obtain the vector information of the information to be identified. For example, if the information to be recognized includes 2 target words whose vector information is A and B, the vector information of the information to be recognized is obtained by joining A and B end to end in the manner of strings. The symbols A and B merely denote two vectors and do not limit how the vectors are expressed.
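A minimal sketch of this step, assuming each piece of vector information is a flat list of numbers, is given below. The function name concat_vectors and the sample values of A and B are made up for illustration.

```python
from itertools import chain


def concat_vectors(vectors):
    """Join the per-word vectors end to end, preserving word order."""
    return list(chain.from_iterable(vectors))


A = [0.1, 0.2, 0.3]  # vector information of the first target word (made up)
B = [0.4, 0.5, 0.6]  # vector information of the second target word (made up)
sentence_vector = concat_vectors([A, B])
print(sentence_vector)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

The length of the result is the sum of the lengths of the per-word vectors, so the order of the target words is preserved in the layout of the spliced vector.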

Step 205, inputting the vector information of the information to be recognized into a recognition model, so as to recognize the type of the information to be recognized through the recognition model, and obtain a recognition result of the information to be recognized.

The hotspot-piggybacking type refers to information that contains a keyword (query) a user may search for, but in which that keyword is not the core content the information is about. For example, consider the text "Zhang San is a stand-in for the famous movie star Li Si, has fair and beautiful skin, graduated from a top film and television university in the country, and has rich performance experience". The core content of this text is information about Zhang San. If the text is retrieved by searching with the keyword "Li Si", the text belongs to the hotspot-piggybacking type, because its core content is not about Li Si.

As another example, consider the text "Zhang San is a famous movie star, has fair and beautiful skin, graduated from a top film and television university in the country, and has rich performance experience". The core content of this text is information about Zhang San. If the text is retrieved by searching with the keyword "Zhang San", the text belongs to the non-hotspot-piggybacking type, because its core content is about Zhang San.

The training sample of the recognition model can be vector information obtained according to the training corpus, and the neural network model is trained by utilizing the vector information obtained according to the training corpus to obtain the recognition model.

In the embodiment, information to be identified is acquired; performing word segmentation processing on the information to be recognized to obtain at least one target word; obtaining vector information of each target word in the at least one target word; determining the vector information of the information to be identified according to the vector information of each target word; and inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified. The type of the information to be identified is identified through the identification model, so that the identification result of the information to be identified can be obtained, and the identification accuracy of the information to be identified is improved. In the search field, the information identification accuracy is improved, the interference of hot spot information to the search result can be reduced, and the search accuracy is improved.

In an embodiment of the present application, the step 203 of obtaining vector information of each target word in the at least one target word includes:

In this embodiment, M and N are preset values; for example, M may be set to 2 and N to 1. Preferably, M is 3 and N is 2. If the number of entities included in the information to be recognized is greater than or equal to M and the number of verbs included in the information to be recognized is greater than or equal to N, that is, neither the number of entities nor the number of verbs is less than its preset value, then M entities and N verbs can be obtained from the information to be recognized.

For each target word of the at least one target word, M first relative positions of the target word to the M entities and N second relative positions of the target word to the N verbs are calculated. For example, suppose the at least one target word includes a first target word and a second target word, the entities include a first entity and a second entity, and the verbs include a first verb. For the first target word, the first relative position between the first target word and the first entity and the first relative position between the first target word and the second entity are calculated, giving 2 first relative positions; the second relative position between the first target word and the first verb is calculated, giving 1 second relative position.

Similarly, for the second target word, the first relative position between the second target word and the first entity and the first relative position between the second target word and the second entity are calculated, giving 2 first relative positions; the second relative position between the second target word and the first verb is calculated, giving 1 second relative position.
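The calculation above can be sketched as follows for M = 2 and N = 1. The application does not define how a relative position is measured; the sketch assumes it is the signed difference between token indices, and the token list, the entity and verb indices and the function name relative_positions are hypothetical.

```python
def relative_positions(word_index, anchor_indices):
    """Signed token offsets from the target word to each anchor word."""
    return [word_index - anchor for anchor in anchor_indices]


tokens = ["Xiaoming", "came to", "Chinatown"]
entity_indices = [0, 2]   # assumed entities: "Xiaoming", "Chinatown" (M = 2)
verb_indices = [1]        # assumed verb: "came to" (N = 1)

positions = {}
for i, token in enumerate(tokens):
    first = relative_positions(i, entity_indices)    # M first relative positions
    second = relative_positions(i, verb_indices)     # N second relative positions
    positions[token] = (first, second)
    print(token, first, second)
```

Every target word, including the entities and the verb themselves, thus receives M first relative positions and N second relative positions.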

And then, for each target word, mapping the M first relative positions and the N second relative positions corresponding to the target word to a normal distribution vector with preset dimensionality respectively to obtain M first position vectors and N second position vectors. The preset dimension can be set according to actual conditions, and is not limited herein.
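One possible reading of this mapping is a lookup table in which each distinct relative position is assigned a vector of the preset dimension whose components are drawn from a normal distribution. The sketch below follows that reading only; the class name PositionEmbedding, the standard normal parameters, the seed and the dimension of 4 are assumptions and are not specified in the application.

```python
import random


class PositionEmbedding:
    """Maps each relative position to a fixed, normally distributed vector."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}

    def __call__(self, offset):
        # Draw the vector the first time an offset is seen, then reuse it,
        # so equal relative positions always map to the same vector.
        if offset not in self._table:
            self._table[offset] = [
                self._rng.gauss(0.0, 1.0) for _ in range(self.dim)
            ]
        return self._table[offset]


embed = PositionEmbedding(dim=4)                            # assumed preset dimension
first_position_vectors = [embed(p) for p in [0, -2]]        # M = 2
second_position_vectors = [embed(p) for p in [-1]]          # N = 1
```

In a trainable model such a table would typically be a parameter updated during training; that part is outside the scope of this sketch.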

Further, the M first position vectors corresponding to the target word are spliced according to the order of the M entities in the information to be identified to obtain a first spliced vector, and the N second position vectors corresponding to the target word are spliced according to the order of the N verbs in the information to be identified to obtain a second spliced vector. Splicing here is understood in the sense of string concatenation, i.e. the M first position vectors are joined end to end in the manner of strings.

The first spliced vector and the second spliced vector corresponding to the target word are then spliced, and the splicing result is taken as the position vector of the target word. In the present application, this splicing is likewise understood as joining the first spliced vector and the second spliced vector end to end in the manner of character strings.
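The two splicing steps amount to plain end-to-end concatenation. The following minimal sketch assumes the position vectors are already listed in the order in which the corresponding entities and verbs appear; the function names and the numeric values are hypothetical.

```python
from itertools import chain


def splice(vectors):
    """Join vectors end to end, like concatenating strings."""
    return list(chain.from_iterable(vectors))


def position_vector(first_vectors, second_vectors):
    """Build the position vector of one target word."""
    first_spliced = splice(first_vectors)    # M entity vectors, in entity order
    second_spliced = splice(second_vectors)  # N verb vectors, in verb order
    return first_spliced + second_spliced


# Toy example with M = 3, N = 2 and a preset dimension of 2.
first = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
second = [[0.7, 0.8], [0.9, 1.0]]
pos_vec = position_vector(first, second)
```

The resulting position vector has (M + N) times the preset dimension components, here 10.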

In this embodiment, by obtaining the position vector of each target word in the at least one target word, the relative positions between the target word and the entity and the verb of the information to be recognized can be represented by the position vector, and the type of the information to be recognized is recognized by using the relationship between the target word and the entity and the verb, so that the recognition accuracy can be improved.

In an embodiment of the present application, if the number of entities included in the information to be identified is greater than or equal to M, and the number of verbs included in the information to be identified is greater than or equal to N, acquiring the M entities and the N verbs in the information to be identified includes:

selecting entities and verbs in the plurality of dependency pairs, wherein the entities and the verbs are included in the same dependency pair, and m entities and n verbs are obtained; m and n are positive integers;

In this embodiment, when the number of entities of the information to be recognized is greater than M and the number of verbs is not less than N, or when the number of verbs of the information to be recognized is greater than N and the number of entities is not less than M, M entities and N verbs need to be selected from the entities and the verbs of the information to be recognized.

During selection, entities and verbs located in the same dependency pair are selected preferentially, that is, an entity and a verb that have a direct dependency relationship form a dependency pair. For example, in a sentence such as "X yells at Li Si and has Li Si call Wang Wu", the entity X has a direct relationship with the verb "yell" and is located in the same dependency pair as "yell", but has no direct relationship with the verb "call"; the entity X and the verb "yell", which are located in the same dependency pair, are therefore selected preferentially.

After all entities and verbs located in the same dependency pair have been selected, if the number m of selected entities is smaller than M, i entities are selected from the remaining entities of the information to be identified, so that the total number of finally selected entities is M. The i entities may be selected according to the importance of the remaining entities or according to the order of the remaining entities in the information to be identified, which is not limited herein.

Likewise, if the number n of selected verbs is smaller than N, j verbs are selected from the remaining verbs of the information to be identified, so that the total number of finally selected verbs is N. The j verbs may be selected according to the importance scores of the remaining verbs or according to the order of the remaining verbs in the information to be identified, which is not limited herein.

In this embodiment, when the number of entities included in the information to be recognized is greater than M and the number of verbs is greater than or equal to N, or when the number of verbs is greater than N and the number of entities is greater than or equal to M, syntactic dependency analysis is performed on the information to be recognized to obtain a plurality of dependency pairs, and the entities and verbs included in the same dependency pair are selected preferentially, which improves the accuracy of recognizing the information to be recognized.
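The selection rule can be sketched as follows. The sketch assumes that the dependency pairs are already available as (entity, verb) tuples, that the remaining items are chosen by order of appearance (one of the two options mentioned above), and that paired items beyond the limit are simply cut off in sentence order; the last point is not specified in the text. All names are hypothetical.

```python
def select_by_dependency(candidates, paired, limit):
    """Pick `limit` items, preferring those that occur in a dependency pair.

    candidates: unique items in sentence order.
    paired: set of items that appear in some (entity, verb) dependency pair.
    """
    preferred = [c for c in candidates if c in paired][:limit]
    remaining = [c for c in candidates if c not in paired]
    chosen = preferred + remaining[: limit - len(preferred)]
    # Restore sentence order so later splicing follows the original sequence.
    return sorted(chosen, key=candidates.index)


def select_entities_and_verbs(entities, verbs, dependency_pairs, m=3, n=2):
    paired_entities = {e for e, _ in dependency_pairs}
    paired_verbs = {v for _, v in dependency_pairs}
    return (
        select_by_dependency(entities, paired_entities, m),
        select_by_dependency(verbs, paired_verbs, n),
    )


# Hypothetical analysis result: four entities, three verbs, two dependency pairs.
entities = ["A", "B", "C", "D"]
verbs = ["v1", "v2", "v3"]
pairs = [("D", "v3"), ("B", "v3")]
chosen_entities, chosen_verbs = select_entities_and_verbs(entities, verbs, pairs)
```

Here B and D are kept because they form dependency pairs with v3, and A fills the remaining entity slot because it appears first in the sentence.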

mapping the M first relative positions and the N second relative positions to the normal distribution vector respectively to obtain M first position vectors and N second position vectors of the target word;

This embodiment covers the case in which the number of entities or verbs included in the information to be recognized is smaller than a preset value. M and N are preset values; preferably, M is 3 and N is 2.

If the number U of entities included in the information to be identified is less than M, U first relative positions of the target word to the U entities are obtained, and the U first relative positions are then initialized with 0 vectors to obtain M first relative positions; during initialization, the U first relative positions may be padded with one or more 0 vectors until there are M first relative positions. The length of a 0 vector is the same as the length of a position vector.

If the number V of verbs included in the information to be identified is less than N, V second relative positions of the target word to the V verbs are obtained, and the V second relative positions are then initialized with 0 vectors to obtain N second relative positions; during initialization, the V second relative positions may be padded with one or more 0 vectors until there are N second relative positions. Here too, the length of a 0 vector is the same as the length of a position vector.

Finally, the M first relative positions and the N second relative positions are each mapped to the normal distribution vector to obtain M first position vectors and N second position vectors; the M first position vectors are spliced according to the order of the M entities in the information to be identified to obtain a first spliced vector; the N second position vectors are spliced according to the order of the N verbs in the information to be identified to obtain a second spliced vector; and the first spliced vector and the second spliced vector are spliced, the splicing result being taken as the position vector of the target word. All target words in the information to be identified can be processed in this manner to obtain the M first position vectors and N second position vectors corresponding to each target word.

In this embodiment, when the number of entities or the number of verbs included in the information to be recognized is smaller than a preset value, the U first relative positions or the V second relative positions are initialized by using the 0 vector, so as to finally obtain the M first relative positions and the N second relative positions, thereby improving the accuracy of the recognition model in performing type recognition on the information to be recognized.
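The padding step can be sketched as below. For simplicity the sketch pads at the level of position vectors, using 0 vectors whose length equals that of a position vector; applying the padding at exactly this stage is an assumption, and the function name and values are hypothetical.

```python
def pad_with_zero_vectors(vectors, target_count, dim):
    """Append 0 vectors until there are `target_count` vectors of length `dim`."""
    if len(vectors) > target_count:
        raise ValueError("more vectors than the preset count; select first")
    padding = [[0.0] * dim for _ in range(target_count - len(vectors))]
    return list(vectors) + padding


# U = 1 entity found, M = 3 expected, preset dimension 2.
entity_vectors = pad_with_zero_vectors([[0.5, -0.1]], target_count=3, dim=2)
# V = 2 verbs found, N = 2 expected: nothing to pad.
verb_vectors = pad_with_zero_vectors([[0.2, 0.3], [0.4, 0.1]], target_count=2, dim=2)
```

With this padding every target word ends up with exactly M entity slots and N verb slots, so all position vectors have the same length.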

In this embodiment, the manner of obtaining the word vector, the position vector and the part-of-speech vector from the information to be recognized may also be applied to the processing of the training corpus in the training stage of the recognition model. Fig. 3 is a schematic diagram of the layer structure of the recognition model.

As shown in fig. 3, the recognition model includes an input layer, a convolutional layer, a fully-connected layer and an output layer. The input layer receives the word vectors, position vectors, part-of-speech vectors, and noun and pronoun vectors of the training corpus.

Vectorizing the input sentence: the sentence is first segmented into words, and each word (that is, each word obtained after segmenting the input sentence) is represented by three vectors:

Word vector (Word Embedding): the word vectors are obtained by training an unsupervised model on a large amount of word-segmented text. A training corpus is constructed for this training using existing open-source trained word vectors, and the training corpus consists of a large number of news titles and body texts.

Position vector (Position Embedding): the position vector is a vector representation of the relative positions between the current word (a word obtained after word segmentation) and the potential event subjects (the potential entities and potential event trigger words, i.e. the entities and verbs of the input sentence). First, a natural language processing service (NLPC) is used to identify the entities and verbs in the sentence; then the relative position between each word and each identified entity and verb is calculated. For example, if the current word is the 4th word in the sentence and the position of an entity in the sentence is 7, the position of the current word relative to the entity is -4. Then -4 is mapped to a normal distribution vector of the preset dimension; different numbers are mapped to different vectors, and the vectors are optimized according to the training corpus and the training target during model training.

Because a sentence may contain multiple entities and verbs, the present application preferably uses three entities and two verbs. If there are more, they are screened through dependency analysis: dependency analysis is used to obtain the dependency relationships between the words, and entities and verbs that have a direct relationship with each other are selected preferentially; otherwise, selection follows the order in which the entities and verbs appear in the sentence. If there are fewer, the missing positions are initialized with a specific vector. Finally, the three entity relative position vectors and the two verb relative position vectors are spliced together to serve as the position vector of the word.
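Putting these steps together for a single word gives the following end-to-end sketch. It assumes 0-based positions (so the 4th word has index 3 and an entity at position 7 gives 3 - 7 = -4), a deterministic stand-in for the trainable embedding table, and 0 vectors as the "specific vector" used for missing entities or verbs; the dimension and all names are hypothetical.

```python
import random

M, N, DIM = 3, 2, 4  # preferred counts from the text; DIM is an assumed dimension


def embed(rel_pos, dim=DIM):
    """Deterministic N(0, 1) vector per relative position.

    Stand-in for a trainable table: a string seed keeps -4 and 4 distinct.
    """
    rng = random.Random(f"pos:{rel_pos}")
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def word_position_vector(word_idx, entity_idx, verb_idx):
    """Concatenate M entity slots and N verb slots for one word."""
    parts = []
    for anchors, count in ((entity_idx, M), (verb_idx, N)):
        vecs = [embed(word_idx - a) for a in anchors[:count]]
        vecs += [[0.0] * DIM] * (count - len(vecs))  # pad missing slots
        for v in vecs:
            parts.extend(v)
    return parts


# Word at index 3; entities at indices 7 and 0; one verb at index 5.
vec = word_position_vector(3, entity_idx=[7, 0], verb_idx=[5])
```

The vector has (3 + 2) * DIM components; the third entity slot and the second verb slot are zero because the toy sentence has only two entities and one verb.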

Part-of-speech vector (POS Embedding): the part of speech of a word is mapped to a multi-dimensional vector; the same part of speech is initialized with the same vector, and during model training the values of the vector are optimized according to the training corpus and the training target.

Noun and pronoun vectors: nouns are extracted by a linguistic tool and may be entity nouns such as persons, organizations and places; pronouns, such as "he" and "she", are obtained through rules. Both are converted into vectors and used as part of the input features.
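One simple way to turn nouns and pronouns into input features is a two-component indicator vector per word, sketched below. The tag set, the pronoun list and the two-component encoding are all assumptions made for illustration; the text does not specify how these vectors are built.

```python
# Hypothetical rule list for pronouns and hypothetical entity-noun tags.
PRONOUNS = {"he", "she", "it", "they"}
NOUN_TAGS = {"PER", "ORG", "LOC"}  # person, organization, place


def noun_pronoun_vector(word, tag):
    """Return [is_entity_noun, is_pronoun] for one segmented word."""
    is_noun = 1.0 if tag in NOUN_TAGS else 0.0
    is_pronoun = 1.0 if word.lower() in PRONOUNS else 0.0
    return [is_noun, is_pronoun]


# Toy tagged sentence; "O" marks words that are not entity nouns.
tagged = [("Baidu", "ORG"), ("said", "O"), ("she", "O"), ("left", "O")]
features = [noun_pronoun_vector(w, t) for w, t in tagged]
```

Each word's indicator vector can then be placed alongside its other vectors in the input layer.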

Convolutional layer: the convolutional layer extracts local features through a plurality of convolution kernels (feature maps) while avoiding an excessive number of parameters in the network. In the present application, a convolutional layer with a convolution window of 3 is used to extract features, and the number of extracted features is a predefined parameter. Equal-length convolution is used, so the width of the convolution result is the same as the width of the input.

The convolution features are then pooled. The purpose of pooling is to find the most important feature information at the same position; the present application uses a max pooling operation, in which the maximum value is taken over each dimension, and the pooled result is output.

Finally, a nonlinear transformation is applied to the pooled result to obtain the sentence context vector, which represents the context features of the whole sentence; tanh is used for the nonlinear transformation in the present application.
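The three operations (equal-length convolution with a window of 3, max pooling and tanh) can be sketched in plain Python as follows. The sketch assumes zero padding at both ends of the sentence, omits bias terms, and pools over word positions separately for each feature; these details and all names are assumptions rather than the patented implementation.

```python
import math


def conv1d_same(inputs, kernels, window=3):
    """Equal-length 1-D convolution over a sequence of word vectors.

    inputs: T word vectors of length D.
    kernels: F kernels, each holding `window` weight vectors of length D.
    Returns T rows with F features each (zero padding keeps the length T).
    """
    dim = len(inputs[0])
    pad = window // 2
    padded = [[0.0] * dim] * pad + list(inputs) + [[0.0] * dim] * pad
    out = []
    for t in range(len(inputs)):
        row = []
        for kernel in kernels:
            s = 0.0
            for k in range(window):
                s += sum(w * x for w, x in zip(kernel[k], padded[t + k]))
            row.append(s)
        out.append(row)
    return out


def max_pool(features):
    """Take the maximum of each feature dimension over all positions."""
    return [max(column) for column in zip(*features)]


def sentence_context(inputs, kernels):
    """Convolution, max pooling, then tanh as the nonlinear transformation."""
    return [math.tanh(v) for v in max_pool(conv1d_same(inputs, kernels))]


# Toy input: three words with one-dimensional vectors and two kernels.
words = [[1.0], [2.0], [3.0]]
kernels = [
    [[1.0], [1.0], [1.0]],  # sums the window
    [[0.0], [1.0], [0.0]],  # copies the centre word
]
conv_out = conv1d_same(words, kernels)
context = sentence_context(words, kernels)
```

The convolution output has one row per input word, and the context vector has one component per kernel regardless of the sentence length.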

Fully-connected layer: its input is the sentence context vector output by the convolutional layer.

Output layer: the input of the output layer is a multi-dimensional vector and its output is one of the predefined categories. In the present application, the context vector of the sentence and the level vector-language model of the sentence are spliced together as the input, and the output is the probability that the sentence is of the twitch hot spot type.
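A minimal sketch of such an output layer is given below. It assumes a linear layer followed by a two-class softmax, and it treats the second input simply as another sentence-level vector; the weights, the second vector and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(context_vec, sentence_vec, weights, biases):
    """Splice the two vectors and map them to class probabilities.

    weights: one row of coefficients per predefined category.
    """
    features = list(context_vec) + list(sentence_vec)
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy values: 2-dimensional context vector, 1-dimensional sentence-level vector.
probs = output_layer(
    [0.9, -0.3],
    [0.5],
    weights=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    biases=[0.0, 0.0],
)
```

The first component of `probs` can be read as the probability of the positive category and the second as the probability of the negative category.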

Referring to fig. 4, fig. 4 is a structural diagram of an information recognition apparatus according to an embodiment of the present application, and as shown in fig. 4, the present embodiment provides an information recognition apparatus 400 including:

a first obtaining module 401, configured to obtain information to be identified;

a second obtaining module 402, configured to perform word segmentation processing on the information to be identified to obtain vector information of the information to be identified;

a third obtaining module 403, configured to input vector information of the information to be identified into an identification model, so as to identify the type of the information to be identified through the identification model, and obtain an identification result of the information to be identified.

In one embodiment of the present application, the recognition result includes: a twitch hot spot type or a non-twitch hot spot type.
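The cooperation of the three modules can be pictured with the following structural sketch. The vectorizer and the model are passed in as plain callables, and the 0.5 decision threshold is an assumption; the class and the stub functions are hypothetical and only illustrate the data flow.

```python
from typing import Callable, List

LABELS = ("twitch hot spot type", "non-twitch hot spot type")


class InformationRecognizer:
    """Chains the steps of modules 401 to 403: obtain, vectorize, recognize."""

    def __init__(
        self,
        vectorize: Callable[[str], List[float]],
        model: Callable[[List[float]], float],
        threshold: float = 0.5,  # assumed decision threshold
    ):
        self.vectorize = vectorize
        self.model = model
        self.threshold = threshold

    def recognize(self, text: str) -> str:
        vector = self.vectorize(text)      # second obtaining module
        probability = self.model(vector)   # third obtaining module
        return LABELS[0] if probability >= self.threshold else LABELS[1]


# Stub components used only to exercise the data flow.
def toy_vectorize(text: str) -> List[float]:
    return [float(len(text.split()))]


def toy_model(vector: List[float]) -> float:
    return 0.9 if vector[0] > 3 else 0.1


recognizer = InformationRecognizer(toy_vectorize, toy_model)
result = recognizer.recognize("a b c d e")
```

In a real system the stubs would be replaced by the word segmentation and vectorization steps and by the trained recognition model.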

In an embodiment of the present application, the second obtaining module 402 includes:

In an embodiment of the present application, the second obtaining sub-module includes:

In an embodiment of the present application, the first obtaining unit is configured to:

selecting entities and verbs in the plurality of dependency pairs, wherein the entities and the verbs are included in the same dependency pair, and m entities and n verbs are obtained;

In an embodiment of the present application, the second obtaining sub-module includes:

The information identification apparatus 400 can implement the processes implemented by the electronic device in the method embodiments shown in fig. 1-2, and is not described herein again to avoid repetition.

The information identification device 400 of the embodiment of the application acquires information to be identified; performing word segmentation processing on the information to be recognized to obtain vector information of the information to be recognized; and inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified. The type of the information to be identified is identified through the identification model, so that the identification result of the information to be identified can be obtained, and the identification accuracy of the information to be identified is improved. In the search field, the information identification accuracy is improved, the interference of hot spot information to the search result can be reduced, and the search accuracy is improved.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 5 is a block diagram of an electronic device for the information identification method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 5, the electronic device includes: one or more processors 501, a memory 502, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 501 is taken as an example.

Memory 502 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the information identification method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the information identification method provided by the present application.

The memory 502, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the information identification method in the embodiment of the present application (for example, the first obtaining module 401, the second obtaining module 402, and the third obtaining module 403 shown in fig. 4). The processor 501 executes various functional applications of the server and data processing, i.e., implements the information identification method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 502.

The memory 502 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device implementing the information recognition method, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 502 optionally includes memory located remotely from processor 501, which may be connected via a network to an electronic device implementing the information recognition method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The electronic device implementing the information recognition method may further include: an input device 503 and an output device 504. The processor 501, the memory 502, the input device 503 and the output device 504 may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.

The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device implementing the information recognition method. Examples of the input device include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, and the like. The output device 504 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

According to the technical scheme of the embodiment of the application, information to be identified is obtained; performing word segmentation processing on the information to be recognized to obtain vector information of the information to be recognized; and inputting the vector information of the information to be identified into an identification model so as to identify the type of the information to be identified through the identification model and obtain an identification result of the information to be identified. The type of the information to be identified is identified through the identification model, so that the identification result of the information to be identified can be obtained, and the identification accuracy of the information to be identified is improved. In the search field, the information identification accuracy is improved, the interference of hot spot information to the search result can be reduced, and the search accuracy is improved.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. An information identification method, comprising:

acquiring information to be identified;

2. The information identification method according to claim 1, wherein the performing word segmentation processing on the information to be identified to obtain vector information of the information to be identified comprises:

3. The information identification method according to claim 2, wherein the vector information includes a position vector;

4. The information identification method according to claim 3, wherein if the number of entities included in the information to be identified is greater than or equal to M and the number of verbs included in the information to be identified is greater than or equal to N, acquiring the M entities and the N verbs in the information to be identified includes:

5. The information identification method according to claim 2, wherein the vector information includes a position vector;

6. The information identification method according to claim 1, wherein the identification result includes: a twitch hot spot type or a non-twitch hot spot type.

7. An information identifying apparatus, comprising:

8. The information identifying apparatus according to claim 7,

the second obtaining module includes:

9. The information identifying apparatus according to claim 8, wherein the vector information includes a position vector;

the second obtaining sub-module includes:

10. The information identification device according to claim 9, wherein the first acquisition unit is configured to:

11. The information identifying apparatus according to claim 8, wherein the vector information includes a position vector;

the second obtaining sub-module includes:

a ninth obtaining unit, configured to map the M first relative positions and the N second relative positions to a normal distribution vector of a preset dimension, respectively, to obtain M first position vectors and N second position vectors;

12. The information recognition apparatus according to claim 7, wherein the recognition result includes: a twitch hot spot type or a non-twitch hot spot type.

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.

14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.