CN114491104A - Method and device for identifying keywords

Publication number: CN114491104A
Application number: CN202011267216.6A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 金志威
Prior art keywords: visual, feature, features, semantic, vector
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed 2020-11-13 by Beijing Dajia Internet Information Technology Co Ltd
Publication date: 2022-05-13

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483: Retrieval characterised by using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Abstract

The application relates to the technical field of text mining and discloses a method and a device for identifying keywords, which are used for solving the problem of low accuracy in keyword extraction. The method comprises the following steps: extracting semantic features from the text sequence and visual features from the visual sequence; determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; and finally, generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value as the keywords. Because the sequence information among the words is considered during encoding and the visual features are used to assist the model, the recognition accuracy of the model is improved.

Description

Method and device for identifying keywords
Technical Field
The invention relates to the technical field of text mining, in particular to a method and a device for identifying keywords.
Background
Network live broadcasting is an audio-visual entertainment form that has emerged with the development of mobile terminals in recent years. It offers high interactivity and broad reach, more and more users choose to share information through live broadcasts, and live broadcast platforms have gradually grown into a production channel for multimedia information. Quickly and effectively identifying the keywords in massive multimedia live broadcast data is therefore the key to achieving semantic interoperability of multimedia data and fast retrieval of multimedia information.
At present, keyword identification mainly takes text as input and identifies keywords by using word statistics in the text or by using a graph model. However, such recognition methods rely mainly on static word-frequency statistics and understand context semantics poorly, so the recognition accuracy is low.
In view of the above, a new method and apparatus for recognizing keywords are needed to overcome the above-mentioned drawbacks.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying keywords, which are used for solving the problem of low accuracy when identifying keywords.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, an embodiment of the present application provides a method for identifying a keyword, including:
acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence, and extracting visual features from the visual sequence;
determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as keywords.
Optionally, extracting visual features from the visual sequence includes:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
Optionally, determining the first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, including:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic features and the key vectors in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
Optionally, determining the second context association degree of each visual feature by using the query vector of the visual feature and the context vector of the semantic feature, including:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
In a second aspect, an embodiment of the present application further provides an apparatus for recognizing a keyword, including:
the first feature extraction unit is used for acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence and extracting visual features from the visual sequence;
the second feature extraction unit is used for determining a first context association degree of each semantic feature by utilizing the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by utilizing the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and the identification unit is used for generating the confidence coefficient of each semantic feature by utilizing the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as the keywords.
Optionally, the first feature extraction unit is configured to:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
Optionally, the second feature extraction unit is configured to:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic feature and the key vector in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
Optionally, the second feature extraction unit is configured to:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
In a third aspect, an embodiment of the present application further provides a computing device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing any method for identifying the keywords according to the obtained program.
In a fourth aspect, an embodiment of the present application further provides a storage medium, which includes computer readable instructions, and when the computer readable instructions are read and executed by a computer, the computer is caused to execute any one of the above methods for identifying a keyword.
The beneficial effect of this application is as follows:
in the embodiment of the application, a text sequence and a visual sequence are obtained from live broadcast data, semantic features are extracted from the text sequence, and visual features are extracted from the visual sequence; a first context association degree of each semantic feature is determined by using the query vector of the semantic features and the context vector of the visual features, and a second context association degree of each visual feature is determined by using the query vector of the visual features and the context vector of the semantic features; finally, the confidence coefficient of each semantic feature is generated by using the first context association degree and the second context association degree, and the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value are output as the keywords. During encoding, the attention of the semantic features to the visual features and the attention of the visual features to the semantic features are both considered, the sequence information among the words is taken into account, and the visual features assist the model in predicting and identifying the keywords in the text sequence, so the recognition accuracy of the model is improved.
Drawings
Fig. 1 is a schematic structural diagram of a keyword recognition model according to an embodiment of the present application;
fig. 2 is a schematic flowchart of identifying a keyword according to an embodiment of the present application;
FIG. 3a is a schematic structural diagram of a common attention module according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram illustrating a structure of an attention unit of any one of the text information streams according to an embodiment of the present application;
FIG. 3c is a schematic diagram of a structure of an attention unit of any one of the visual information streams provided by the embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for recognizing keywords according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
In order to solve the problem of low recognition accuracy when identifying keywords, a new technical scheme is provided in the embodiment of the application. The scheme comprises the following steps: acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence, and extracting visual features from the visual sequence; determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; and finally, generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value as the keywords.
Preferred embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In the embodiment of the application, a text sequence and a visual sequence are obtained from live broadcast data, and keywords are then identified from the text sequence by a trained keyword recognition model that comprehensively considers the sequence information of the semantic features and the auxiliary information of the visual features. For ease of understanding, the overall structure of the model is described first with reference to the schematic architecture diagram of the keyword recognition model shown in fig. 1.
The keyword recognition model is divided into three parts, namely a text information stream, a visual information stream and a classification module. The text information stream comprises a language embedding (Linear Embedding) module for converting the text sequence into word vectors, a self-attention Transformer module for extracting features from the word vectors, and a common attention module for fusing the visual features and the semantic features; to further improve the accuracy, an additional Transformer module for feature extraction can be provided.
The visual information stream comprises a visual embedding (Visual Embedding) module for extracting features from the visual sequence and a common attention module for fusing the visual features and the semantic features; a Transformer module for feature extraction can likewise be added to further improve the recognition accuracy.
The classification module determines the confidence coefficient of each semantic feature according to the sequence information of the semantic features and the auxiliary information of the visual features, and outputs the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value as keywords.
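For illustration only, the three-part structure described above can be sketched in PyTorch-style code as follows; the class name, the dimensions, the use of standard nn modules and the way the two target feature maps are spliced before classification are assumptions for readability, not the application's reference implementation.

```python
import torch
import torch.nn as nn

class KeywordRecognitionModel(nn.Module):
    """Rough sketch: text information stream, visual information stream, classification module."""
    def __init__(self, vocab_size, d_model=768, n_heads=8, visual_dim=2048):
        super().__init__()
        # Text information stream: language embedding + Transformer + common attention
        self.language_embedding = nn.Embedding(vocab_size, d_model)
        self.text_transformer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_co_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Visual information stream: projected detector features + common attention
        self.visual_projection = nn.Linear(visual_dim, d_model)
        self.visual_co_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Classification module: one keyword-confidence score per position
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, token_ids, visual_features):
        w = self.text_transformer(self.language_embedding(token_ids))  # semantic features
        v = self.visual_projection(visual_features)                    # visual features
        # Each stream attends to the other stream's context (common attention)
        w_ctx, _ = self.text_co_attention(query=w, key=v, value=v)
        v_ctx, _ = self.visual_co_attention(query=v, key=w, value=w)
        # Splice the two target feature maps, score every position, keep the word positions
        fused = torch.cat([w_ctx, v_ctx], dim=1)
        scores = self.classifier(fused).squeeze(-1)
        return scores[:, : w.size(1)]                                  # one score per word
```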
When the keyword recognition model is trained, the acquired mass of live broadcast sample data is processed by means of Automatic Speech Recognition (ASR), Optical Character Recognition (OCR) and the like to generate a large number of text sequences and visual sequences. Specifically, for any live broadcast sample data X, the multimedia content of X is converted into ASR text, text information such as the live broadcast title and live broadcast labels is obtained from X by OCR and similar means, and the text information and the ASR text are spliced into a text sequence; images are extracted from the live broadcast content at a set frame interval to generate a visual sequence. Each time the keyword recognition model reads a visual sequence and the corresponding text sequence, it outputs predicted keywords; the parameters of the model are adjusted according to the difference between the predicted keywords and the actual keywords, and training is judged to be finished when a set number of iterations is reached or the difference no longer exceeds a set threshold value.
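For illustration only, the training procedure described above can be sketched as a minimal loop; the binary cross-entropy objective, the Adam optimiser and the hyper-parameter values are assumptions and are not specified by the application.

```python
import torch
import torch.nn as nn

def train_keyword_model(model, dataloader, max_iterations=100000, loss_threshold=1e-3, lr=1e-4):
    """Minimal training loop with the stopping rule described above
    (a set iteration count, or a loss that no longer exceeds a set threshold)."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    for token_ids, visual_features, keyword_labels in dataloader:
        scores = model(token_ids, visual_features)        # predicted keyword scores per word
        loss = criterion(scores, keyword_labels.float())  # difference from the actual keywords
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                  # adjust the model parameters
        step += 1
        if step >= max_iterations or loss.item() <= loss_threshold:
            break                                         # training judged finished
    return model
```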
Referring next to fig. 2, a trained keyword recognition model is used to recognize keywords from a text sequence.
S201: and acquiring a text sequence and a visual sequence from the live data.
Interference such as mis-recognized words and background noise may exist in the recognized text, so the text needs to be preprocessed to remove noise before it is input into the model, which facilitates the subsequent recognition work. The multimedia content generated by the live broadcast data is converted into ASR text by speech recognition, text information such as the live broadcast title and live broadcast labels is obtained from the live broadcast data by OCR and similar means, and the text information and the ASR text are spliced into a text sequence.
Images are extracted from the live broadcast content at a set frame interval to generate a visual sequence. For example, if the live content is a 100-frame video and one image is extracted every 10 frames, the 10 extracted images form the visual sequence.
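As a small illustration of this frame sampling (the helper name and the frames argument are assumptions):

```python
def sample_visual_sequence(frames, interval=10):
    """Take one image every `interval` frames to form the visual sequence."""
    return [frames[i] for i in range(0, len(frames), interval)]

# A 100-frame live clip with interval 10 yields the 10 images at frames 0, 10, ..., 90.
```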
S202: semantic features are extracted from the text sequence and visual features are extracted from the visual sequence.
Referring to fig. 1, the text sequence is input into the language embedding module, which converts each word in the text sequence into a corresponding word vector; in this embodiment, a word vector is a 1-row, n-column vector (an n-dimensional vector for short). The word vector matrix is then input into the Transformer module to determine the semantic features of the text sequence. The visual sequence is input into the trained visual embedding module, which marks the target objects contained in each image with detection frames; the marked target objects are the visual features in the embodiment of the application.
When multiple target objects are recognized in the same image, the positional relationship between them also affects the meaning the image expresses. For example, three target objects, a person, a guitar and a chair, are recognized in image 1, where the person sits on the chair holding the guitar; the meaning expressed by image 1 is that the person is sitting on the chair playing the guitar. The same three target objects are recognized in image 2, but there the person sits on the chair while the guitar leans against the chair on the ground; the meaning expressed by image 2 is simply that the person is sitting on the chair. Therefore, in order to better mark the visual features that have a greater influence on determining the keywords and to distinguish the visual features recognized in the same image, the model must be given the sequence information of each visual feature; that is, the module position-codes each visual feature, where one position code comprises the coordinate information and the spatial position information of one visual feature.
The specific operation is as follows: a plane rectangular coordinate system is established on the image with the upper left corner of the image as the origin, the horizontal rightward direction as the positive x axis and the vertical downward direction as the positive y axis, and the coordinate information of each detection frame on the image is then determined, for example by taking the coordinates of the top-left and bottom-right vertices of the detection frame; the spatial position information of each visual feature is determined according to a preset coding priority. For example, three target objects, namely a person, a guitar and a table, are detected in one image; according to the predetermined coding priority, the person has the highest priority and its spatial position information is set to 1, the spatial position information of the guitar is set to 0.8, and a stationary object such as the table has the lowest priority, with its spatial position information set to 0.5.
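For illustration only, the position code described above might be assembled as follows; the helper name position_code and the default priority are assumptions, while the priority values for person, guitar and table come from the example above.

```python
# Priority values taken from the example above; any unlisted class gets a default value.
SPATIAL_PRIORITY = {"person": 1.0, "guitar": 0.8, "table": 0.5}

def position_code(box, label, default_priority=0.5):
    """Build the position code for one detected visual feature.

    `box` is (x1, y1, x2, y2): the top-left and bottom-right vertices of the
    detection frame in the image coordinate system described above (origin at
    the top-left corner, x rightward, y downward).
    """
    x1, y1, x2, y2 = box
    spatial = SPATIAL_PRIORITY.get(label, default_priority)
    return [x1, y1, x2, y2, spatial]

# e.g. a person detected at (12, 30)-(88, 200) -> [12, 30, 88, 200, 1.0]
```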
S203: determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; the query vector of the semantic features and the context vector of the semantic features are generated based on the semantic feature set, and the query vector of the visual features and the context vector of the visual features are generated based on the visual feature set.
The extracted semantic feature set is input into the common attention module of the text information stream. As shown in fig. 3a, this module includes a multi-head attention unit and a feed-forward neural network. The multi-head attention unit labels the degree of attention each semantic feature pays to the input visual features (i.e., the first context association degree of each semantic feature): each attention head reads the query vector of the semantic features and the context vector of the visual features and outputs a feature map, where each row of a feature map represents the first context association degree of one semantic feature. To fuse the information labeled by the multiple heads, the feature maps are spliced into one set, which is added to the initially input semantic feature set and normalized to obtain the context feature map of the text sequence. The feed-forward neural network extracts features from the input context feature map, and the extracted feature map is added to the initially input context feature map and normalized to obtain the target feature map of the text sequence. To improve recognition accuracy, a Transformer module can be added to perform one or more further feature extraction operations on the feature map.
Similarly, referring to fig. 3a, the common attention module of the visual information stream also includes a multi-head attention unit and a feed-forward neural network. The multi-head attention unit labels the degree of attention each visual feature pays to the input semantic features (i.e., the second context association degree of each visual feature): each attention head reads the query vector of the visual features and the context vector of the semantic features and outputs a feature map, where each row of a feature map represents the second context association degree of one visual feature. To fuse the information labeled by the multiple heads, the feature maps are spliced into one set, which is added to the initially input visual feature set and normalized to obtain the context feature map of the visual sequence. The feed-forward neural network extracts features from the input context feature map, and the extracted feature map is added to the initially input context feature map and normalized to obtain the target feature map of the visual sequence. To improve recognition accuracy, a Transformer module can be added to perform one or more further feature extraction operations on the feature map.
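For illustration only, the common attention block just described (multi-head cross attention, add and normalize, feed-forward network, add and normalize) can be sketched as follows; the class name CoAttentionBlock, the dimensions and the use of standard PyTorch modules are assumptions, not the application's reference implementation. The same block serves both streams, with the roles of query and context swapped.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One common attention block: multi-head cross attention, add & normalize,
    feed-forward network, add & normalize."""
    def __init__(self, d_model=768, n_heads=8, d_ff=2048):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, own_features, other_features):
        # Queries come from this stream, keys and values from the other stream;
        # the spliced head outputs are added to the original features and normalized.
        attended, _ = self.cross_attention(query=own_features,
                                           key=other_features, value=other_features)
        context_map = self.norm1(own_features + attended)              # context feature map
        # Feed-forward extraction, again with a residual add and normalization.
        target_map = self.norm2(context_map + self.ffn(context_map))   # target feature map
        return target_map
```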
First, a process of generating a query vector and a context vector will be described.
Three matrices of the same size are provided in the common attention module of the text information stream; multiplying the input semantic feature set by each of the three matrices yields the query vector Qw, key vector Kw and value vector Vw of the semantic features, where Kw and Vw are collectively called the context vector of the semantic features in the embodiment of the application. Similarly, three matrices of the same size are provided in the common attention module of the visual information stream; multiplying the input visual feature set by each of the three matrices yields the query vector Qv, key vector Kv and value vector Vv of the visual features, where Kv and Vv are collectively called the context vector of the visual features.
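As an illustration, the three projection matrices described above might be realised as follows; the class name QKVProjection and the use of bias-free linear layers are assumptions.

```python
import torch.nn as nn

class QKVProjection(nn.Module):
    """Three equally sized projection matrices; multiplying the input feature set by them
    yields the query vector Q, key vector K and value vector V (K and V together form the
    context vector). One instance is used per information stream."""
    def __init__(self, d_model=768):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, features):
        return self.W_q(features), self.W_k(features), self.W_v(features)

# text stream:   Qw, Kw, Vw = text_projection(semantic_features)
# visual stream: Qv, Kv, Vv = visual_projection(visual_features)
```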
Next, the internal structure and the use of the attention unit of any one head will be described.
Referring to fig. 3b, the attention unit of the text information stream is divided into four layers. The first layer takes as input the query vector Qw of the semantic features and the key vector Kv and value vector Vv of the visual features. The second layer applies a Softmax function to map the first attention score of each semantic feature, which is obtained from Qw and Kv, into the interval (0, 1). In the third layer, a dot-multiplication is performed between the first attention score of each semantic feature and Vv to generate the second attention score of each semantic feature. The second attention score is then input to the fourth layer and multiplied by an attention coefficient w to obtain the third attention score of each semantic feature, i.e., the first context association degree of each semantic feature is determined. Scoring the input semantic features with a multi-head attention mechanism in the attention unit of the text information stream gives important words in the text sequence a higher weight, so the feed-forward neural network can better learn the hidden-layer features of the text sequence and the recognition accuracy of the model is improved.
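For illustration, one attention head of the text information stream, following the four steps above, might look like this; the function name and tensor shapes are assumptions, and the 1/sqrt(d) scaling of standard attention is omitted because the description does not mention it.

```python
import torch.nn.functional as F

def text_stream_attention_head(Qw, Kv, Vv, attention_coefficient):
    """One attention head of the text information stream (shapes: Qw [Lw, d], Kv/Vv [Lv, d])."""
    first_scores = Qw @ Kv.transpose(-2, -1)                 # first attention score from Qw and Kv
    first_scores = F.softmax(first_scores, dim=-1)           # mapped into (0, 1) by Softmax
    second_scores = first_scores @ Vv                        # dot-multiplication with Vv
    third_scores = attention_coefficient * second_scores     # multiply by the coefficient w
    return third_scores                                      # first context association degree

# The visual-stream head is symmetric: swap the roles (Qv, Kw, Vw, coefficient w').
```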
Referring to fig. 3c, the attention unit of the visual information stream is likewise divided into four layers. The first layer takes as input the query vector Qv of the visual features and the key vector Kw and value vector Vw of the semantic features. The second layer applies a Softmax function to map the fourth attention score of each visual feature, which is obtained from Qv and Kw, into the interval (0, 1). In the third layer, a dot-multiplication is performed between the fourth attention score of each visual feature and Vw to generate the fifth attention score of each visual feature. The fifth attention score is then input to the fourth layer and multiplied by an attention coefficient w' to obtain the sixth attention score of each visual feature, i.e., the second context association degree of each visual feature is determined. Scoring the input visual features with a multi-head attention mechanism in the attention unit of the visual information stream gives important target objects in the visual sequence a higher weight, so the feed-forward neural network can better learn the hidden-layer features of the visual sequence, the model is assisted in predicting and identifying the keywords in the text sequence, and the recognition accuracy of the model is improved.
S204: and generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as the keywords.
The target feature map of the text sequence output by the text information stream is spliced with the target feature map of the visual sequence output by the visual information stream to obtain a new feature map. The new feature map is input into a classification module such as a Softmax (cross-entropy) classifier or Conditional Random Fields (CRF) to determine the confidence coefficient of each semantic feature, and the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value are output as keywords.
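For illustration only, a minimal sketch of this classification step is given below, assuming per-position feature vectors and a two-class linear classifier; the helper name extract_keywords, the pooling of the visual target feature map and the threshold value are illustrative assumptions, and a CRF layer could stand in place of the Softmax classifier as noted above.

```python
import torch
import torch.nn as nn

def extract_keywords(text_target_map, visual_target_map, words, classifier, threshold=0.5):
    """Splice the two target feature maps, score every word position and return the
    words whose confidence exceeds the set threshold."""
    # Splice the (pooled) visual target feature map onto every word position of the text map.
    pooled_visual = visual_target_map.mean(dim=0, keepdim=True).expand(text_target_map.size(0), -1)
    fused = torch.cat([text_target_map, pooled_visual], dim=-1)
    # Confidence of each semantic feature being a keyword (class index 1).
    confidences = torch.softmax(classifier(fused), dim=-1)[:, 1]
    return [word for word, c in zip(words, confidences.tolist()) if c > threshold]

# classifier = nn.Linear(2 * d_model, 2)   # assumed two-class (keyword / non-keyword) head
```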
Based on the same inventive concept, in the embodiment of the present invention, an apparatus for recognizing a keyword is provided, as shown in fig. 4, and includes at least a first feature extraction unit 401, a second feature extraction unit 402, and a recognition unit 403, wherein,
a first feature extraction unit 401, configured to obtain a text sequence and a visual sequence from live broadcast data, extract semantic features from the text sequence, and extract visual features from the visual sequence;
a second feature extraction unit 402, configured to determine a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determine a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
the identifying unit 403 is configured to generate a confidence level of each semantic feature by using the first contextual relevance degree and the second contextual relevance degree, and output a word corresponding to the semantic feature whose confidence level is higher than a set threshold as a keyword.
Optionally, the first feature extraction unit 401 is configured to:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
Optionally, the second feature extraction unit 402 is configured to:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic feature and the key vector in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
Optionally, the second feature extraction unit 402 is configured to:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
Based on the same inventive concept, in the embodiment of the present invention, a computing device is provided, as shown in fig. 5, which at least includes a memory 501 and at least one processor 502, where the memory 501 and the processor 502 communicate with each other through a communication bus;
the memory 501 is used for storing program instructions;
the processor 502 is configured to call the program instructions stored in the memory 501, and execute the aforementioned method for identifying keywords according to the obtained program.
Based on the same inventive concept, an embodiment of the present invention provides a storage medium that at least includes computer readable instructions which, when read and executed by a computer, cause the computer to execute the aforementioned method for identifying keywords.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A method for identifying keywords, comprising:
acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence, and extracting visual features from the visual sequence;
determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as keywords.
2. The method of claim 1, wherein extracting visual features from the visual sequence comprises:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
3. The method of claim 1, wherein determining a first degree of contextual relevance for each semantic feature using a query vector of semantic features and a context vector of visual features comprises:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic feature and the key vector in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
4. The method of claim 1, wherein determining a second degree of contextual relevance for each visual feature using the query vector of visual features and the context vector of semantic features comprises:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
5. An apparatus for recognizing a keyword, comprising:
the first feature extraction unit is used for acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence and extracting visual features from the visual sequence;
the second feature extraction unit is used for determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and the identification unit is used for generating the confidence coefficient of each semantic feature by utilizing the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as the keywords.
6. The apparatus of claim 5, wherein the first feature extraction unit is to:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
7. The apparatus of claim 5, wherein the second feature extraction unit is to:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic features and the key vectors in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
8. The apparatus of claim 5, wherein the second feature extraction unit is to:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 4 in accordance with the obtained program.
10. A storage medium comprising computer readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 4.
CN202011267216.6A 2020-11-13 2020-11-13 Method and device for identifying keywords Pending CN114491104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011267216.6A CN114491104A (en) 2020-11-13 2020-11-13 Method and device for identifying keywords


Publications (1)

Publication Number Publication Date
CN114491104A true CN114491104A (en) 2022-05-13

Family

ID=81490202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011267216.6A Pending CN114491104A (en) 2020-11-13 2020-11-13 Method and device for identifying keywords

Country Status (1)

Country Link
CN (1) CN114491104A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638222A (en) * 2022-05-17 2022-06-17 天津卓朗科技发展有限公司 Natural disaster data classification method and model training method and device thereof
CN115809665A (en) * 2022-12-13 2023-03-17 杭州电子科技大学 Unsupervised keyword extraction method based on bidirectional multi-granularity attention mechanism
CN115809665B (en) * 2022-12-13 2023-07-11 杭州电子科技大学 Unsupervised keyword extraction method based on bidirectional multi-granularity attention mechanism


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination