CN114491104A - Method and device for identifying keywords

Publication number: CN114491104A
Application number: CN202011267216.6A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 金志威
Prior art keywords: visual, feature, features, semantic, vector
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed 2020-11-13 by Beijing Dajia Internet Information Technology Co Ltd
Publication date: 2022-05-13

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483: Retrieval characterised by using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Abstract

The application relates to the technical field of text mining and discloses a method and a device for identifying keywords, which are used for solving the problem of low accuracy in keyword extraction. The method comprises the following steps: extracting semantic features from the text sequence and visual features from the visual sequence; determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; and finally, generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value as the keywords. Because the sequence information among the words is considered during encoding and the visual features are used to assist the model, the recognition accuracy of the model is improved.

Description

Method and device for identifying keywords
Technical Field
The invention relates to the technical field of text mining, in particular to a method and a device for identifying keywords.
Background
Network live broadcasting is an audio-visual entertainment form that has emerged with the development of mobile terminals in recent years. It offers high interactivity and broad reach, more and more users choose to share information through live broadcasts, and live broadcast platforms have gradually grown into a production channel for multimedia information. Quickly and effectively identifying the keywords in massive multimedia live broadcast data is therefore the key to achieving semantic interoperability of multimedia data and fast retrieval of multimedia information.
At present, keyword identification mainly takes text as input and identifies keywords by using word statistics in the text or by using a graph model. However, such recognition methods rely mainly on static word-frequency statistics and understand context semantics poorly, so the recognition accuracy is low.
In view of the above, a new method and apparatus for recognizing keywords are needed to overcome the above-mentioned drawbacks.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying keywords, which are used for solving the problem of low accuracy when identifying keywords.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, an embodiment of the present application provides a method for identifying a keyword, including:
acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence, and extracting visual features from the visual sequence;
determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as keywords.
Optionally, extracting visual features from the visual sequence includes:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
Optionally, determining the first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, including:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic features and the key vectors in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
Optionally, determining the second context association degree of each visual feature by using the query vector of the visual feature and the context vector of the semantic feature, including:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
In a second aspect, an embodiment of the present application further provides an apparatus for recognizing a keyword, including:
the first feature extraction unit is used for acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence and extracting visual features from the visual sequence;
the second feature extraction unit is used for determining a first context association degree of each semantic feature by utilizing the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by utilizing the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and the identification unit is used for generating the confidence coefficient of each semantic feature by utilizing the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as the keywords.
Optionally, the first feature extraction unit is configured to:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
Optionally, the second feature extraction unit is configured to:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic feature and the key vector in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
Optionally, the second feature extraction unit is configured to:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
In a third aspect, an embodiment of the present application further provides a computing device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing any method for identifying the keywords according to the obtained program.
In a fourth aspect, an embodiment of the present application further provides a storage medium, which includes computer readable instructions, and when the computer readable instructions are read and executed by a computer, the computer is caused to execute any one of the above methods for identifying a keyword.
The beneficial effect of this application is as follows:
in the embodiment of the application, a text sequence and a visual sequence are obtained from live broadcast data, semantic features are extracted from the text sequence, and visual features are extracted from the visual sequence; a first context association degree of each semantic feature is determined by using the query vector of the semantic features and the context vector of the visual features, and a second context association degree of each visual feature is determined by using the query vector of the visual features and the context vector of the semantic features; finally, the confidence coefficient of each semantic feature is generated by using the first context association degree and the second context association degree, and the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value are output as the keywords. During encoding, the attention of the semantic features to the visual features and the attention of the visual features to the semantic features are both considered, the sequence information among the words is taken into account, and the visual features assist the model in predicting and identifying the keywords in the text sequence, so the recognition accuracy of the model is improved.
Drawings
Fig. 1 is a schematic structural diagram of a keyword recognition model according to an embodiment of the present application;
fig. 2 is a schematic flowchart of identifying a keyword according to an embodiment of the present application;
FIG. 3a is a schematic structural diagram of a common attention module according to an embodiment of the present disclosure;
FIG. 3b is a schematic diagram illustrating a structure of an attention unit of any one of the text information streams according to an embodiment of the present application;
FIG. 3c is a schematic diagram of a structure of an attention unit of any one of the visual information streams provided by the embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for recognizing keywords according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
In order to solve the problem of low recognition accuracy when identifying keywords, a new technical scheme is provided in the embodiment of the application. The scheme comprises the following steps: acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence, and extracting visual features from the visual sequence; determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; and finally, generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value as the keywords.
Preferred embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In the embodiment of the application, a text sequence and a visual sequence are obtained from live broadcast data, and keywords are then identified from the text sequence by a trained keyword recognition model that comprehensively considers the sequence information of the semantic features and the auxiliary information of the visual features. For ease of understanding, the overall structure of the model is described first with reference to the schematic architecture diagram of the keyword recognition model shown in fig. 1.
The keyword recognition model is divided into three parts, namely a text information stream, a visual information stream and a classification module. The text information stream comprises a language embedding (Linear Embedding) module for converting the text sequence into word vectors, a self-attention Transformer module for extracting features from the word vectors, and a common attention module for fusing the visual features and the semantic features; to further improve the accuracy, an additional Transformer module for feature extraction can be provided.
The visual information stream comprises a visual embedding (Visual Embedding) module for extracting features from the visual sequence and a common attention module for fusing the visual features and the semantic features; a Transformer module for feature extraction can likewise be added to further improve the recognition accuracy.
The classification module determines the confidence coefficient of each semantic feature according to the sequence information of the semantic features and the auxiliary information of the visual features, and outputs the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value as keywords.
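For illustration only, the three-part structure described above can be sketched in PyTorch-style code as follows; the class name, the dimensions, the use of standard nn modules and the way the two target feature maps are spliced before classification are assumptions for readability, not the application's reference implementation.

```python
import torch
import torch.nn as nn

class KeywordRecognitionModel(nn.Module):
    """Rough sketch: text information stream, visual information stream, classification module."""
    def __init__(self, vocab_size, d_model=768, n_heads=8, visual_dim=2048):
        super().__init__()
        # Text information stream: language embedding + Transformer + common attention
        self.language_embedding = nn.Embedding(vocab_size, d_model)
        self.text_transformer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_co_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Visual information stream: projected detector features + common attention
        self.visual_projection = nn.Linear(visual_dim, d_model)
        self.visual_co_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Classification module: one keyword-confidence score per position
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, token_ids, visual_features):
        w = self.text_transformer(self.language_embedding(token_ids))  # semantic features
        v = self.visual_projection(visual_features)                    # visual features
        # Each stream attends to the other stream's context (common attention)
        w_ctx, _ = self.text_co_attention(query=w, key=v, value=v)
        v_ctx, _ = self.visual_co_attention(query=v, key=w, value=w)
        # Splice the two target feature maps, score every position, keep the word positions
        fused = torch.cat([w_ctx, v_ctx], dim=1)
        scores = self.classifier(fused).squeeze(-1)
        return scores[:, : w.size(1)]                                  # one score per word
```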
When the keyword recognition model is trained, the acquired mass of live broadcast sample data is processed by means of Automatic Speech Recognition (ASR), Optical Character Recognition (OCR) and the like to generate a large number of text sequences and visual sequences. Specifically, for any live broadcast sample data X, the multimedia content of X is converted into ASR text, text information such as the live broadcast title and live broadcast labels is obtained from X by OCR and similar means, and the text information and the ASR text are spliced into a text sequence; images are extracted from the live broadcast content at a set frame interval to generate a visual sequence. Each time the keyword recognition model reads a visual sequence and the corresponding text sequence, it outputs predicted keywords; the parameters of the model are adjusted according to the difference between the predicted keywords and the actual keywords, and training is judged to be finished when a set number of iterations is reached or the difference no longer exceeds a set threshold value.
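For illustration only, the training procedure described above can be sketched as a minimal loop; the binary cross-entropy objective, the Adam optimiser and the hyper-parameter values are assumptions and are not specified by the application.

```python
import torch
import torch.nn as nn

def train_keyword_model(model, dataloader, max_iterations=100000, loss_threshold=1e-3, lr=1e-4):
    """Minimal training loop with the stopping rule described above
    (a set iteration count, or a loss that no longer exceeds a set threshold)."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    for token_ids, visual_features, keyword_labels in dataloader:
        scores = model(token_ids, visual_features)        # predicted keyword scores per word
        loss = criterion(scores, keyword_labels.float())  # difference from the actual keywords
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                  # adjust the model parameters
        step += 1
        if step >= max_iterations or loss.item() <= loss_threshold:
            break                                         # training judged finished
    return model
```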
Referring next to fig. 2, a trained keyword recognition model is used to recognize keywords from a text sequence.
S201: and acquiring a text sequence and a visual sequence from the live data.
Interference such as mis-recognized words and background noise may exist in the recognized text, so the text needs to be preprocessed to remove noise before it is input into the model, which facilitates the subsequent recognition work. The multimedia content generated by the live broadcast data is converted into ASR text by speech recognition, text information such as the live broadcast title and live broadcast labels is obtained from the live broadcast data by OCR and similar means, and the text information and the ASR text are spliced into a text sequence.
Images are extracted from the live broadcast content at a set frame interval to generate a visual sequence. For example, if the live content is a 100-frame video and one image is extracted every 10 frames, the 10 extracted images form the visual sequence.
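As a small illustration of this frame sampling (the helper name and the frames argument are assumptions):

```python
def sample_visual_sequence(frames, interval=10):
    """Take one image every `interval` frames to form the visual sequence."""
    return [frames[i] for i in range(0, len(frames), interval)]

# A 100-frame live clip with interval 10 yields the 10 images at frames 0, 10, ..., 90.
```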
S202: semantic features are extracted from the text sequence and visual features are extracted from the visual sequence.
Referring to fig. 1, the text sequence is input into the language embedding module, which converts each word in the text sequence into a corresponding word vector; in this embodiment, a word vector is a 1-row, n-column vector (an n-dimensional vector for short). The word vector matrix is then input into the Transformer module to determine the semantic features of the text sequence. The visual sequence is input into the trained visual embedding module, which marks the target objects contained in each image with detection frames; the marked target objects are the visual features in the embodiment of the application.
When multiple target objects are recognized in the same image, the positional relationship between them also affects the meaning the image expresses. For example, three target objects, a person, a guitar and a chair, are recognized in image 1, where the person sits on the chair holding the guitar; the meaning expressed by image 1 is that the person is sitting on the chair playing the guitar. The same three target objects are recognized in image 2, but there the person sits on the chair while the guitar leans against the chair on the ground; the meaning expressed by image 2 is simply that the person is sitting on the chair. Therefore, in order to better mark the visual features that have a greater influence on determining the keywords and to distinguish the visual features recognized in the same image, the model must be given the sequence information of each visual feature; that is, the module position-codes each visual feature, where one position code comprises the coordinate information and the spatial position information of one visual feature.
The specific operation is as follows: a plane rectangular coordinate system is established on the image with the upper left corner of the image as the origin, the horizontal rightward direction as the positive x axis and the vertical downward direction as the positive y axis, and the coordinate information of each detection frame on the image is then determined, for example by taking the coordinates of the top-left and bottom-right vertices of the detection frame; the spatial position information of each visual feature is determined according to a preset coding priority. For example, three target objects, namely a person, a guitar and a table, are detected in one image; according to the predetermined coding priority, the person has the highest priority and its spatial position information is set to 1, the spatial position information of the guitar is set to 0.8, and a stationary object such as the table has the lowest priority, with its spatial position information set to 0.5.
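For illustration only, the position code described above might be assembled as follows; the helper name position_code and the default priority are assumptions, while the priority values for person, guitar and table come from the example above.

```python
# Priority values taken from the example above; any unlisted class gets a default value.
SPATIAL_PRIORITY = {"person": 1.0, "guitar": 0.8, "table": 0.5}

def position_code(box, label, default_priority=0.5):
    """Build the position code for one detected visual feature.

    `box` is (x1, y1, x2, y2): the top-left and bottom-right vertices of the
    detection frame in the image coordinate system described above (origin at
    the top-left corner, x rightward, y downward).
    """
    x1, y1, x2, y2 = box
    spatial = SPATIAL_PRIORITY.get(label, default_priority)
    return [x1, y1, x2, y2, spatial]

# e.g. a person detected at (12, 30)-(88, 200) -> [12, 30, 88, 200, 1.0]
```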
S203: determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; the query vector of the semantic features and the context vector of the semantic features are generated based on the semantic feature set, and the query vector of the visual features and the context vector of the visual features are generated based on the visual feature set.
The extracted semantic feature set is input into the common attention module of the text information stream. As shown in fig. 3a, this module includes a multi-head attention unit and a feed-forward neural network. The multi-head attention unit labels the degree of attention each semantic feature pays to the input visual features (i.e., the first context association degree of each semantic feature): each attention head reads the query vector of the semantic features and the context vector of the visual features and outputs a feature map, where each row of a feature map represents the first context association degree of one semantic feature. To fuse the information labeled by the multiple heads, the feature maps are spliced into one set, which is added to the initially input semantic feature set and normalized to obtain the context feature map of the text sequence. The feed-forward neural network extracts features from the input context feature map, and the extracted feature map is added to the initially input context feature map and normalized to obtain the target feature map of the text sequence. To improve recognition accuracy, a Transformer module can be added to perform one or more further feature extraction operations on the feature map.
Similarly, referring to fig. 3a, the common attention module of the visual information stream also includes a multi-head attention unit and a feed-forward neural network. The multi-head attention unit labels the degree of attention each visual feature pays to the input semantic features (i.e., the second context association degree of each visual feature): each attention head reads the query vector of the visual features and the context vector of the semantic features and outputs a feature map, where each row of a feature map represents the second context association degree of one visual feature. To fuse the information labeled by the multiple heads, the feature maps are spliced into one set, which is added to the initially input visual feature set and normalized to obtain the context feature map of the visual sequence. The feed-forward neural network extracts features from the input context feature map, and the extracted feature map is added to the initially input context feature map and normalized to obtain the target feature map of the visual sequence. To improve recognition accuracy, a Transformer module can be added to perform one or more further feature extraction operations on the feature map.
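For illustration only, the common attention block just described (multi-head cross attention, add and normalize, feed-forward network, add and normalize) can be sketched as follows; the class name CoAttentionBlock, the dimensions and the use of standard PyTorch modules are assumptions, not the application's reference implementation. The same block serves both streams, with the roles of query and context swapped.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One common attention block: multi-head cross attention, add & normalize,
    feed-forward network, add & normalize."""
    def __init__(self, d_model=768, n_heads=8, d_ff=2048):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, own_features, other_features):
        # Queries come from this stream, keys and values from the other stream;
        # the spliced head outputs are added to the original features and normalized.
        attended, _ = self.cross_attention(query=own_features,
                                           key=other_features, value=other_features)
        context_map = self.norm1(own_features + attended)              # context feature map
        # Feed-forward extraction, again with a residual add and normalization.
        target_map = self.norm2(context_map + self.ffn(context_map))   # target feature map
        return target_map
```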
First, a process of generating a query vector and a context vector will be described.
Three matrices of the same size are provided in the common attention module of the text information stream; multiplying the input semantic feature set by each of the three matrices yields the query vector Qw, key vector Kw and value vector Vw of the semantic features, where Kw and Vw are collectively called the context vector of the semantic features in the embodiment of the application. Similarly, three matrices of the same size are provided in the common attention module of the visual information stream; multiplying the input visual feature set by each of the three matrices yields the query vector Qv, key vector Kv and value vector Vv of the visual features, where Kv and Vv are collectively called the context vector of the visual features.
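As an illustration, the three projection matrices described above might be realised as follows; the class name QKVProjection and the use of bias-free linear layers are assumptions.

```python
import torch.nn as nn

class QKVProjection(nn.Module):
    """Three equally sized projection matrices; multiplying the input feature set by them
    yields the query vector Q, key vector K and value vector V (K and V together form the
    context vector). One instance is used per information stream."""
    def __init__(self, d_model=768):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, features):
        return self.W_q(features), self.W_k(features), self.W_v(features)

# text stream:   Qw, Kw, Vw = text_projection(semantic_features)
# visual stream: Qv, Kv, Vv = visual_projection(visual_features)
```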
Next, the internal structure and the use of the attention unit of any one head will be described.
Referring to fig. 3b, the attention unit of the text information stream is divided into four layers. The first layer takes as input the query vector Qw of the semantic features and the key vector Kv and value vector Vv of the visual features. The second layer applies a Softmax function to map the first attention score of each semantic feature, which is obtained from Qw and Kv, into the interval (0, 1). In the third layer, a dot-multiplication is performed between the first attention score of each semantic feature and Vv to generate the second attention score of each semantic feature. The second attention score is then input to the fourth layer and multiplied by an attention coefficient w to obtain the third attention score of each semantic feature, i.e., the first context association degree of each semantic feature is determined. Scoring the input semantic features with a multi-head attention mechanism in the attention unit of the text information stream gives important words in the text sequence a higher weight, so the feed-forward neural network can better learn the hidden-layer features of the text sequence and the recognition accuracy of the model is improved.
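For illustration, one attention head of the text information stream, following the four steps above, might look like this; the function name and tensor shapes are assumptions, and the 1/sqrt(d) scaling of standard attention is omitted because the description does not mention it.

```python
import torch.nn.functional as F

def text_stream_attention_head(Qw, Kv, Vv, attention_coefficient):
    """One attention head of the text information stream (shapes: Qw [Lw, d], Kv/Vv [Lv, d])."""
    first_scores = Qw @ Kv.transpose(-2, -1)                 # first attention score from Qw and Kv
    first_scores = F.softmax(first_scores, dim=-1)           # mapped into (0, 1) by Softmax
    second_scores = first_scores @ Vv                        # dot-multiplication with Vv
    third_scores = attention_coefficient * second_scores     # multiply by the coefficient w
    return third_scores                                      # first context association degree

# The visual-stream head is symmetric: swap the roles (Qv, Kw, Vw, coefficient w').
```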
Referring to fig. 3c, the attention unit of the visual information stream is likewise divided into four layers. The first layer takes as input the query vector Qv of the visual features and the key vector Kw and value vector Vw of the semantic features. The second layer applies a Softmax function to map the fourth attention score of each visual feature, which is obtained from Qv and Kw, into the interval (0, 1). In the third layer, a dot-multiplication is performed between the fourth attention score of each visual feature and Vw to generate the fifth attention score of each visual feature. The fifth attention score is then input to the fourth layer and multiplied by an attention coefficient w' to obtain the sixth attention score of each visual feature, i.e., the second context association degree of each visual feature is determined. Scoring the input visual features with a multi-head attention mechanism in the attention unit of the visual information stream gives important target objects in the visual sequence a higher weight, so the feed-forward neural network can better learn the hidden-layer features of the visual sequence, the model is assisted in predicting and identifying the keywords in the text sequence, and the recognition accuracy of the model is improved.
S204: and generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as the keywords.
The target feature map of the text sequence output by the text information stream is spliced with the target feature map of the visual sequence output by the visual information stream to obtain a new feature map. The new feature map is input into a classification module such as a Softmax (cross-entropy) classifier or Conditional Random Fields (CRF) to determine the confidence coefficient of each semantic feature, and the words corresponding to the semantic features whose confidence coefficient is higher than a set threshold value are output as keywords.
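For illustration only, a minimal sketch of this classification step is given below, assuming per-position feature vectors and a two-class linear classifier; the helper name extract_keywords, the pooling of the visual target feature map and the threshold value are illustrative assumptions, and a CRF layer could stand in place of the Softmax classifier as noted above.

```python
import torch
import torch.nn as nn

def extract_keywords(text_target_map, visual_target_map, words, classifier, threshold=0.5):
    """Splice the two target feature maps, score every word position and return the
    words whose confidence exceeds the set threshold."""
    # Splice the (pooled) visual target feature map onto every word position of the text map.
    pooled_visual = visual_target_map.mean(dim=0, keepdim=True).expand(text_target_map.size(0), -1)
    fused = torch.cat([text_target_map, pooled_visual], dim=-1)
    # Confidence of each semantic feature being a keyword (class index 1).
    confidences = torch.softmax(classifier(fused), dim=-1)[:, 1]
    return [word for word, c in zip(words, confidences.tolist()) if c > threshold]

# classifier = nn.Linear(2 * d_model, 2)   # assumed two-class (keyword / non-keyword) head
```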
Based on the same inventive concept, in the embodiment of the present invention, an apparatus for recognizing a keyword is provided, as shown in fig. 4, and includes at least a first feature extraction unit 401, a second feature extraction unit 402, and a recognition unit 403, wherein,
a first feature extraction unit 401, configured to obtain a text sequence and a visual sequence from live broadcast data, extract semantic features from the text sequence, and extract visual features from the visual sequence;
a second feature extraction unit 402, configured to determine a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determine a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
the identifying unit 403 is configured to generate a confidence level of each semantic feature by using the first contextual relevance degree and the second contextual relevance degree, and output a word corresponding to the semantic feature whose confidence level is higher than a set threshold as a keyword.
Optionally, the first feature extraction unit 401 is configured to:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
Optionally, the second feature extraction unit 402 is configured to:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic feature and the key vector in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
Optionally, the second feature extraction unit 402 is configured to:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
Based on the same inventive concept, in the embodiment of the present invention, a computing device is provided, as shown in fig. 5, which at least includes a memory 501 and at least one processor 502, where the memory 501 and the processor 502 communicate with each other through a communication bus;
the memory 501 is used for storing program instructions;
the processor 502 is configured to call the program instructions stored in the memory 501, and execute the aforementioned method for identifying keywords according to the obtained program.
Based on the same inventive concept, an embodiment of the present invention provides a storage medium that at least includes computer readable instructions which, when read and executed by a computer, cause the computer to execute the aforementioned method for identifying keywords.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A method for identifying keywords, comprising:
acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence, and extracting visual features from the visual sequence;
determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and generating the confidence coefficient of each semantic feature by using the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as keywords.
2. The method of claim 1, wherein extracting visual features from the visual sequence comprises:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
3. The method of claim 1, wherein determining a first degree of contextual relevance for each semantic feature using a query vector of semantic features and a context vector of visual features comprises:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic feature and the key vector in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
4. The method of claim 1, wherein determining a second degree of contextual relevance for each visual feature using the query vector of visual features and the context vector of semantic features comprises:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
5. An apparatus for recognizing a keyword, comprising:
the first feature extraction unit is used for acquiring a text sequence and a visual sequence from live broadcast data, extracting semantic features from the text sequence and extracting visual features from the visual sequence;
the second feature extraction unit is used for determining a first context association degree of each semantic feature by using the query vector of the semantic features and the context vector of the visual features, and determining a second context association degree of each visual feature by using the query vector of the visual features and the context vector of the semantic features; wherein the query vector of semantic features and the context vector of semantic features are generated based on a semantic feature set, and the query vector of visual features and the context vector of visual features are generated based on a visual feature set;
and the identification unit is used for generating the confidence coefficient of each semantic feature by utilizing the first context association degree and the second context association degree, and outputting the words corresponding to the semantic features with the confidence coefficient higher than a set threshold value as the keywords.
6. The apparatus of claim 5, wherein the first feature extraction unit is to:
and extracting the visual features on each image by using a trained visual embedding module, and carrying out position coding on each visual feature on each image, wherein the position coding information comprises coordinate information of a detection frame of each visual feature and space position information of the visual feature.
7. The apparatus of claim 5, wherein the second feature extraction unit is to:
obtaining a first attention score for each of the semantic features based on the query vector for the semantic features and the key vectors in the visual feature context vector;
obtaining a second attention score of each semantic feature based on the first attention score set and a value vector in the visual feature context vector;
obtaining a third attention score of each semantic feature based on the second attention score set and a preset first attention coefficient;
determining a first degree of contextual relevance of each of the semantic features according to the third attention score.
8. The apparatus of claim 5, wherein the second feature extraction unit is to:
obtaining a fourth attention score for each of the visual features based on the query vector for the visual feature and the key vector in the context vector for the semantic feature;
obtaining a fifth attention score of each visual feature based on a fourth attention score set and a value vector in the context vector of the semantic features;
obtaining a sixth attention score of each visual feature based on the fifth attention score set and a preset second attention coefficient;
determining a second degree of contextual relevance for each of the visual features based on the sixth attention score.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 4 in accordance with the obtained program.
10. A storage medium comprising computer readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 4.
CN202011267216.6A 2020-11-13 2020-11-13 Method and device for identifying keywords Pending CN114491104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011267216.6A CN114491104A (en) 2020-11-13 2020-11-13 Method and device for identifying keywords


Publications (1)

Publication Number Publication Date
CN114491104A true CN114491104A (en) 2022-05-13

Family

ID=81490202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011267216.6A Pending CN114491104A (en) 2020-11-13 2020-11-13 Method and device for identifying keywords

Country Status (1)

Country Link
CN (1) CN114491104A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638222A (en) * 2022-05-17 2022-06-17 天津卓朗科技发展有限公司 Natural disaster data classification method and model training method and device thereof
CN115809665A (en) * 2022-12-13 2023-03-17 杭州电子科技大学 Unsupervised keyword extraction method based on bidirectional multi-granularity attention mechanism
CN115809665B (en) * 2022-12-13 2023-07-11 杭州电子科技大学 Unsupervised keyword extraction method based on bidirectional multi-granularity attention mechanism


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination