WO2023246264A1 - Information recognition method based on attention module, and related apparatus - Google Patents

Information recognition method based on attention module, and related apparatus

Info

Publication number
WO2023246264A1
WO2023246264A1 (application PCT/CN2023/089375, CN2023089375W)
Authority
WO
WIPO (PCT)
Prior art keywords
layer
attention
parameter
attention module
shared
Prior art date
Application number
PCT/CN2023/089375
Other languages
English (en)
French (fr)
Inventor
汤志远
黄申
商世东
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2023246264A1 publication Critical patent/WO2023246264A1/zh

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval using metadata automatically derived from the content

Definitions

  • the present application relates to the field of computers, and specifically to information recognition based on attention modules.
  • Embodiments of the present application provide an information recognition method and apparatus based on an attention module, a storage medium, and an electronic device, to at least solve the problem in the related art that accelerating the calculation process of an attention-based recognition model causes a large performance loss in the recognition model.
  • an information recognition method based on an attention module, including: obtaining target media resource features of a target media resource, and inputting the target media resource features into a target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2; processing the target media resource features through the N layers of attention modules to obtain a target representation vector, where the i-th layer attention module in the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine, according to the i-th layer attention weight parameter and the i-th layer input representation vector, the i-th layer representation vector output by the i-th layer attention module, with 1 ≤ i ≤ N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module;
  • when i is equal to N, the i-th layer representation vector is used to determine the target representation vector; at least 2 layers of attention modules among the N layers share the set of shared parameters, and the at least 2 layers include the i-th layer attention module; and determining a target information recognition result according to the target representation vector, where the target information recognition result is used to represent the target information recognized from the target media resource.
  • an information recognition device based on an attention module, including: an acquisition module, configured to acquire the target media resource features of the target media resource and input the target media resource features into the target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2; a processing module, configured to process the target media resource features through the N layers of attention modules to obtain the target representation vector,
  • where the i-th layer attention module in the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector based on a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module according to the i-th layer attention weight parameter and the i-th layer input representation vector, with 1 ≤ i ≤ N; when i is less than N,
  • the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module;
  • when i is equal to N, the i-th layer representation vector is used to determine the target representation vector, at least 2 layers of attention modules among the N layers share the set of shared parameters, and the at least 2 layers include the i-th layer attention module; and a determination module, configured to determine a target information recognition result according to the target representation vector, where the target information recognition result is used to represent the target information recognized from the target media resource.
  • a computer-readable storage medium stores a computer program, where the computer program is configured to execute the above information recognition method based on the attention module.
  • a computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device performs the above information recognition method based on the attention module.
  • an electronic device including a memory and a processor.
  • a computer program is stored in the memory, and the processor is configured to execute, through the computer program, the above-mentioned information recognition method based on the attention module.
  • In the embodiments of the present application, the target media resource features of the target media resource are obtained and input into the target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2, and the target media resource features are processed through the N layers of attention modules to obtain the target representation vector.
  • The i-th layer attention module in the N layers determines the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters,
  • and determines the i-th layer representation vector output by the i-th layer attention module based on the i-th layer attention weight parameter and the i-th layer input representation vector, with 1 ≤ i ≤ N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i is equal to N, the i-th layer representation vector is used to determine the target representation vector.
  • At least 2 layers of attention modules among the N layers share the set of shared parameters, and the at least 2 layers include the i-th layer attention module.
  • According to the target representation vector, the target information recognition result is determined, where the target information recognition result is used to represent the target information recognized from the target media resource.
  • Because the N layers of attention modules share parameters while each layer's representation vector is associated with the non-shared parameters of the previous layer, the calculation amount of the attention-based recognition model is reduced without an excessive loss of recognition performance.
  • Moreover, the self-attention weights of different layers can differ as needed, so performance is not weaker than, and can even exceed, that of the original recognition model, balancing model performance against calculation amount, and thereby solving the technical problem in the related art that accelerating the calculation process of an attention-based recognition model causes a large performance loss.
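  • To make the data flow above concrete, the following minimal sketch (Python with NumPy) assembles an N-layer stack in which all layers reuse one set of shared parameters (W_Q, W_K, W_V) while each layer's non-shared input H_{i-1} comes from the previous layer; the single-head form, the interpolation coefficient lam, and the tanh feed-forward blocks are illustrative assumptions, not forms fixed by this application.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def run_stack(X, W_Q, W_K, W_V, ffn_layers, lam=0.5):
        # N-layer attention stack: every layer reuses the same shared
        # parameters (W_Q, W_K, W_V); only the intermediate representation
        # H_{i-1} (the non-shared parameters) differs from layer to layer.
        d_k = W_K.shape[1]
        H = X                  # H_0: original features of the media resource
        A_prev, G = None, X    # A'_{i-1}: previous layer's attention weights
        for ffn in ffn_layers:
            Q, K, V = H @ W_Q, H @ W_K, H @ W_V
            A = softmax(Q @ K.T / np.sqrt(d_k))     # initial weights A_i
            if A_prev is not None:                  # A'_i = f(A_i, A'_{i-1})
                A = lam * A + (1 - lam) * A_prev
            G = A @ V          # layer output G_i = A'_i V_i
            H = ffn(G)         # H_i: feeds the next layer's non-shared input
            A_prev = A
        return G               # G_N: the target representation vector

    # toy usage: 10 frames of 16-dimensional features, 3 layers
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 16))
    W_Q, W_K, W_V = [rng.normal(size=(16, 16)) * 0.1 for _ in range(3)]
    ffns = [lambda G, W=rng.normal(size=(16, 16)) * 0.1: np.tanh(G @ W)
            for _ in range(3)]
    print(run_stack(X, W_Q, W_K, W_V, ffns).shape)  # (10, 16)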
  • Figure 1 is an application environment of an optional attention module-based information identification method according to an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of an optional attention module-based information identification method according to an embodiment of the present application
  • Figure 3 is a schematic diagram of an optional attention module-based information identification method according to an embodiment of the present application.
  • Figure 4 is a schematic diagram of yet another optional information identification method based on the attention module according to an embodiment of the present application.
  • Figure 5 is a schematic diagram of yet another optional information identification method based on the attention module according to an embodiment of the present application.
  • Figure 6 is a schematic diagram of yet another optional information identification method based on the attention module according to an embodiment of the present application.
  • Figure 7 is a schematic diagram of yet another optional information identification method based on the attention module according to an embodiment of the present application.
  • Figure 8 is a schematic diagram of yet another optional information identification method based on the attention module according to an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of an optional attention module-based information recognition device according to an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of an optional attention module-based information recognition product according to an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of an optional electronic device according to an embodiment of the present application.
  • Attention mechanism: the idea is to apply human perception and attention behavior to machines, so that machines learn to distinguish the important and unimportant parts of the data.
  • Self/Intra Attention (self-attention mechanism): the weight assigned to each input item depends on the interaction between the input items; that is, the input items "vote" among themselves to decide which items should be attended to. It also has the advantage of allowing parallel computation over the input.
  • an information identification method based on an attention module is provided.
  • the above information recognition method based on the attention module can be applied in the application environment shown in Figure 1.
  • the server 101 is connected to the terminal 103 through the network and can be used to provide services for the terminal device or the application program installed on the terminal device.
  • the application program can be a video application, an instant messaging application, a browser application, Educational apps, conferencing apps, and more.
  • the database 105 can be set up on the server or independently of the server to provide data storage services for the server 101, for example, a voice data storage server.
  • the above-mentioned network may include but is not limited to a wired network and a wireless network, where the wired network includes: local area networks, metropolitan area networks, and wide area networks,
  • and the wireless network includes: Bluetooth, WIFI, and other networks that implement wireless communication.
  • the terminal device 103 can be a terminal configured with an application program, and may include but is not limited to at least one of the following: a mobile phone (such as an Android phone, iOS phone, etc.), a laptop, a tablet, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and other computer devices; the above-mentioned server can be a single server, a server cluster composed of multiple servers, or a cloud server.
  • the application 107 using the above-mentioned information recognition method based on the attention module is displayed through the terminal device 103 or another connected display device.
  • the above information recognition method based on the attention module can be implemented on the terminal device 103 through the following steps:
  • S1 Obtain the target media resource features of the target media resource on the terminal device 103, and input the target media resource features into the target information recognition model, where the target information recognition model includes N layers of attention modules, and N is a positive integer greater than or equal to 2;
  • S2 Process the target media resource features through the N layers of attention modules on the terminal device 103 to obtain the target representation vector, where the i-th layer attention module in the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module based on the i-th layer attention weight parameter and the i-th layer input representation vector, 1 ≤ i ≤ N; when i is less than N,
  • the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i is equal to N,
  • the i-th layer representation vector is used to determine the target representation vector;
  • at least 2 layers of attention modules among the N layers share a set of shared parameters, and the at least 2 layers include the i-th layer attention module;
  • S3 Determine the target information recognition result on the terminal device 103 according to the target representation vector, where the target information recognition result is used to represent the target information recognized from the target media resource.
  • the above information identification method based on the attention module can also be implemented by a server, for example, implemented in the server 101 shown in Figure 1; or implemented by the terminal device and the server jointly.
  • the above-mentioned information identification method based on the attention module includes:
  • S202 obtain the target media resource characteristics of the target media resource, and input the target media resource characteristics into the target information identification model, where the target information identification model includes an N-layer attention module, and N is a positive integer greater than or equal to 2;
  • S204 Process the target media resource features through the N layers of attention modules to obtain the target representation vector, where the i-th layer attention module in the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module based on the i-th layer attention weight parameter and the i-th layer input representation vector, 1 ≤ i ≤ N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module;
  • when i is equal to N, the i-th layer representation vector is used to determine the target representation vector; at least 2 layers of attention modules among the N layers share a set of shared parameters, and the at least 2 layers include the i-th layer attention module;
  • S206 Determine the target information recognition result according to the target representation vector, where the target information recognition result is used to represent the target information recognized from the target media resource.
  • the above-mentioned information recognition method based on the attention module may include, but is not limited to, application in voice conversation scenarios, emotion recognition scenarios, and image recognition scenarios in the field of cloud technology.
  • Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, etc. based on the cloud computing business model. It can form a resource pool and use it on demand, which is flexible and convenient. Cloud computing technology will become an important support.
  • the background services of technical network systems require large amounts of computing and storage resources, for example for video websites, picture websites, and other portal websites.
  • each item may have its own identification mark that needs to be transmitted to the backend system for logical processing; data at different levels is processed separately, and all types of industry data need powerful system backing, which can only be provided through cloud computing.
  • Cloud computing in a narrow sense refers to the delivery and usage model of IT infrastructure, that is, obtaining the required resources through the network in an on-demand and easily scalable manner;
  • cloud computing in a broad sense refers to the delivery and usage model of services, that is, obtaining the required services through the network in an on-demand and easily scalable manner.
  • Such services can be IT, software, Internet-related, or other services.
  • Cloud computing is a product of the development and integration of traditional computer and network technologies such as Grid Computing, Distributed Computing, Parallel Computing, Utility Computing, Network Storage Technologies, Virtualization, and Load Balancing.
  • Cloud computing has developed rapidly with the development of the Internet, real-time data streams, diversification of connected devices, and the demand for search services, social networks, mobile commerce, and open collaboration. Different from the previous parallel distributed computing, the emergence of cloud computing will conceptually promote revolutionary changes in the entire Internet model and enterprise management model.
  • Cloud conferencing is an efficient, convenient, and low-cost conference format based on cloud computing technology. Users only need to perform simple, easy-to-use operations through an Internet interface to quickly and efficiently share voice, data files, and video with teams and customers around the world, while complex technologies such as data transmission and processing in meetings are handled by the cloud conferencing service provider on the users' behalf.
  • the cloud conference system supports dynamic cluster deployment of multiple servers and provides multiple high-performance servers, which greatly improves conference stability, security, and availability.
  • video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management, and it has been widely used in transportation, finance, operators, education, enterprises, and other fields. There is no doubt that video conferencing built on cloud computing is even more attractive in convenience, speed, and ease of use, and will surely trigger a new upsurge in video conferencing applications.
  • the above-mentioned cloud meeting scenario may include but is not limited to using artificial intelligence cloud services and using the end-to-end speech recognition model structure to realize automatic meeting minutes in the meeting.
  • AIaaS (AI as a Service) is a service model in which artificial intelligence is provided as a service.
  • This service model is similar to opening an AI-themed mall: all developers can access, through API interfaces, one or more of the artificial intelligence services provided by the platform, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy and operate their own exclusive cloud artificial intelligence services.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Key speech technologies include automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Allowing computers to hear, see, speak, and feel is the future direction of human-computer interaction, and voice has become one of the most promising human-computer interaction methods.
  • the above-mentioned information identification method based on the attention module can include but is not limited to application scenarios such as remote training, remote consultation, emergency command, remote interviews, open classes, telemedicine, and business negotiations based on artificial intelligence.
  • Figure 3 is a schematic diagram of an optional information identification method based on the attention module according to the embodiment of the present application, as shown in Figure 3, taking the cloud conference scenario as an example , including: input device 302, processing device 304 and output device 306, where the input device 302 is used to obtain voice information sent by accounts participating in the cloud conference.
  • the voice information may include but is not limited to being obtained by a microphone or other voice input device, After the above-mentioned voice information is obtained, the above-mentioned voice information is input to the processing device 304 of the cloud server.
  • the processing device 304 may include but is not limited to a neural network model composed of a general Conformer/Transformer neural network structure.
  • the voice information is input into the above-mentioned neural network model to obtain the representation vector it outputs; the representation vector is then processed to obtain the final recognition result, which is recorded in the database through the above-mentioned output device 306 and stored in the server as the above-mentioned automatic meeting minutes.
  • the above-mentioned target media resources may include but are not limited to the voice information collected in the above-mentioned cloud conference scenario.
  • the above-mentioned target representation vector can be understood as a representation vector that can represent the above-mentioned voice information.
  • the above set of shared parameters may include but is not limited to the W_Q, W_K, and W_V parameters used in the attention mechanism.
  • the above parameters are adjusted when training the above speech recognition model (corresponding to the aforementioned target information recognition model) so that each attention weight parameter can be determined based on the attention mechanism.
  • when the model is used for recognition, the above set of shared parameters is kept unchanged and applied to each layer of attention modules in the N layers.
  • the above-mentioned i-th group of non-shared parameters can be understood as being independently configured for each layer of the N layers of attention modules, including but not limited to the (i-1)-th intermediate-layer speech representation parameter H_{i-1}, and may also include but is not limited to original speech features or speech representation parameters obtained through several layers of simple neural networks.
  • the above-mentioned i-th layer attention weight parameter may include but is not limited to the attention weight parameter A_i of the i-th layer speech features, obtained after performing a normalization operation on Q_i and K_i.
  • the above-mentioned G_i is the speech representation vector that needs to be input into the next layer of attention module.
  • the above-mentioned G_i is used to determine the intermediate-layer speech representation parameter H_i, which in turn determines G_{i+1} through the above steps, and so on, until G_N output by the last layer of attention module is determined and used for downstream speech recognition tasks to obtain the speech recognition results.
  • the set of shared parameters may include but is not limited to the above-mentioned speech recognition parameters to be learned: W_Q, W_K, W_V.
  • in an end-to-end speech recognition model structure based on the Transformer (the encoder may also use the Conformer), the Multi-Head Attention modules of the N_e Transformer layers in the encoder (corresponding to the aforementioned attention modules) share a unified multi-head attention calculation module (sharing W_Q, W_K, W_V, corresponding to the aforementioned set of shared parameters).
  • the encoder includes N_e attention modules, and the decoder includes N_d attention modules.
  • the speech resources are input from Inputs.
  • the above speech features are obtained and input into the encoder, the speech features are processed through the N layers of attention modules (Multi-Head Attention) to obtain the speech representation vector G_N, and the speech recognition result is generated; alternatively, G_N is input into the decoder to obtain the speech recognition result.
  • Figure 4 is a schematic diagram of another optional information recognition method based on the attention module according to the embodiment of the present application.
  • it is applied to the emotion recognition scenario.
  • it includes: an input device 402, a processing device 404, and an output device 406.
  • the input device 402 is used to obtain images that can express emotions.
  • the above image information is input to the processing device 404 of the cloud server.
  • the above-mentioned processing device 404 may include but is not limited to a neural network model composed of a neural network structure.
  • the representation vector output by the above-mentioned neural network model is obtained and then processed to obtain the final recognition result, which is further processed through the above-mentioned output device 406 to save the recognized emotion information to the database.
  • the above-mentioned target media resources may include but are not limited to the image information collected in the above-mentioned emotion recognition scenario.
  • the above-mentioned target representation vector can be understood as a representation vector that can represent the above-mentioned image information.
  • the above set of shared parameters may include but is not limited to the W_Q, W_K, and W_V parameters used in the attention mechanism.
  • the above parameters are adjusted when training the above image recognition model (corresponding to the aforementioned target information recognition model) so that each attention weight parameter can be determined based on the attention mechanism.
  • when the model is used for recognition, the above set of shared parameters is kept unchanged and applied to each layer of attention modules in the N layers.
  • the above-mentioned i-th group of non-shared parameters can be understood as being independently configured for each layer of the N layers of attention modules, including but not limited to the (i-1)-th intermediate-layer image representation parameter H_{i-1}, and may also include but is not limited to original image features or image representation parameters obtained through several layers of simple neural networks.
  • the above-mentioned i-th layer attention weight parameter may include but is not limited to the attention weight parameter A_i of the i-th layer image features, obtained after performing a normalization operation on Q_i and K_i.
  • the above-mentioned G_i is the image representation vector that needs to be input into the next layer of attention module.
  • the above-mentioned G_i is used to determine the intermediate-layer image representation parameter H_i, which in turn determines G_{i+1} through the above steps, and so on, until G_N output by the last layer of attention module is determined and used for downstream image recognition tasks to obtain the image recognition results.
  • at least 2 layers of attention modules among the above-mentioned N layers share a set of shared parameters.
  • the set of shared parameters may include but is not limited to the above-mentioned image recognition parameters to be learned: W_Q, W_K, W_V.
  • the encoder can also use the Conformer; the Multi-Head Attention modules of the N_e Transformer layers in the encoder (corresponding to the aforementioned attention modules) share a unified multi-head attention calculation module (sharing W_Q, W_K, W_V, corresponding to the aforementioned set of shared parameters).
  • the encoder includes N_e attention modules, and the decoder includes N_d attention modules.
  • the image resources are input from Inputs.
  • the above information recognition method based on the attention module can also be applied to processing devices such as mobile phones, speakers, small appliances, and embedded products, which have limited computing resources and memory and cannot support large amounts of calculation, to recognize speech or image information and apply the recognized text, emotion types, objects, actions, etc. in downstream scenarios.
  • the above-mentioned target media resources may include but are not limited to media resources to be recognized such as video, audio, and pictures; specifically, they may include but are not limited to voice information collected in a cloud conference scenario, video information played in advertisements, images to be identified collected in the security field, etc.
  • the target media resource features may include but are not limited to features extracted by inputting the target media resource into a conventional neural network model, and may be, but are not limited to being, expressed in the form of a vector.
  • the above-mentioned target information recognition model may include but is not limited to being composed of a multi-layer attention module.
  • the above-mentioned N-layer attention module may be, but is not limited to, using a unified attention calculation module to complete the calculation task.
  • the above target information recognition model may include but is not limited to an end-to-end speech recognition model structure based on Transformer, in which the encoder (Encoder) may also use Conformer.
  • Figure 5 is a schematic diagram of another optional attention module-based information recognition method according to an embodiment of the present application.
  • it shows the above-mentioned Transformer-based end-to-end speech recognition model structure, in which the encoder is composed of N_e attention modules.
  • the above-mentioned target representation vector can be understood as a representation vector that can represent the above-mentioned target media resource.
  • the above-mentioned target representation vector is input into a subsequent processing model to determine the recognition result and then generate the text and other data required by the business.
  • the above set of shared parameters may include but is not limited to the W_Q, W_K, and W_V parameters used in the attention mechanism,
  • which are adjusted when training the above target information recognition model so that each attention weight parameter can be determined based on the attention mechanism.
  • when the above target information recognition model is used for recognition,
  • the above set of shared parameters remains unchanged and is applied to each layer of attention modules in the N layers.
  • FIG. 6 is a schematic diagram of another optional information identification method based on the attention module according to the embodiment of the present application.
  • in each layer's attention module (Multi-Head Attention),
  • the inputs Q, K, and V are associated with W_Q, W_K, and W_V respectively, and the representation vector of that layer is then obtained.
  • the above-mentioned i-th group of non-shared parameters can be understood as being independently configured for each layer of attention modules in the N layers, and may include but is not limited to the (i-1)-th intermediate-layer representation parameter
  • H_{i-1}, and may also include but is not limited to original features or representation parameters obtained through several layers of simple neural networks.
  • the above-mentioned i-th layer attention weight parameter may include but is not limited to the i-th layer attention weight parameter A_i obtained after performing a normalization operation on Q_i and K_i.
  • the i-th layer input representation vector may include but is not limited to V_i.
  • the above G_i is the representation vector that needs to be input into the next layer of attention module.
  • the above G_i is used to determine the i-th intermediate-layer representation parameter H_i, which in turn determines G_{i+1} through the above steps, and so on, until G_N output by the last layer of attention module is determined, which can be used for downstream recognition tasks to obtain the target information recognition results.
  • the i-th layer representation vector is used to determine the i+1-th group of non-shared parameters used by the i+1-th layer attention module.
  • At least 2 layers of attention modules among the above-mentioned N layers share a set of shared parameters, which may include but are not limited to the above-mentioned W_Q, W_K, and W_V; in other words, W_Q, W_K, and W_V in the above N layers of attention modules can be configured as multiple sets of shared parameters, or as one set of shared parameters.
  • determining the target information recognition result based on the target representation vector may include but is not limited to directly generating the target information recognition result based on the target representation vector output by the encoder including the N-layer attention module, It may also include, but is not limited to, inputting the representation vector output by the encoder including the N-layer attention module into the decoder to generate a target information recognition result through the N-layer mask module and the N-layer attention module of the decoder.
  • the above target information recognition result represents the target information recognized from the target media resource, which may include but is not limited to semantic information contained in the target media resource, emotion type information, etc.
  • Figure 7 is a schematic diagram of yet another optional attention-module-based information recognition method according to an embodiment of the present application. As shown in Figure 7, it includes a Transformer-based end-to-end speech recognition model structure (the encoder can also use the Conformer), in which the Multi-Head Attention modules (corresponding to the aforementioned attention modules) of the N_e Transformer layers in the encoder share a unified multi-head attention calculation module (sharing W_Q, W_K, W_V, corresponding to the aforementioned set of shared parameters).
  • the multi-head attention modules and the masked multi-head attention modules (Masked Multi-Head Attention) on the right side of the decoder (Decoder) part of Figure 7 can each share a set of modules (sharing W_Q, W_K, W_V).
  • the encoder includes N_e attention modules,
  • and the decoder includes N_d attention modules.
  • the target media resource is input into the encoder, and after two Conv/2+ReLU blocks (convolution layer plus activation function) and an Additional Module (an optional neural network module), the above target media resource features are obtained, and the target media
  • resource features are input into the encoder and processed through the N layers of attention modules (Multi-Head Attention) to obtain the target representation vector G_N and generate the target information recognition result, or G_N is input into the decoder to obtain the target information recognition result.
  • Figure 8 is a schematic diagram of yet another optional information identification method based on the attention module according to an embodiment of the present application.
  • the above set of shared parameters may be implemented by, but is not limited to, a unified self-attention
  • calculation module in which the above-mentioned W_Q, W_K, and W_V are stored, so that these parameters can be used to calculate the attention weight parameters of each layer separately, as sketched below.
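  • One plausible realization of such a unified calculation module is a single object that stores W_Q, W_K, and W_V once and is called by every layer with that layer's own non-shared input; the following Python sketch is an illustrative assumption, not the application's prescribed implementation.

    import numpy as np

    class UnifiedAttention:
        # Stores the shared parameters (W_Q, W_K, W_V) once; every layer
        # calls weights()/values() with its own intermediate representation H.
        def __init__(self, d_model, d_k, rng):
            self.W_Q = rng.normal(size=(d_model, d_k)) * 0.1
            self.W_K = rng.normal(size=(d_model, d_k)) * 0.1
            self.W_V = rng.normal(size=(d_model, d_k)) * 0.1

        def weights(self, H):
            # Per-layer initial attention weights A_i from the shared W_Q, W_K.
            Q, K = H @ self.W_Q, H @ self.W_K
            logits = Q @ K.T / np.sqrt(self.W_K.shape[1])
            e = np.exp(logits - logits.max(axis=-1, keepdims=True))
            return e / e.sum(axis=-1, keepdims=True)

        def values(self, H):
            # Per-layer input representation vector V_i from the shared W_V.
            return H @ self.W_V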
  • In the embodiments of the present application, the target media resource features of the target media resource are obtained and input into the target information recognition model, where the target information recognition model includes N layers of attention modules, N is a positive integer greater than or equal to 2, and the target media resource features are processed through the N layers of attention modules to obtain the target representation vector.
  • The i-th layer attention module in the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module based on the i-th layer attention weight parameter and the i-th layer input representation vector, 1 ≤ i ≤ N.
  • When i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module.
  • When i is equal to N, the i-th layer representation vector is used to determine the target representation vector; the target media resource features are used to determine the first group of non-shared parameters used by the first-layer attention module in the N layers, at least 2 layers
  • of attention modules in the N layers share a set of shared parameters, and the at least 2 layers include the i-th layer attention module.
  • According to the target representation vector, the target information recognition result is determined, where the target information recognition result is used to represent the target information recognized from the target media resource.
  • In the process of determining the target representation vector, the N layers of attention modules can associate each layer's representation vector with the non-shared parameters of the previous layer,
  • and the self-attention weights of different layers can be adjusted as needed,
  • so that performance is not weaker than, and can even exceed, that of the original recognition model, balancing model performance against calculation amount, and thereby solving the problem in the related art that accelerating the calculation process of an attention-based recognition model causes a large performance loss.
  • the i-th layer attention weight parameter and the i-th layer input representation vector are determined in the following way:
  • the i-th layer attention weight parameter is determined according to the first part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the set of shared parameters includes a first part of shared parameters and a second part of shared parameters, and the (i-1)-th intermediate-layer representation parameter is the intermediate-layer representation parameter determined based on the (i-1)-th layer representation vector output by the (i-1)-th layer attention module;
  • the i-th layer input representation vector is weighted and summed using the i-th layer attention weight parameter to obtain the i-th layer representation vector output by the i-th layer attention module.
  • the above-mentioned first part of the shared parameters can be understood as the above-mentioned W_Q and W_K.
  • the above-mentioned (i-1)-th intermediate-layer representation parameter can be understood as H_{i-1},
  • which is output after the previous layer's output G_{i-1} passes through a feed-forward neural network; that is, H_{i-1} is determined based on G_{i-1}.
  • A′_i is the i-th layer attention weight parameter determined using (W_Q, W_K, W_V).
  • G yields H after passing through the feed-forward network (FeedForward Network).
  • the above-mentioned second part of the shared parameters can be understood as the above-mentioned W_V.
  • the above-mentioned i-th layer input representation vector can be understood as V_i, which is determined based on the intermediate-layer representation output by the previous layer.
  • G_i = A′_i V_i.
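  • Collecting the quantities just defined, one pass through the i-th layer can be restated compactly as follows (a consolidation of the document's own formulas; the concrete form of f is discussed below):

    \begin{aligned}
    Q_i &= H_{i-1} W_Q, \qquad K_i = H_{i-1} W_K, \qquad V_i = H_{i-1} W_V \\
    A_i &= \operatorname{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_K}}\right) \\
    A'_i &= f\left(A_i,\, A'_{i-1}\right), \qquad G_i = A'_i V_i
    \end{aligned}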
  • the first part of the shared parameters includes the first shared parameter W_Q and the second shared parameter W_K,
  • the (i-1)-th intermediate-layer representation parameter is H_{i-1},
  • and the i-th layer attention weight parameter is determined based on the initial attention weight parameter A_i and the (i-1)-th layer attention weight parameter A′_{i-1} used in the (i-1)-th layer attention module.
  • the first part of the shared parameters includes the first shared parameter W_Q and the second shared parameter W_K,
  • and the (i-1)-th intermediate-layer representation parameter is H_{i-1}.
  • the first correlation parameter Q_i and the second correlation parameter K_i are normalized to obtain the initial attention weight parameter A_i of the i-th layer attention module, including but not limited to by the following formula: A_i = softmax(Q_i K_i^T / √d_K),
  • where d_K denotes the length of K_i.
  • the initial attention weight parameter A_i and the (i-1)-th layer attention weight parameter A′_{i-1} are weighted and summed to obtain the i-th layer attention weight parameter.
  • the i-th layer attention weight parameter is determined based on the initial attention weight parameter A_i and the (i-1)-th layer attention weight parameter A′_{i-1} used in the (i-1)-th layer attention module (both ultimately derived from the shared parameters W_Q, W_K, W_V);
  • when the weighting coefficient applied to A′_{i-1} is 0, the current layer does not rely on the self-attention weight of the previous layer.
  • f can be other neural networks of any complexity.
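  • As one illustration of that flexibility, f can range from a fixed interpolation to a small learned network; both variants below are illustrative sketches (the coefficient lam and the tiny two-matrix network are assumptions, not forms prescribed by this application).

    import numpy as np

    def f_interpolate(A_i, A_prev, lam=0.5):
        # Weighted sum of the current layer's initial weights and the
        # previous layer's final weights: A'_i = lam*A_i + (1-lam)*A'_{i-1}.
        return lam * A_i + (1 - lam) * A_prev

    def f_learned(A_i, A_prev, W1, W2):
        # A per-layer network over the stacked weights; W1: (2, h), W2: (h, 1).
        # The mixed scores are re-normalized row-wise so they remain weights.
        stacked = np.stack([A_i, A_prev], axis=-1)     # shape (T, T, 2)
        logits = (np.tanh(stacked @ W1) @ W2)[..., 0]  # shape (T, T)
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)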
  • when the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the (i+1)-th layer attention weight parameter of the (i+1)-th layer attention module is determined as follows:
  • the (i+1)-th layer attention weight parameter is determined according to the first part of the shared parameters and the i-th intermediate-layer representation parameter, where the i-th intermediate-layer representation parameter is determined based on the i-th layer representation vector output by the i-th layer attention module.
  • that is, the (i+1)-th layer attention module can use the same first part of shared parameters and second part of shared parameters as the i-th layer attention module.
  • the attention modules of each layer use the shared attention parameters (W_Q, W_K, W_V) to perform feature processing and obtain the representation vector of that layer.
  • in another embodiment, the i-th layer attention weight parameter and the i-th layer input representation vector are determined in the following way:
  • the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter, which is the intermediate-layer representation parameter determined from the (i-1)-th layer representation vector;
  • the i-th layer input representation vector is weighted and summed using the i-th layer attention weight parameter to obtain the i-th layer representation vector output by the i-th layer attention module.
  • the above-mentioned shared attention weight parameter can be understood as the above-mentioned A, and the weighting parameter used in the above-mentioned i-th layer attention module may include but is not limited to a pre-configured W_i.
  • the function f enables different layers to obtain different final attention weights A_i based on the same initial attention value A.
  • for example, the sum of the shared attention weight parameter and the weighting parameter used in the i-th layer attention module is determined as the i-th layer attention weight parameter, i.e. A_i = A + W_i.
  • the selection method of f is relatively flexible.
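  • A minimal sketch of this parallel form, assuming the additive choice of f just named (A_i = A + W_i; the softmax re-normalization after the addition is an extra assumption made here so the result remains a weight matrix): the shared A is computed once from the encoder input X, after which every layer's weights can be derived independently.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def parallel_attention_weights(X, W_Q, W_K, per_layer_W):
        # Shared initial weights A are computed once from the original input X;
        # each layer's A_i = f_i(A) needs no other layer's result, so all
        # layers' weights can be computed in parallel.
        # per_layer_W: list of pre-configured (T, T) matrices, one per layer.
        Q, K = X @ W_Q, X @ W_K
        A = softmax(Q @ K.T / np.sqrt(W_K.shape[1]))
        return [softmax(A + W_i) for W_i in per_layer_W]  # additive f_i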
  • in some embodiments, the above methods also include:
  • obtaining initial representation features, where the initial representation features are the features of the target media resource, or are features converted from the features of the target media resource;
  • the set of shared parameters also includes a first part of shared parameters, and the first part of shared parameters includes a first shared parameter W_Q and a second shared parameter W_K; the initial representation features are multiplied by W_Q and W_K respectively to obtain
  • a first shared correlation parameter Q and a second shared correlation parameter K;
  • the first shared correlation parameter Q and the second shared correlation parameter K are normalized to obtain the shared attention weight parameter.
  • the above-mentioned initial representation features may include but are not limited to the target media resource features, or features obtained by converting the target media resource features through another neural network model.
  • the above-mentioned normalization of the first shared correlation parameter Q and the second shared correlation parameter K to obtain the shared attention weight parameter may include but is not limited to the following formula: A = softmax(Q K^T / √d_K),
  • where A represents the shared attention weight parameter
  • and d_K represents the length of K.
  • when the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the (i+1)-th layer attention weight parameter of the (i+1)-th layer attention module is determined as follows:
  • the (i+1)-th layer input representation vector is determined according to the second part of the shared parameters and the i-th intermediate-layer representation parameter, where the i-th intermediate-layer representation parameter is determined based on the i-th layer representation vector output by the i-th layer attention module,
  • and the (i+1)-th group of non-shared parameters includes the i-th intermediate-layer representation parameter;
  • the above-mentioned shared attention weight parameter can be understood as the above-mentioned A,
  • the above-mentioned weighting parameter used in the (i+1)-th layer attention module can be understood as W_{i+1},
  • the (i+1)-th layer attention weight parameter can be understood as A_{i+1},
  • the above-mentioned second part of the shared parameters can be understood as W_V,
  • the above-mentioned i-th intermediate-layer representation parameter can be understood as H_i,
  • the above-mentioned (i+1)-th layer input representation vector can be understood as V_{i+1},
  • and the above (i+1)-th layer representation vector can be understood as G_{i+1}.
  • In the above formulas, H represents the input of the attention module;
  • W_Q, W_K, and W_V represent the parameters to be learned, in matrix form;
  • Q, K, V, and A are all intermediate calculation results;
  • d_K represents the length of K;
  • A′_i is the self-attention value of the i-th layer Transformer;
  • f is a custom function;
  • and G is the output of the self-attention module.
  • the attention modules of the different Transformer layers in the encoder share W_Q, W_K, and W_V, and the function f allows the result of the previous layer to be referred to when calculating the attention of the current layer.
  • the second part of the shared parameters includes the third shared parameter W_V, and the (i-1)-th intermediate-layer representation parameter is H_{i-1}; multiplying H_{i-1} by W_V yields the i-th layer input representation vector:
  • V_i = H_{i-1} W_V.
  • the above method also includes:
  • the (i-k)-th intermediate-layer representation parameter is obtained, where 1 ≤ k < i and the (i-k)-th intermediate-layer representation parameter is
  • determined from the (i-k)-th layer representation vector output by the (i-k)-th layer attention module;
  • the (i-1)-th intermediate-layer representation parameter is determined according to the (i-1)-th layer representation vector and the (i-k)-th intermediate-layer representation parameter.
  • the above-mentioned (i-1)-th layer representation vector can be understood as G_{i-1},
  • the above-mentioned (i-k)-th intermediate-layer representation parameter can be understood as H_{i-k},
  • and the above-mentioned (i-k)-th layer representation vector can be understood as G_{i-k}.
  • the G_{i-1} output by the "Multi-Head Attention" module is superimposed on the H_{i-k} from the (i-k)-th layer attention module, and the result then passes through the "Layer Norm" module and the "Feed Forward" module to obtain H_{i-1}.
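  • A sketch of that residual path, assuming a standard layer normalization and a one-hidden-layer ReLU feed-forward block (both left unspecified by the text):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        sd = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sd + eps)

    def next_intermediate(G_prev, H_skip, W1, W2):
        # H_{i-1} = FeedForward(LayerNorm(G_{i-1} + H_{i-k})): superimpose the
        # attention output on an earlier intermediate representation, then
        # normalize and transform.
        x = layer_norm(G_prev + H_skip)
        return np.maximum(x @ W1, 0.0) @ W2  # ReLU feed-forward block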
  • the target media resource features are processed through the N-layer attention module to obtain the target representation vector, including:
  • the j-th layer representation vector output by the j-th layer attention module in the M layers of attention modules is determined as the p-th layer representation vector output by the p-th layer attention module, where the sharing relationship indicates that the j-th layer representation vector output by the j-th layer attention module is shared with the p-th layer attention module.
  • the above-mentioned M layers of attention modules can be configured in advance, so that a p-th layer attention module outside the M layers in the N layers can, according to the pre-configured sharing relationship, directly take the j-th layer representation vector output by the j-th layer attention module in the M layers as its own p-th layer representation vector (see the sketch below).
  • when the attention weight parameters themselves are not shared and only the parameters to be learned for calculating the attention weight parameters are shared, the amount of calculation increases;
  • however, the self-attention weights of different layers can still differ as needed, so that performance is not weaker than, and can even exceed, that of an attention model that directly shares self-attention weights.
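  • One way to realize such a pre-configured sharing relationship is a simple layer-to-source map; the map below (layer 3 reusing layer 1, layer 5 reusing layer 2) is a hypothetical configuration chosen purely for illustration.

    # Hypothetical sharing relationship: the p-th layer reuses the output of
    # the j-th layer instead of recomputing attention.
    share_map = {3: 1, 5: 2}  # p-th layer -> j-th source layer

    def run_with_sharing(X, layers, share_map):
        # layers[i] maps the running representation to that layer's output G;
        # a layer listed in share_map skips computation and reuses a stored G.
        outputs, H = {}, X
        for idx, layer in enumerate(layers, start=1):
            G = outputs[share_map[idx]] if idx in share_map else layer(H)
            outputs[idx] = G
            H = G  # feeds the next layer's non-shared input
        return H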
  • the target media resource features are processed through the N-layer attention module to obtain the target representation vector, including:
  • when the i-th layer attention module is a T-head attention module and T is a positive integer greater than or equal to 2, the T-head
  • attention module determines T initial representation vectors of the i-th layer based on T subgroups of shared parameters and the i-th group of non-shared parameters, and weights and sums the T i-th layer initial representation vectors to obtain the i-th layer representation vector.
  • the above-mentioned N layers of attention modules may all be T-head attention modules, or only some of them may be T-head attention modules.
  • when the i-th layer attention module is a T-head attention module, corresponding shared parameters are assigned to each single-head attention model, so that T initial representation vectors of the i-th layer are determined based on the T subgroups of shared parameters and the non-shared parameters; the T initial representation vectors of the i-th layer can then be weighted and summed to obtain the i-th layer representation vector output by the i-th layer attention module, as sketched below.
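  • A sketch of the T-head case, assuming each head h owns one subgroup of shared parameters (W_Q^h, W_K^h, W_V^h) reused across layers, and that the head outputs are combined by a weighted sum (uniform weights would reduce this to a plain mean).

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def t_head_layer(H, heads, head_weights):
        # heads: list of T subgroups (W_Q, W_K, W_V) shared across layers.
        # Returns the weighted sum of the T per-head initial representations.
        outs = []
        for W_Q, W_K, W_V in heads:
            Q, K, V = H @ W_Q, H @ W_K, H @ W_V
            A = softmax(Q @ K.T / np.sqrt(W_K.shape[1]))
            outs.append(A @ V)  # one head's initial representation vector
        return sum(w * G for w, G in zip(head_weights, outs))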
  • the self-attention unified computing module has two forms, taking the encoder as an example (the same applies to the decoder):
  • Layer-by-layer dependence mode: when calculating the attention of the current layer, the result of the previous layer can be referred to, making the attention more consistent and the training more stable.
  • In the corresponding formulas, H represents the input of the multi-head attention module (an intermediate-layer representation);
  • W_Q, W_K, and W_V represent the parameters to be learned, in matrix form;
  • Q, K, V, and A are all intermediate calculation results;
  • d_K represents the length of K;
  • A′_i is the self-attention value of the i-th layer Transformer;
  • f is a custom function;
  • and G is the output of the self-attention module (still an intermediate-layer representation).
  • The calculations of the other single heads in the multi-head attention module are similar.
  • the multi-head attention modules of the different Transformer layers in the encoder share W_Q, W_K, and W_V, and the function f allows the result of the previous layer to be referred to when calculating the attention of the current layer.
  • f can be other neural networks of any complexity.
  • Each-layer parallel mode: in the corresponding formulas, H represents the input of the multi-head attention module (an intermediate-layer representation);
  • X represents the input of the entire encoder (usually the original speech features, possibly passed through several layers of simple neural networks);
  • W_Q, W_K, and W_V represent the parameters to be learned, in matrix form;
  • Q, K, V, and A are all intermediate calculation results;
  • d_K represents the length of K;
  • A_i is the self-attention value of the i-th layer Transformer;
  • f is a custom function,
  • and the f of each layer of the Transformer is independent of the others;
  • G is the output of the self-attention module (still an intermediate-layer representation).
  • The calculations of the other single heads in the multi-head attention module are similar.
  • the multi-head attention modules of the different Transformer layers in the encoder share Q, K, and V.
  • the function f enables different layers to obtain different final attention weights A i based on the same initial attention value A.
  • for a conventional model structure, the main factor affecting computing efficiency is the layer-by-layer calculation of the self-attention mechanism.
  • with the each-layer parallel computing mode, all layers' attention weights can be obtained from the original input at once, which greatly improves computational efficiency.
  • the model structure proposed in this application performs better than the conventional model structure on multiple speech data sets and has fewer model parameters, especially on small data sets.
  • the each-layer parallel computing mode in this application greatly improves computing efficiency.
  • the model structure proposed in this application also converges faster than the conventional model structure.
  • An attention module-based information identification apparatus for implementing the above attention module-based information identification method is also provided.
  • The apparatus includes:
  • an acquisition module 902, configured to obtain the target media resource features of a target media resource and input them into a target information identification model, where the model includes N layers of attention modules and N is a positive integer greater than or equal to 2;
  • a processing module 904, configured to process the target media resource features through the N layers of attention modules to obtain a target representation vector, where the i-th layer attention module among the N layers determines the i-th layer attention weight parameter and the i-th layer input representation vector from a set of shared parameters and the i-th group of non-shared parameters, and determines from them the i-th layer representation vector output by the i-th layer attention module, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, the i-th layer representation vector is used to determine the target representation vector; at least 2 of the N layers of attention modules share the set of shared parameters, and these at least 2 layers include the i-th layer attention module;
  • a determination module 906, configured to determine a target information identification result from the target representation vector, where the result represents the target information identified from the target media resource.
  • The processing module 904 is further configured to: determine the i-th layer attention weight parameter from a first part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the set of shared parameters includes the first part and a second part of the shared parameters, and the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module; and determine the i-th layer input representation vector from the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter.
  • The processing module 904 is further configured to: when the first part of the shared parameters includes a first shared parameter WQ and a second shared parameter WK and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiply Hi-1 by WQ and by WK to obtain the first correlation parameter Qi and the second correlation parameter Ki used in the i-th layer attention module; normalize Qi and Ki to obtain the initial attention weight parameter Ai of the i-th layer attention module; and determine the i-th layer attention weight parameter from Ai and the (i-1)-th layer attention weight parameter A′i-1 used in the (i-1)-th layer attention module.
  • The processing module 904 is further configured to perform a weighted sum of Ai and A′i-1 to obtain the i-th layer attention weight parameter.
  • When the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the processing module 904 is further configured to: determine the (i+1)-th layer attention weight parameter from the first part of the shared parameters and the i-th intermediate-layer representation parameter, which is determined from the i-th layer representation vector output by the i-th layer attention module; and determine the (i+1)-th layer input representation vector from the second part of the shared parameters and the i-th intermediate-layer representation parameter, where the (i+1)-th group of non-shared parameters includes the i-th intermediate-layer representation parameter.
  • The processing module 904 is further configured to: determine the i-th layer attention weight parameter from a shared attention weight parameter and the weighting parameter used in the i-th layer attention module, where the set of shared parameters includes the shared attention weight parameter and the second part of the shared parameters; and determine the i-th layer input representation vector from the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, which is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module, where the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter.
  • The processing module 904 is further configured to determine the sum of the shared attention weight parameter and the weighting parameter used in the i-th layer attention module as the i-th layer attention weight parameter.
  • The processing module 904 is further configured to: obtain the initial characterization features of the target media resource, where the initial characterization features are the target media resource features or features converted from them; when the set of shared parameters also includes the first part of the shared parameters with a first shared parameter WQ and a second shared parameter WK, multiply the initial characterization features by WQ and by WK to obtain a first shared correlation parameter Q and a second shared correlation parameter K; and normalize Q and K to obtain the shared attention weight parameter.
  • When the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the processing module 904 is further configured to: determine the (i+1)-th layer attention weight parameter from the shared attention weight parameter and the weighting parameter used in the (i+1)-th layer attention module; and determine the (i+1)-th layer input representation vector from the second part of the shared parameters and the i-th intermediate-layer representation parameter, where the (i+1)-th group of non-shared parameters includes the i-th intermediate-layer representation parameter.
  • The processing module 904 is further configured to: when the second part of the shared parameters includes a third shared parameter WV and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiply Hi-1 by WV to obtain the i-th layer input representation vector.
  • The processing module 904 is further configured to: when the (i-1)-th layer representation vector output by the (i-1)-th layer attention module has been obtained, obtain the (i-k)-th intermediate-layer representation parameter, where 1<k<i and the (i-k)-th intermediate-layer representation parameter is determined from the (i-k)-th layer representation vector output by the (i-k)-th layer attention module; and determine the (i-1)-th intermediate-layer representation parameter from the (i-1)-th layer representation vector and the (i-k)-th intermediate-layer representation parameter.
  • The processing module 904 is further configured to: when the at least 2 layers of attention modules are M layers of attention modules and M is less than N, perform the following steps for each p-th layer attention module among the N layers other than the M layers: according to a pre-configured sharing relationship, determine the j-th layer representation vector output by the j-th layer attention module among the M layers as the p-th layer representation vector output by the p-th layer attention module, where the sharing relationship indicates that the j-th layer representation vector output by the j-th layer attention module is shared with the p-th layer attention module.
  • The processing module 904 is further configured to: when the i-th layer attention module is a T-head attention module and T is a positive integer greater than or equal to 2, determine, through the T-head attention module, T initial representation vectors of the i-th layer from the T subgroups of shared parameters and the i-th group of non-shared parameters respectively, and perform a weighted sum of the T initial representation vectors to obtain the i-th layer representation vector output by the i-th layer attention module, where the set of shared parameters includes the T subgroups of shared parameters.
  • According to one aspect of the present application, a computer program product is provided; it includes a computer program/instructions containing the program code for executing the method shown in the flowchart.
  • The computer program may be downloaded and installed from a network via the communication part 1009, and/or installed from the removable medium 1011.
  • When the computer program is executed by the central processor 1001, the various functions provided by the embodiments of the present application are executed.
  • FIG. 10 schematically shows a structural block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
  • The computer system 1000 includes a central processing unit 1001 (CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 1002 (ROM) or a program loaded from a storage part 1008 into a random access memory 1003 (RAM). The random access memory 1003 also stores the various programs and data required for system operation.
  • The central processing unit 1001, the read-only memory 1002 and the random access memory 1003 are connected to one another through a bus 1004.
  • An input/output interface 1005 (I/O interface) is also connected to the bus 1004.
  • The following components are connected to the input/output interface 1005: an input part 1006 including a keyboard, a mouse, etc.; an output part 1007 including a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker, etc.; a storage part 1008 including a hard disk, etc.; and a communication part 1009 including a network interface card such as a LAN card or a modem.
  • The communication part 1009 performs communication processing via a network such as the Internet.
  • A drive 1100 is also connected to the input/output interface 1005 as needed.
  • A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1100 as needed, so that a computer program read from it can be installed into the storage part 1008 as needed.
  • In particular, according to the embodiments of the present application, the processes described in the method flowcharts may be implemented as computer software programs.
  • For example, the embodiments of the present application include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing the program code for performing the method illustrated in the flowchart.
  • The computer program may be downloaded and installed from a network via the communication part 1009, and/or installed from the removable medium 1011.
  • When the computer program is executed by the central processor 1001, the various functions defined in the system of the present application are executed.
  • An electronic device for implementing the above attention module-based information identification method is also provided.
  • The electronic device may be the terminal device or the server shown in FIG. 1.
  • This embodiment is explained by taking the electronic device as a terminal device as an example.
  • The electronic device includes a memory 1102 and a processor 1104.
  • The memory 1102 stores a computer program.
  • The processor 1104 is configured to execute the steps in any of the above method embodiments through the computer program.
  • The above electronic device may be located in at least one of multiple network devices of a computer network.
  • The above processor may be configured to perform the following steps through a computer program:
  • S1: obtain the target media resource features of the target media resource and input them into the target information identification model, where the target information identification model includes N layers of attention modules and N is a positive integer greater than or equal to 2;
  • S2: process the target media resource features through the N layers of attention modules to obtain a target representation vector, where the i-th layer attention module among the N layers determines the i-th layer attention weight parameter and the i-th layer input representation vector from a set of shared parameters and the i-th group of non-shared parameters, and determines the i-th layer representation vector output by the i-th layer attention module from them, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, the i-th layer representation vector is used to determine the target representation vector; at least 2 of the N layers of attention modules share a set of shared parameters, and these at least 2 layers include the i-th layer attention module;
  • S3: determine the target information recognition result from the target representation vector, where the target information recognition result represents the target information recognized from the target media resource.
  • The electronic device may also be a terminal device such as a smart phone, a tablet computer, a handheld computer, a mobile Internet device (MID) or a PAD.
  • The memory 1102 can be used to store software programs and modules, such as the program instructions/modules corresponding to the attention module-based information recognition method and apparatus in the embodiments of the present application.
  • The processor 1104 runs the software programs and modules stored in the memory 1102 to perform various functional applications and data processing, that is, to implement the above information recognition method based on the attention module.
  • The above transmission device 1106 is used to receive or send data via a network.
  • The above electronic device also includes: a display 1108 for displaying the above target information recognition result, and a connection bus 1110 for connecting the various module components in the above electronic device.
  • The above terminal device or server may be a node in a distributed system, where the distributed system may be a blockchain system formed by multiple nodes connected through network communication.
  • The nodes can form a peer-to-peer (P2P) network, and any form of computing device, such as a server, a terminal or other electronic equipment, can become a node in the blockchain system by joining the peer-to-peer network.
  • According to one aspect of the present application, a computer-readable storage medium is provided.
  • A processor of a computer device reads computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the attention module-based information recognition method provided in the various optional implementations above.
  • The embodiments of the present application also provide a computer program product including a computer program, which, when run on a computer, causes the computer to execute the method provided in the above embodiments.
  • The storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
  • If the integrated units in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they can be stored in the above computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • In the several embodiments provided in this application, it should be understood that the disclosed client can be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the units or modules may be electrical or in other forms.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • In addition, the functional units in the embodiments of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • The integrated units can be implemented in the form of hardware or in the form of software functional units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses an information recognition method and apparatus based on an attention module, a storage medium and an electronic device. The embodiments of this application can be applied to various scenarios such as in-vehicle applications, cloud technology, artificial intelligence, intelligent transportation and assisted driving, for example, speech scenarios based on parallel computing. The method includes: obtaining the target media resource features of a target media resource and inputting them into a target information recognition model; processing the target media resource features through N layers of attention modules to obtain a target representation vector, where the i-th layer attention module determines the i-th layer representation vector output by the i-th layer attention module according to a set of shared parameters and the i-th group of non-shared parameters; and determining a target information recognition result according to the target representation vector, where the result represents the target information recognized from the target media resource. This application solves the technical problem in the related art that the attention recognition model suffers a large performance loss in order to accelerate the computation process.

Description

Information recognition method based on attention module and related apparatus
This application claims priority to the Chinese patent application No. 202210705199.2, filed with the Chinese Patent Office on June 21, 2022 and entitled "Information recognition method and apparatus based on attention module", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computers, and specifically to information recognition based on an attention module.
Background
Recognition models based on self-attention show great advantages in a variety of tasks, and the self-attention mechanism is an important reason for their excellent performance. However, the computational complexity of the self-attention mechanism is high, which makes the whole recognition model computationally inefficient. Sharing attention is a commonly used acceleration method; the most common scheme is to share the self-attention weights, that is, to use the attention weights of one self-attention layer directly as the attention weights of the other layers, omitting the computation of the attention weights of those layers.
When the shared self-attention weight method is used, the representations of different layers differ in their level of abstraction yet use exactly the same attention weights, which causes a severe performance loss of the recognition model, so the recognition results fall short of expectations.
Therefore, the related art has the technical problem that, in the recognition process of the attention recognition model, accelerating the computation causes a large performance loss of the recognition model.
No effective solution has yet been proposed for the above problem.
Summary
The embodiments of this application provide an information recognition method and apparatus based on an attention module, a storage medium and an electronic device, to at least solve the technical problem in the related art that the attention recognition model suffers a large performance loss in order to accelerate the computation process.
According to one aspect of the embodiments of this application, an information recognition method based on an attention module is provided, including: obtaining target media resource features of a target media resource and inputting the target media resource features into a target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2; processing the target media resource features through the N layers of attention modules to obtain a target representation vector, where the i-th layer attention module among the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module according to the i-th layer attention weight parameter and the i-th layer input representation vector, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, the i-th layer representation vector is used to determine the target representation vector; at least 2 layers of attention modules among the N layers share the set of shared parameters, and the at least 2 layers include the i-th layer attention module; and determining a target information recognition result according to the target representation vector, where the target information recognition result represents the target information recognized from the target media resource.
According to another aspect of the embodiments of this application, an information recognition apparatus based on an attention module is further provided, including: an acquisition module, configured to obtain the target media resource features of a target media resource and input them into a target information recognition model, where the model includes N layers of attention modules and N is a positive integer greater than or equal to 2; a processing module, configured to process the target media resource features through the N layers of attention modules to obtain a target representation vector, with the i-th layer attention module operating on the set of shared parameters and the i-th group of non-shared parameters as described for the method above; and a determination module, configured to determine a target information recognition result according to the target representation vector, where the result represents the target information recognized from the target media resource.
According to yet another aspect of the embodiments of this application, a computer-readable storage medium is further provided, in which a computer program is stored, where the computer program is configured to execute the above information recognition method based on the attention module when run.
According to yet another aspect of the embodiments of this application, a computer program product is provided, including a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, causing the computer device to perform the above information recognition method based on the attention module.
According to yet another aspect of the embodiments of this application, an electronic device is further provided, including a memory and a processor, where the memory stores a computer program and the processor is configured to execute the above information recognition method based on the attention module through the computer program.
In the embodiments of this application, the target media resource features of the target media resource are obtained and input into the target information recognition model that includes N layers of attention modules (N being a positive integer greater than or equal to 2), the features are processed by the N layers of attention modules to obtain the target representation vector as described above, and the target information recognition result is determined from it, the result representing the target information recognized from the target media resource. By determining one set of shared parameters and N groups of non-shared parameters, each layer's representation vector remains associated with the non-shared parameters of the previous layer during the computation of the target representation vector. This reduces the amount of computation of the attention recognition model while avoiding excessive loss in the recognition model: the number of parameters is reduced while the self-attention weights of different layers still differ as needed, so the performance is no weaker than, and can even exceed, that of the original recognition model, balancing model performance and computation, and thereby solving the technical problem in the related art that the attention recognition model suffers a large performance loss in order to accelerate the computation process.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the application environment of an optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 2 is a schematic flowchart of an optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 3 is a schematic diagram of an optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 4 is a schematic diagram of another optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 5 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 6 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 7 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 8 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of an optional information recognition apparatus based on an attention module according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of an optional information recognition product based on an attention module according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of an optional electronic device according to an embodiment of this application.
Detailed Description
To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the scope of protection of this application.
It should be noted that the terms "first", "second", etc. in the specification, claims and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application described here can be implemented in an order other than that illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.
First, some of the nouns or terms that appear in the description of the embodiments of this application are explained as follows:
Attention mechanism: the idea of applying human perception and attention behavior to a machine, so that the machine learns to perceive the important and unimportant parts of the data.
Self/Intra Attention (self-attention mechanism): the weight assigned to each input item depends on the interaction between the input items, that is, which input items to attend to is decided by an internal "vote" among the input items; it has the advantage of parallel computation when processing very long inputs.
This application is described below with reference to the embodiments:
According to one aspect of the embodiments of this application, an information recognition method based on an attention module is provided. Optionally, in this embodiment, the method can be applied to the hardware environment composed of a server 101 and a terminal device 103 as shown in FIG. 1. As shown in FIG. 1, the server 101 is connected to the terminal 103 through a network and can provide services for the terminal device or for applications installed on it; the application may be a video application, an instant messaging application, a browser application, an education application, a conference application, etc. A database 105 can be set up on the server or independently of it to provide data storage services for the server 101, for example a speech data storage server. The network may include, but is not limited to, a wired network (including a local area network, a metropolitan area network and a wide area network) and a wireless network (including Bluetooth, WIFI and other networks enabling wireless communication). The terminal device 103 may be a terminal configured with an application, including but not limited to at least one of: a mobile phone (such as an Android phone or an iOS phone), a notebook computer, a tablet computer, a handheld computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft and other computer devices. The server may be a single server, a server cluster composed of multiple servers, or a cloud server. The application 107 using the above information recognition method based on the attention module is displayed through the terminal device 103 or another connected display device.
With reference to FIG. 1, the above information recognition method based on the attention module can be implemented on the terminal device 103 through the following steps:
S1: obtain, on the terminal device 103, the target media resource features of the target media resource and input them into the target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2;
S2: process, on the terminal device 103, the target media resource features through the N layers of attention modules to obtain the target representation vector, with the i-th layer attention module using the set of shared parameters and the i-th group of non-shared parameters as described above;
S3: determine, on the terminal device 103, the target information recognition result according to the target representation vector, where the target information recognition result represents the target information recognized from the target media resource.
Optionally, in this embodiment, the above information recognition method based on the attention module can also be implemented by a server, for example the server 101 shown in FIG. 1, or jointly by the terminal device and the server.
The above is only an example, and this embodiment does not impose any specific limitation.
Optionally, as an optional implementation, as shown in FIG. 2, the above information recognition method based on the attention module includes:
S202: obtain the target media resource features of the target media resource and input them into the target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2;
S204: process the target media resource features through the N layers of attention modules to obtain a target representation vector, where the i-th layer attention module among the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module from them, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, it is used to determine the target representation vector; at least 2 of the N layers of attention modules share the set of shared parameters, and these at least 2 layers include the i-th layer attention module;
S206: determine the target information recognition result according to the target representation vector, where the target information recognition result represents the target information recognized from the target media resource.
Optionally, in the embodiments of this application, the above information recognition method based on the attention module can be applied to, but is not limited to, speech conversation scenarios, emotion recognition scenarios and image recognition scenarios in the field of cloud technology.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, etc. applied based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites and other portal websites. With the rapid development and application of the Internet industry, every item may in the future have its own identification mark, which needs to be transmitted to the background system for logical processing; data at different levels will be processed separately, and all kinds of industry data require strong system backing, which can only be achieved through cloud computing.
Cloud computing refers to the delivery and usage mode of IT infrastructure, namely obtaining the required resources through the network in an on-demand, easily scalable way; in a broad sense it refers to the delivery and usage mode of services, namely obtaining the required services through the network in an on-demand, easily scalable way. Such services can be related to IT, software and the Internet, or can be other services. Cloud computing is the product of the development and fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization and load balancing.
With the development of the Internet, real-time data streams and diverse connected devices, and driven by demands such as search services, social networks, mobile commerce and open collaboration, cloud computing has developed rapidly. Unlike previous parallel distributed computing, the emergence of cloud computing will conceptually drive revolutionary changes in the entire Internet model and enterprise management model.
A cloud conference is an efficient, convenient and low-cost form of meeting based on cloud computing technology. Users only need to perform simple, easy-to-use operations through an Internet interface to quickly and efficiently share voice, data files and video with teams and customers around the world, while complex technologies such as the transmission and processing of meeting data are handled for the users by the cloud conference service provider.
At present, domestic cloud conferences mainly focus on service content centered on the SaaS (Software as a Service) mode, including telephone, network, video and other forms of service; a video conference based on cloud computing is called a cloud conference.
In the cloud conference era, the transmission, processing and storage of data are all handled by the computing resources of the video conference vendor. Users no longer need to purchase expensive hardware or install cumbersome software; they only need to open a browser and log in to the corresponding interface to hold an efficient remote meeting.
The cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, which greatly improves the stability, security and availability of meetings. In recent years, video conferencing has been welcomed by many users because it can greatly improve communication efficiency, continuously reduce communication costs and upgrade internal management, and it has been widely used in transportation, finance, operators, education, enterprises and other fields. There is no doubt that after video conferencing adopts cloud computing, it is more attractive in terms of convenience, speed and ease of use, which will surely stimulate a new wave of video conference applications.
Optionally, in the embodiments of this application, in scenarios such as the above cloud conference, automatic meeting minutes in a conference can be realized, including but not limited to, through artificial intelligence cloud services using an end-to-end speech recognition model structure.
So-called artificial intelligence cloud services are generally also called AIaaS (AI as a Service). This is currently a mainstream service mode of artificial intelligence platforms; specifically, an AIaaS platform splits several common types of AI services and provides independent or packaged services in the cloud. This service model is similar to opening an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through API interfaces, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy and operate their own dedicated cloud artificial intelligence services.
Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science; it attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technology. Basic artificial intelligence technology generally includes sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing and machine learning/deep learning. Among them, the key technologies of speech technology are automatic speech recognition (ASR), text-to-speech (TTS) and voiceprint recognition. Making computers able to listen, see, speak and feel is the development direction of future human-computer interaction, and speech has become one of the most promising modes of human-computer interaction in the future.
Exemplarily, the above information recognition method based on the attention module can be applied to, but is not limited to, AI-based remote training, remote consultation, emergency command, remote interviews, open classes, telemedicine, business negotiation and other application scenarios.
Optionally, in the embodiments of this application, FIG. 3 is a schematic diagram of an optional information recognition method based on an attention module according to an embodiment of this application. As shown in FIG. 3, taking application to a cloud conference scenario as an example, it includes an input device 302, a processing device 304 and an output device 306. The input device 302 is used to obtain the speech information sent by the accounts participating in the cloud conference; the speech information can be obtained by, but is not limited to, a microphone or another speech input device. After the speech information is obtained, it is input into the processing device 304 of the cloud server. The processing device 304 may include, but is not limited to, a neural network model based on a general Conformer/Transformer neural network structure; the speech information is input into the neural network model to obtain the representation vector output by it, the representation vector is then processed to obtain the final recognition result, and the result is recorded in a database through the output device 306 and saved in the server as the automatic meeting minutes.
It should be noted that the target media resource may include, but is not limited to, the speech information collected in the above cloud conference scenario; the target representation vector can be understood as a representation vector that can represent the speech information, and it is input into the cloud conference processing device 304 to determine the recognition result.
Exemplarily, the set of shared parameters may include, but is not limited to, the WQ, WK and WV parameters used in the attention mechanism. In the cloud conference application scenario, these parameters are adjusted during the training of the text recognition model (corresponding to the aforementioned target recognition model) to determine the attention weight parameters based on the attention mechanism; when the text recognition model is used to recognize the features corresponding to speech information, the set of shared parameters is kept unchanged and applied to every layer of the N layers of attention modules.
In the cloud conference scenario, the i-th group of non-shared parameters can be understood as being configured independently for each of the N layers of attention modules, including but not limited to the speech representation parameter Hi-1 of the (i-1)-th intermediate layer, and may also include, but is not limited to, the original speech features or speech representation parameters obtained after several layers of simple neural networks.
The i-th layer attention weight parameter may include, but is not limited to, the attention weight parameter Ai of the i-th layer speech features obtained after performing a normalization operation on Qi and Ki; the i-th layer input representation vector may include, but is not limited to, the speech features Vi; and the i-th layer speech representation vector output by the i-th layer attention module is determined from the i-th layer attention weight parameter and the i-th layer input representation vector as Gi=A′iVi.
It should be noted that Gi is the speech representation vector to be input to the next layer of attention module; Gi is used to determine the (i+1)-th intermediate-layer speech representation parameter Hi, and then Gi+1 is determined through the above steps, and so on, until GN output by the last layer of attention module is determined for the downstream speech recognition task to obtain the speech recognition result.
In the cloud conference scenario, at least 2 of the N layers of attention modules share a set of shared parameters, which may include, but is not limited to, the above speech recognition parameters to be learned: WQ, WK and WV.
Exemplarily, in a Transformer-based end-to-end speech recognition model structure, the encoder may also use a Conformer, with the multi-head attention modules (corresponding to the aforementioned attention modules) of the Ne Transformer layers in the encoder sharing one unified multi-head attention computing module (sharing WQ, WK and WV, corresponding to the aforementioned set of shared parameters). The encoder includes Ne attention modules and the decoder includes Nd attention modules. The speech resource enters at Inputs and, after two Conv/2+ReLU stages and an Additional Module, the speech features are obtained and input into the encoding; the speech features are processed through the N layers of attention modules (Multi-Head Attention) to obtain the speech representation vector GN and generate the speech recognition result, or GN is input into the decoder to obtain the speech recognition result.
The above is only an example, and the embodiments of this application do not impose any specific limitation.
Optionally, in the embodiments of this application, FIG. 4 is a schematic diagram of another optional information recognition method based on an attention module according to an embodiment of this application. As shown in FIG. 4, taking application to an emotion recognition scenario as an example, it includes an input device 402, a processing device 404 and an output device 406. The input device 402 is used to obtain images capable of expressing emotions; after the image information is obtained, it is input into the processing device 404 of the cloud server. The processing device 404 may include, but is not limited to, a neural network model composed of a neural network structure; the image information is input into the neural network model to obtain the representation vector output by it, the representation vector is then processed to obtain the final recognition result, and further processing is performed through the output device 406 to save the recognized emotion information into the database.
It should be noted that the target media resource may include, but is not limited to, the image information collected in the above emotion recognition scenario; the target representation vector can be understood as a representation vector that can represent the image information, and it is input into the emotion recognition processing device 404 to determine the recognition result.
Exemplarily, the set of shared parameters may include, but is not limited to, the WQ, WK and WV parameters used in the attention mechanism. In the emotion recognition application scenario, these parameters are adjusted during the training of the recognition model (corresponding to the aforementioned target recognition model) to determine the attention weight parameters based on the attention mechanism; when the model is used to recognize the features corresponding to image information, the set of shared parameters is kept unchanged and applied to every layer of the N layers of attention modules.
In the emotion recognition scenario, the i-th group of non-shared parameters can be understood as being configured independently for each of the N layers of attention modules, including but not limited to the image representation parameter Hi-1 of the (i-1)-th intermediate layer, and may also include, but is not limited to, the original image features or image representation parameters obtained after several layers of simple neural networks.
The i-th layer attention weight parameter may include, but is not limited to, the attention weight parameter Ai of the i-th layer image features obtained after normalizing Qi and Ki; the i-th layer input representation vector may include, but is not limited to, the image features Vi; and the i-th layer image representation vector output by the i-th layer attention module is determined from them as Gi=A′iVi.
It should be noted that Gi is the image representation vector to be input to the next layer of attention module; Gi is used to determine the (i+1)-th intermediate-layer image representation parameter Hi, and then Gi+1 is determined through the above steps, and so on, until GN output by the last layer of attention module is determined for the downstream image recognition task to obtain the image recognition result.
In the emotion recognition scenario, at least 2 of the N layers of attention modules share a set of shared parameters, which may include, but is not limited to, the above image recognition parameters to be learned: WQ, WK and WV.
Exemplarily, in a Transformer-based end-to-end image recognition model structure, the encoder may also use a Conformer, with the multi-head attention modules (corresponding to the aforementioned attention modules) of the Ne Transformer layers in the encoder sharing one unified multi-head attention computing module (sharing WQ, WK and WV, corresponding to the aforementioned set of shared parameters). The encoder includes Ne attention modules and the decoder includes Nd attention modules. The image resource enters at Inputs and, after two Conv/2+ReLU stages and an Additional Module, the image features are obtained and input into the encoding; the image features are processed through the N layers of attention modules (Multi-Head Attention) to obtain the image representation vector GN and generate the image recognition result, or GN is input into the decoder to obtain the image recognition result.
The above is only an example, and the embodiments of this application do not impose any specific limitation.
It should be noted that the above information recognition method based on the attention module can also be applied to processing devices with limited computing resources and memory that cannot support a large amount of computation, such as mobile phones, speakers, small household appliances and embedded products, to recognize speech or image information, so that the recognized text, emotion types, objects, actions, etc. can be applied in downstream scenarios.
Optionally, in the embodiments of this application, the target media resource may include, but is not limited to, media resources to be recognized such as video, audio and pictures; specifically, it may include, but is not limited to, the speech information collected in cloud conference scenarios, the video information played in advertisements and the pictures to be recognized collected in the security field.
Optionally, in the embodiments of this application, the target media resource features may include, but are not limited to, the media resource features extracted by inputting the target media resource into a conventional neural network model, and may be, but are not limited to being, expressed in the form of vectors.
Optionally, in the embodiments of this application, the target information recognition model may include, but is not limited to, a model composed of multiple layers of attention modules; the N layers of attention modules may, but are not limited to, use a unified attention computing module to complete the computing tasks. The target information recognition model may include, but is not limited to, a Transformer-based end-to-end speech recognition model structure, in which the encoder may also use a Conformer.
For example, FIG. 5 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application; as shown in FIG. 5, Ne attention modules make up the above Transformer-based end-to-end speech recognition model structure.
Optionally, in the embodiments of this application, the target representation vector can be understood as a representation vector capable of representing the target media resource; it is input into a subsequent processing model to determine the recognition result and then generate data such as text required by the business.
Optionally, in the embodiments of this application, the set of shared parameters may include, but is not limited to, the WQ, WK and WV parameters used in the attention mechanism. These parameters are adjusted during the training of the target information recognition model to determine the attention weight parameters based on the attention mechanism; when the target information recognition model is used to recognize the target media resource features, the set of shared parameters remains unchanged and is applied to every layer of the N layers of attention modules.
For example, FIG. 6 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application; as shown in FIG. 6, each layer of attention module (Multi-Head Attention) takes Q, K and V as inputs, associated with WQ, WK and WV respectively, to obtain the representation vector of that layer.
Optionally, in the embodiments of this application, the i-th group of non-shared parameters can be understood as being configured independently for each of the N layers of attention modules, and may include, but is not limited to, the (i-1)-th intermediate-layer representation parameter Hi-1, and may also include, but is not limited to, the original features or representation parameters obtained after several layers of simple neural networks.
Optionally, in the embodiments of this application, the i-th layer attention weight parameter may include, but is not limited to, the i-th layer attention weight parameter Ai obtained after performing a normalization operation on Qi and Ki; the i-th layer input representation vector may include, but is not limited to, Vi; and the i-th layer representation vector output by the i-th layer attention module is determined from them as Gi=A′iVi.
It should be noted that Gi is the representation vector to be input to the next layer of attention module; Gi is used to determine the (i+1)-th intermediate-layer representation parameter Hi, and then Gi+1 is determined through the above steps, and so on, until GN output by the last layer of attention module is determined for the downstream recognition task to obtain the target information recognition result.
That is, "when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, the i-th layer representation vector is used to determine the target representation vector" can be understood as follows: when i<N, Gi is used to determine Hi; when i=N, Gi is used to determine GN.
Optionally, in the embodiments of this application, at least 2 of the N layers of attention modules share a set of shared parameters, which may include, but is not limited to, the above WQ, WK and WV; in other words, multiple sets of WQ, WK and WV can be configured among the N layers of attention modules as shared parameters, or a single set can be configured.
Optionally, in the embodiments of this application, determining the target information recognition result according to the target representation vector may include, but is not limited to, generating the target information recognition result directly from the target representation vector output by the encoder that includes the N layers of attention modules, and may also include, but is not limited to, inputting the representation vector output by that encoder into a decoder, so that the target information recognition result is generated through the decoder's N layers of mask modules and N layers of attention modules.
Optionally, in the embodiments of this application, the target information recognition result represents the target information recognized from the target media resource, and may include, but is not limited to, the semantic information contained in the target media resource, the emotion type information contained in the target media resource, etc.
For example, FIG. 7 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application. As shown in FIG. 7, it includes a Transformer-based end-to-end speech recognition model structure; the encoder may also use a Conformer, with the multi-head attention modules (corresponding to the aforementioned attention modules) of the Ne Transformer layers in the encoder sharing one unified multi-head attention computing module (sharing WQ, WK and WV, corresponding to the aforementioned set of shared parameters). Similarly, in the decoder part on the right side of FIG. 7, the multi-head attention modules and the masked multi-head attention modules (Masked Multi-Head Attention) can each share a set of modules (sharing WQ, WK and WV).
It should be noted that the encoder includes Ne attention modules and the decoder includes Nd attention modules. The target media resource enters at the encoder and, after two Conv/2+ReLU stages (convolution layer and activation function) and an Additional Module (an optional neural network module), the target media resource features are obtained and input into the encoding; the target media resource features are processed through the N layers of attention modules (Multi-Head Attention) to obtain the target representation vector GN and generate the target information recognition result, or GN is input into the decoder to obtain the target information recognition result.
Exemplarily, FIG. 8 is a schematic diagram of yet another optional information recognition method based on an attention module according to an embodiment of this application. As shown in FIG. 8, the set of shared parameters may be implemented by, but is not limited to, a unified self-attention computing module, in which the above WQ, WK and WV are stored, so that these parameters are used to compute the attention weight parameters of each layer separately.
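Before moving on, the following is a minimal Python/NumPy sketch of such a unified self-attention computing module. The class name, the random initialization and the toy shapes are illustrative assumptions rather than the patented implementation; the point shown is that one object stores WQ, WK and WV once and every layer calls the same instance.

```python
import numpy as np

class UnifiedSelfAttention:
    # one instance stores the shared W_Q/W_K/W_V; every encoder layer calls
    # the same instance, so the projections exist only once in the model
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        self.W_Q = rng.standard_normal((d_model, d_model)) * scale
        self.W_K = rng.standard_normal((d_model, d_model)) * scale
        self.W_V = rng.standard_normal((d_model, d_model)) * scale

    def __call__(self, H):
        Q, K, V = H @ self.W_Q, H @ self.W_K, H @ self.W_V
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = A / A.sum(axis=-1, keepdims=True)     # softmax over keys
        return A @ V

shared_attention = UnifiedSelfAttention(d_model=8)
H = np.ones((4, 8))                               # toy intermediate representation
for _ in range(3):                                # three layers, one shared module
    H = shared_attention(H)
```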
Through the embodiments of this application, the target media resource features of the target media resource are obtained and input into the target information recognition model that includes N layers of attention modules (N being a positive integer greater than or equal to 2), the features are processed by the N layers of attention modules to obtain the target representation vector as described above (with the target media resource features determining the first group of non-shared parameters used by the first layer attention module among the N layers), and the target information recognition result is determined from the target representation vector. By determining one set of shared parameters and N groups of non-shared parameters, each layer's representation vector is associated with the non-shared parameters of the previous layer while the N layers of attention modules determine the target representation vector. This reduces the amount of computation of the attention recognition model while avoiding excessive loss in the recognition model: the number of model parameters is reduced, yet the self-attention weights of different layers can still differ as needed, so the performance is no weaker than, and can even exceed, that of the original recognition model, balancing model performance and computation, and thereby solving the technical problem in the related art that the attention recognition model suffers a large performance loss in order to accelerate the computation process.
As an optional scheme, when i is greater than 1, the i-th layer attention weight parameter and the i-th layer input representation vector are determined as follows:
determining the i-th layer attention weight parameter according to the first part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the set of shared parameters includes the first part of the shared parameters and the second part of the shared parameters, and the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module;
determining the i-th layer input representation vector according to the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter;
performing a weighted sum of the i-th layer attention weight parameter and the i-th layer input representation vector to obtain the i-th layer representation vector output by the i-th layer attention module.
Optionally, in the embodiments of this application, the first part of the shared parameters can be understood as the above WQ and WK, and the (i-1)-th intermediate-layer representation parameter can be understood as Hi-1, the output obtained after the previous layer's output Gi-1 passes through the feed-forward neural network, where Hi-1 is determined from Gi-1. For the Multi-Head Attention module, the inputs are H and the previous layer's attention value A′i-1 and the output is G, where A′i is the i-th layer attention weight parameter determined by (WQ, WK, WV), and H is obtained after G passes through the Feed Forward Network.
Optionally, in the embodiments of this application, the i-th layer attention weight parameter may be, but is not limited to being, denoted A′i, where A′i=f(Ai, A′i-1); the choice of f is flexible, for example f(Ai, A′i-1)=(1-α)Ai+αA′i-1, 0≤α≤1.
Optionally, in the embodiments of this application, the second part of the shared parameters can be understood as the above WV, and the i-th layer input representation vector can be understood as Vi, an intermediate-layer representation determined from the representation features input by the previous layer, with Gi=A′iVi.
As an optional scheme, determining the i-th layer attention weight parameter according to the first part of the shared parameters and the (i-1)-th intermediate-layer representation parameter includes:
when the first part of the shared parameters includes the first shared parameter WQ and the second shared parameter WK and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiplying Hi-1 by WQ and by WK to obtain the first correlation parameter Qi and the second correlation parameter Ki used in the i-th layer attention module;
normalizing the first correlation parameter Qi and the second correlation parameter Ki to obtain the initial attention weight parameter Ai of the i-th layer attention module;
determining the i-th layer attention weight parameter according to the initial attention weight parameter Ai and the (i-1)-th layer attention weight parameter A′i-1 used in the (i-1)-th layer attention module.
Optionally, in the embodiments of this application, when the first part of the shared parameters includes WQ and WK and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiplying Hi-1 by WQ and WK to obtain Qi and Ki may include, but is not limited to, the following formulas, where WQ and WK are both in matrix form:
Qi=Hi-1WQ
Ki=Hi-1WK
Optionally, in the embodiments of this application, normalizing Qi and Ki to obtain the initial attention weight parameter Ai of the i-th layer attention module may include, but is not limited to, the following formula:
Ai=softmax(QiKi^T/√dK)
where Qi, Ki and Ai are all intermediate results and dK denotes the length of K.
As an optional scheme, determining the i-th layer attention weight parameter according to the initial attention weight parameter Ai and the (i-1)-th layer attention weight parameter A′i-1 used in the (i-1)-th layer attention module includes:
performing a weighted sum of the initial attention weight parameter Ai and the (i-1)-th layer attention weight parameter A′i-1 to obtain the i-th layer attention weight parameter.
Optionally, in the embodiments of this application, determining the i-th layer attention weight parameter according to Ai and A′i-1 may include, but is not limited to, the following formula:
A′i=f(Ai, A′i-1)
where the choice of f is flexible, for example f(Ai, A′i-1)=(1-α)Ai+αA′i-1, 0≤α≤1: when α=1, this is the conventional self-attention weight-value sharing mode (that is, what is shared is the weight value rather than the to-be-learned parameters WQ, WK and WV used to compute it); when α=0, the current layer does not depend on the previous layer's self-attention weights. f can be a neural network of any complexity.
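As a tiny illustration of the two extremes of this mixing rule (the helper name is hypothetical):

```python
import numpy as np

def f_mix(A_i, A_prev, alpha):
    # f(A_i, A'_{i-1}) = (1 - alpha) * A_i + alpha * A'_{i-1}
    return (1 - alpha) * A_i + alpha * A_prev

A_i, A_prev = np.eye(3), np.full((3, 3), 1.0 / 3)
assert np.allclose(f_mix(A_i, A_prev, 1.0), A_prev)  # alpha=1: pure weight-value sharing
assert np.allclose(f_mix(A_i, A_prev, 0.0), A_i)     # alpha=0: no cross-layer dependence
```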
As an optional scheme, when the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the (i+1)-th layer attention weight parameter and the (i+1)-th layer input representation vector of the (i+1)-th layer attention module are determined as follows:
determining the (i+1)-th layer attention weight parameter according to the first part of the shared parameters and the i-th intermediate-layer representation parameter, where the i-th intermediate-layer representation parameter is determined from the i-th layer representation vector output by the i-th layer attention module;
determining the (i+1)-th layer input representation vector according to the second part of the shared parameters and the i-th intermediate-layer representation parameter, where the (i+1)-th group of non-shared parameters includes the i-th intermediate-layer representation parameter;
performing a weighted sum of the (i+1)-th layer attention weight parameter and the (i+1)-th layer input representation vector to obtain the (i+1)-th layer representation vector output by the (i+1)-th layer attention module.
Optionally, in the embodiments of this application, the (i+1)-th layer attention module can determine the (i+1)-th layer attention weight parameter A′i+1 and the (i+1)-th layer input representation vector Vi+1 by using the same first and second parts of the shared parameters as the i-th layer attention module.
That is, in the embodiments of this application, each layer of attention module uses the shared attention parameters (WQ, WK, WV) for feature processing to obtain the representation vector of that layer.
As an optional scheme, the i-th layer attention weight parameter and the i-th layer input representation vector are determined as follows:
determining the i-th layer attention weight parameter according to the shared attention weight parameter and the weighting parameter used in the i-th layer attention module, where the set of shared parameters includes the shared attention weight parameter and the second part of the shared parameters;
determining the i-th layer input representation vector according to the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module, and the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter;
performing a weighted sum of the i-th layer attention weight parameter and the i-th layer input representation vector to obtain the i-th layer representation vector output by the i-th layer attention module.
Optionally, in the embodiments of this application, the shared attention weight parameter can be understood as the above A, and the weighting parameter used in the i-th layer attention module may include, but is not limited to, a pre-configured Wi; in this case, the i-th layer attention weight parameter is determined by the following formula:
Ai=fi(A)
where the function f enables different layers to obtain different final attention weights Ai from the same initial attention value A.
Optionally, in the embodiments of this application, the i-th layer input representation vector is determined by the following formula:
Vi=Hi-1WV
where the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module, and the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter; then Gi=AiVi.
As an optional scheme, determining the i-th layer attention weight parameter according to the shared attention weight parameter and the weighting parameter used in the i-th layer attention module includes:
determining the sum of the shared attention weight parameter and the weighting parameter used in the i-th layer attention module as the i-th layer attention weight parameter.
Exemplarily, the choice of f is flexible; for example, the sum of the shared attention weight parameter and the weighting parameter used in the i-th layer attention module is determined as the i-th layer attention weight parameter, that is, fi(A)=A+Wi.
As an optional scheme, the method further includes:
obtaining the initial characterization features of the target media resource, where the initial characterization features are the target media resource features or features converted from the target media resource features;
when the set of shared parameters also includes the first part of the shared parameters and the first part includes the first shared parameter WQ and the second shared parameter WK, multiplying the initial characterization features by WQ and by WK to obtain the first shared correlation parameter Q and the second shared correlation parameter K;
normalizing the first shared correlation parameter Q and the second shared correlation parameter K to obtain the shared attention weight parameter.
Optionally, in the embodiments of this application, the initial characterization features may include, but are not limited to, the target media resource features or features obtained by inputting the target media resource features into another neural network model for conversion.
Optionally, in the embodiments of this application, normalizing the first shared correlation parameter Q and the second shared correlation parameter K to obtain the shared attention weight parameter may include, but is not limited to, the following formula:
A=softmax(QK^T/√dK)
where A denotes the shared attention weight parameter and dK denotes the length of K.
As an optional scheme, when the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the (i+1)-th layer attention weight parameter and the (i+1)-th layer input representation vector of the (i+1)-th layer attention module are determined as follows:
determining the (i+1)-th layer attention weight parameter according to the shared attention weight parameter and the weighting parameter used in the (i+1)-th layer attention module;
determining the (i+1)-th layer input representation vector according to the second part of the shared parameters and the i-th intermediate-layer representation parameter, where the i-th intermediate-layer representation parameter is determined from the i-th layer representation vector output by the i-th layer attention module, and the (i+1)-th group of non-shared parameters includes the i-th intermediate-layer representation parameter;
performing a weighted sum of the (i+1)-th layer attention weight parameter and the (i+1)-th layer input representation vector to obtain the (i+1)-th layer representation vector output by the (i+1)-th layer attention module.
Optionally, in the embodiments of this application, the shared attention weight parameter can be understood as the above A, the weighting parameter used in the (i+1)-th layer attention module as Wi, the (i+1)-th layer attention weight parameter as Ai, the second part of the shared parameters as WV, the i-th intermediate-layer representation parameter as Hi-1, the (i+1)-th layer input representation vector as Vi, and the (i+1)-th layer representation vector as Gi.
That is, it may be determined by, but is not limited to, the following formulas:
Qi=Hi-1WQ
Ki=Hi-1WK
Ai=softmax(QiKi^T/√dK)
A′i=f(Ai, A′i-1)
Gi=A′iVi
where H denotes the input of the attention module; WQ, WK and WV denote the parameters to be learned, in matrix form; Q, K, V and A are all intermediate results; dK denotes the length of K. A′i is the self-attention value of the i-th Transformer layer, f is a custom function, and G is the output of the self-attention module. The attention modules of the different Transformer layers in the encoder share WQ, WK and WV, and the function f allows the previous layer's result to be consulted when computing the current layer's attention. The choice of f is flexible, for example f(Ai, A′i-1)=(1-α)Ai+αA′i-1, 0≤α≤1, and f can be a neural network of any complexity.
As an optional scheme, determining the i-th layer input representation vector according to the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter includes:
when the second part of the shared parameters includes the third shared parameter WV and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiplying Hi-1 by WV to obtain the i-th layer input representation vector.
Optionally, in the embodiments of this application, this may be determined by, but is not limited to, the following formula:
Vi=Hi-1WV
As an optional scheme, the above method further includes:
when the (i-1)-th layer representation vector output by the (i-1)-th layer attention module has been obtained, obtaining the (i-k)-th intermediate-layer representation parameter, where 1<k<i and the (i-k)-th intermediate-layer representation parameter is determined from the (i-k)-th layer representation vector output by the (i-k)-th layer attention module;
determining the (i-1)-th intermediate-layer representation parameter according to the (i-1)-th layer representation vector and the (i-k)-th intermediate-layer representation parameter.
Optionally, in the embodiments of this application, the (i-1)-th layer representation vector can be understood as Gi-1, the (i-k)-th intermediate-layer representation parameter as Hi-k, and the (i-k)-th layer representation vector as Gi-k.
As shown in FIG. 7, the Gi-1 output by the "Multi-Head Attention" module is added to the Hi-k coming from the (i-k)-th layer attention module, and then passes through the "Layer Norm" module and the "Feed Forward" module to obtain Hi-1.
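A minimal sketch of this residual path, assuming NumPy and a caller-supplied feed-forward module; the exact ordering of normalization and feed-forward inside the real block is not specified here, so this composition is an assumption.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def next_intermediate(G_prev, H_skip, feed_forward):
    # FIG. 7 path: the attention output G_{i-1} is added to the H_{i-k}
    # carried over from an earlier layer, then Layer Norm and the
    # Feed Forward module produce H_{i-1}
    return feed_forward(layer_norm(G_prev + H_skip))
```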
As an optional scheme, processing the target media resource features through the N layers of attention modules to obtain the target representation vector includes:
when the at least 2 layers of attention modules are M layers of attention modules and M is less than N, performing the following steps for each p-th layer attention module among the N layers of attention modules other than the M layers:
according to a pre-configured sharing relationship, determining the j-th layer representation vector output by the j-th layer attention module among the M layers as the p-th layer representation vector output by the p-th layer attention module, where the sharing relationship indicates that the j-th layer representation vector output by the j-th layer attention module is shared with the p-th layer attention module.
Optionally, in this embodiment, the M layers of attention modules can be configured in advance, so that each p-th layer attention module among the N layers other than the M layers determines, according to the pre-configured sharing relationship, the j-th layer representation vector output by the j-th layer attention module among the M layers as its own p-th layer representation vector.
That is, because what is shared is not the attention weight parameters themselves but the to-be-learned parameters used to compute them, the amount of computation increases; in this case, letting neighboring attention modules share the same computed result reduces the number of parameters while still allowing the self-attention weights of different layers to differ as needed, so the performance is no weaker than, and can even exceed, that of an attention model that directly shares the self-attention weights.
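To make the sharing relationship concrete, here is a small Python sketch; the layer indices in the map and the callable-per-layer structure are hypothetical illustrations, not the patented configuration.

```python
# hypothetical sharing relationship: layers 0-2 compute their own output,
# layer 3 reuses layer 1's representation vector (p=3 -> j=1)
share_map = {3: 1}

def run_layers(X, layers, share_map):
    # `layers` holds one callable per layer; a layer listed in share_map
    # skips its own computation and reuses the mapped layer's output
    outputs, H = [], X
    for p, layer in enumerate(layers):
        H = outputs[share_map[p]] if p in share_map else layer(H)
        outputs.append(H)
    return outputs[-1]
```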
As an optional scheme, for the i-th layer attention module, processing the target media resource features through the N layers of attention modules to obtain the target representation vector includes:
when the i-th layer attention module is a T-head attention module and T is a positive integer greater than or equal to 2, determining, through the T-head attention module, T initial representation vectors of the i-th layer according to the T subgroups of shared parameters and the i-th group of non-shared parameters respectively, and performing a weighted sum of the T initial representation vectors to obtain the i-th layer representation vector output by the i-th layer attention module, where the set of shared parameters includes the T subgroups of shared parameters.
Optionally, in this embodiment, all of the N layers of attention modules may be T-head attention modules, or only some of them; when the i-th layer attention module is a T-head attention module, each single-head attention model is assigned its corresponding shared parameters, so that the T initial representation vectors of the i-th layer are determined from the T subgroups of shared parameters and the non-shared parameters, and they are then weighted and summed to obtain the i-th layer representation vector output by the i-th layer attention module.
This application is further explained below with reference to a specific embodiment:
This application can be used for automatic meeting minutes in online meetings. As shown in FIG. 8, the unified self-attention computing module has two forms; the encoder is taken as an example below (the decoder works the same way):
1) Layer-by-layer dependence mode: when computing the current layer's attention, the previous layer's result can be consulted, which makes the attention more consistent and the training more stable.
Specifically, a single head in the multi-head attention module of the i-th Transformer layer is computed as:
Qi=Hi-1WQ
Ki=Hi-1WK
Vi=Hi-1WV
Ai=softmax(QiKi^T/√dK)
A′i=f(Ai, A′i-1)
Gi=A′iVi
In the above formulas, H denotes the input of the multi-head attention module (an intermediate-layer representation); WQ, WK and WV denote the parameters to be learned, in matrix form; Q, K, V and A are all intermediate results; dK denotes the length of K. A′i is the self-attention value of the i-th Transformer layer, f is a custom function, and G is the output of the self-attention module (still an intermediate-layer representation). The other single-head attention computations in the multi-head attention module are analogous. The multi-head attention modules of the different Transformer layers in the encoder share WQ, WK and WV, and the function f allows the previous layer's result to be consulted when computing the current layer's attention. The choice of f is flexible, for example f(Ai, A′i-1)=(1-α)Ai+αA′i-1, 0≤α≤1: when α=1, this is the attention weight-value sharing mode; when α=0, the current layer does not depend on the previous layer's self-attention weights. f can be a neural network of any complexity.
Because the amount of computation increases, several neighboring layers can share the same computed result.
2) Per-layer parallel computing mode. Specifically, a single head in the multi-head attention module of the i-th Transformer layer is computed as:
Qi=XWQ
Ki=XWK
Vi=Hi-1WV
A=softmax(QiKi^T/√dK)
Ai=fi(A)
Gi=AiVi
In the above formulas, H denotes the input of the multi-head attention module (an intermediate-layer representation); X denotes the input of the entire encoder (usually the original speech features, possibly after several layers of simple neural networks); WQ, WK and WV denote the parameters to be learned, in matrix form; Q, K, V and A are all intermediate results; dK denotes the length of K. Ai is the self-attention value of the i-th Transformer layer, f is a custom function whose instances in different Transformer layers are independent of one another, and G is the output of the self-attention module (still an intermediate-layer representation). The other single-head attention computations in the multi-head attention module are analogous. The multi-head attention modules of the different Transformer layers in the encoder share Q, K and V, and the function f enables different layers to obtain different final attention weights Ai from the same initial attention value A. The choice of f is flexible, for example fi(A)=A+Wi, or any other neural network of arbitrary complexity.
For an end-to-end speech recognition system based on the Conformer/Transformer structure, the main factor affecting its computational efficiency is the layer-by-layer computation of the self-attention mechanism; with the per-layer parallel computing mode of this application, all attention weights of the other layers can be obtained as soon as the original input is available, which will greatly improve computational efficiency.
The model structure proposed in this application outperforms the conventional model structure on multiple speech data sets while using fewer model parameters, especially on small data sets. The per-layer parallel computing mode in this application greatly improves computational efficiency.
The model structure proposed in this application converges faster than the conventional model structure.
It should be understood that the specific implementations of this application involve data related to user information; when the above embodiments of this application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
It should be noted that, for the sake of simple description, each of the foregoing method embodiments is expressed as a series of action combinations; however, those skilled in the art should know that this application is not limited by the described order of actions, because according to this application some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by this application.
According to another aspect of the embodiments of this application, an information recognition apparatus based on an attention module for implementing the above information recognition method based on the attention module is further provided. As shown in FIG. 9, the apparatus includes:
an acquisition module 902, configured to obtain the target media resource features of the target media resource and input them into the target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2;
a processing module 904, configured to process the target media resource features through the N layers of attention modules to obtain a target representation vector, where the i-th layer attention module among the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module from them, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, it is used to determine the target representation vector; at least 2 of the N layers of attention modules share the set of shared parameters, and these at least 2 layers include the i-th layer attention module;
a determination module 906, configured to determine the target information recognition result according to the target representation vector, where the target information recognition result represents the target information recognized from the target media resource.
As an optional scheme, the processing module 904 is further configured to: determine the i-th layer attention weight parameter according to the first part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the set of shared parameters includes the first part and the second part of the shared parameters, and the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module; and determine the i-th layer input representation vector according to the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter.
As an optional scheme, the processing module 904 is further configured to: when the first part of the shared parameters includes the first shared parameter WQ and the second shared parameter WK and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiply Hi-1 by WQ and by WK to obtain the first correlation parameter Qi and the second correlation parameter Ki used in the i-th layer attention module; normalize Qi and Ki to obtain the initial attention weight parameter Ai of the i-th layer attention module; and determine the i-th layer attention weight parameter according to Ai and the (i-1)-th layer attention weight parameter A′i-1 used in the (i-1)-th layer attention module.
As an optional scheme, the processing module 904 is further configured to perform a weighted sum of the initial attention weight parameter Ai and the (i-1)-th layer attention weight parameter A′i-1 to obtain the i-th layer attention weight parameter.
As an optional scheme, when the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the processing module 904 is further configured to: determine the (i+1)-th layer attention weight parameter according to the first part of the shared parameters and the i-th intermediate-layer representation parameter, where the i-th intermediate-layer representation parameter is determined from the i-th layer representation vector output by the i-th layer attention module; and determine the (i+1)-th layer input representation vector according to the second part of the shared parameters and the i-th intermediate-layer representation parameter, where the (i+1)-th group of non-shared parameters includes the i-th intermediate-layer representation parameter.
As an optional scheme, the processing module 904 is further configured to: determine the i-th layer attention weight parameter according to the shared attention weight parameter and the weighting parameter used in the i-th layer attention module, where the set of shared parameters includes the shared attention weight parameter and the second part of the shared parameters; and determine the i-th layer input representation vector according to the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, where the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module, and the i-th group of non-shared parameters includes the (i-1)-th intermediate-layer representation parameter.
As an optional scheme, the processing module 904 is further configured to determine the sum of the shared attention weight parameter and the weighting parameter used in the i-th layer attention module as the i-th layer attention weight parameter.
As an optional scheme, the processing module 904 is further configured to: obtain the initial characterization features of the target media resource, where the initial characterization features are the target media resource features or features converted from them; when the set of shared parameters also includes the first part of the shared parameters and the first part includes the first shared parameter WQ and the second shared parameter WK, multiply the initial characterization features by WQ and by WK to obtain the first shared correlation parameter Q and the second shared correlation parameter K; and normalize Q and K to obtain the shared attention weight parameter.
As an optional scheme, when the at least 2 layers of attention modules also include the (i+1)-th layer attention module, the processing module 904 is further configured to: determine the (i+1)-th layer attention weight parameter according to the shared attention weight parameter and the weighting parameter used in the (i+1)-th layer attention module; and determine the (i+1)-th layer input representation vector according to the second part of the shared parameters and the i-th intermediate-layer representation parameter, where the i-th intermediate-layer representation parameter is determined from the i-th layer representation vector output by the i-th layer attention module, and the (i+1)-th group of non-shared parameters includes the i-th intermediate-layer representation parameter.
As an optional scheme, the processing module 904 is further configured to: when the second part of the shared parameters includes the third shared parameter WV and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiply Hi-1 by WV to obtain the i-th layer input representation vector.
As an optional scheme, the processing module 904 is further configured to: when the (i-1)-th layer representation vector output by the (i-1)-th layer attention module has been obtained, obtain the (i-k)-th intermediate-layer representation parameter, where 1<k<i and the (i-k)-th intermediate-layer representation parameter is determined from the (i-k)-th layer representation vector output by the (i-k)-th layer attention module; and determine the (i-1)-th intermediate-layer representation parameter according to the (i-1)-th layer representation vector and the (i-k)-th intermediate-layer representation parameter.
As an optional scheme, the processing module 904 is further configured to: when the at least 2 layers of attention modules are M layers of attention modules and M is less than N, perform the following steps for each p-th layer attention module among the N layers of attention modules other than the M layers: according to the pre-configured sharing relationship, determine the j-th layer representation vector output by the j-th layer attention module among the M layers as the p-th layer representation vector output by the p-th layer attention module, where the sharing relationship indicates that the j-th layer representation vector output by the j-th layer attention module is shared with the p-th layer attention module.
As an optional scheme, the processing module 904 is further configured to: when the i-th layer attention module is a T-head attention module and T is a positive integer greater than or equal to 2, determine, through the T-head attention module, T initial representation vectors of the i-th layer according to the T subgroups of shared parameters and the i-th group of non-shared parameters respectively, and perform a weighted sum of the T initial representation vectors to obtain the i-th layer representation vector output by the i-th layer attention module, where the set of shared parameters includes the T subgroups of shared parameters.
According to one aspect of this application, a computer program product is provided; it includes a computer program/instructions containing the program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processor 1001, the various functions provided by the embodiments of this application are executed.
The serial numbers of the above embodiments of this application are only for description and do not represent the superiority or inferiority of the embodiments.
FIG. 10 schematically shows a structural block diagram of the computer system of an electronic device for implementing an embodiment of this application.
It should be noted that the computer system 1000 of the electronic device shown in FIG. 10 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of this application.
As shown in FIG. 10, the computer system 1000 includes a central processing unit 1001 (CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 1002 (ROM) or a program loaded from the storage part 1008 into a random access memory 1003 (RAM). The random access memory 1003 also stores the various programs and data required for system operation. The central processing unit 1001, the read-only memory 1002 and the random access memory 1003 are connected to one another through a bus 1004. An input/output interface 1005 (I/O interface) is also connected to the bus 1004.
The following components are connected to the input/output interface 1005: an input part 1006 including a keyboard, a mouse, etc.; an output part 1007 including a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker, etc.; a storage part 1008 including a hard disk, etc.; and a communication part 1009 including a network interface card such as a LAN card or a modem. The communication part 1009 performs communication processing via a network such as the Internet. A drive 1100 is also connected to the input/output interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1100 as needed, so that a computer program read from it can be installed into the storage part 1008 as needed.
In particular, according to the embodiments of this application, the processes described in the method flowcharts can be implemented as computer software programs. For example, the embodiments of this application include a computer program product including a computer program carried on a computer-readable medium, the computer program containing the program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processor 1001, the various functions defined in the system of this application are executed.
According to yet another aspect of the embodiments of this application, an electronic device for implementing the above information recognition method based on the attention module is further provided; the electronic device may be the terminal device or the server shown in FIG. 1. This embodiment is explained by taking the electronic device as a terminal device as an example. As shown in FIG. 11, the electronic device includes a memory 1102 and a processor 1104; a computer program is stored in the memory 1102, and the processor 1104 is configured to execute the steps in any of the above method embodiments through the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one of multiple network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to perform the following steps through a computer program:
S1: obtain the target media resource features of the target media resource and input them into the target information recognition model, where the target information recognition model includes N layers of attention modules and N is a positive integer greater than or equal to 2;
S2: process the target media resource features through the N layers of attention modules to obtain a target representation vector, where the i-th layer attention module among the N layers is used to determine the i-th layer attention weight parameter and the i-th layer input representation vector according to a set of shared parameters and the i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module from them, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, it is used to determine the target representation vector; at least 2 of the N layers of attention modules share a set of shared parameters, and these at least 2 layers include the i-th layer attention module;
S3: determine the target information recognition result according to the target representation vector, where the target information recognition result represents the target information recognized from the target media resource.
Optionally, a person of ordinary skill in the art can understand that the structure shown in FIG. 11 is only schematic; the electronic device may also be a terminal device such as a smart phone, a tablet computer, a handheld computer, a mobile Internet device (MID) or a PAD.
The memory 1102 can be used to store software programs and modules, such as the program instructions/modules corresponding to the information recognition method and apparatus based on the attention module in the embodiments of this application; the processor 1104 runs the software programs and modules stored in the memory 1102 to perform various functional applications and data processing, that is, to implement the above information recognition method based on the attention module.
Optionally, the transmission device 1106 is used to receive or send data via a network.
In addition, the electronic device also includes: a display 1108 for displaying the target information recognition result, and a connection bus 1110 for connecting the various module components in the electronic device.
In other embodiments, the terminal device or server may be a node in a distributed system, where the distributed system may be a blockchain system formed by multiple nodes connected through network communication. Nodes can form a peer-to-peer (P2P) network, and any form of computing device, such as a server, a terminal or other electronic equipment, can become a node in the blockchain system by joining the peer-to-peer network.
According to one aspect of this application, a computer-readable storage medium is provided; a processor of a computer device reads computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the information recognition method based on the attention module provided in the various optional implementations of the above information recognition aspect.
The embodiments of this application also provide a computer program product including a computer program, which, when run on a computer, causes the computer to execute the method provided in the above embodiments.
Optionally, in this embodiment, a person of ordinary skill in the art can understand that all or part of the steps of the various methods of the above embodiments can be completed by a program instructing the hardware related to the terminal device; the program can be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
The serial numbers of the above embodiments of this application are only for description and do not represent the superiority or inferiority of the embodiments.
If the integrated units in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they can be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of this application, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
In the above embodiments of this application, the description of each embodiment has its own emphasis; for a part not detailed in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client can be implemented in other ways. The device embodiments described above are only schematic; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of units or modules may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit. The integrated units can be implemented in the form of hardware or in the form of software functional units.
The above is only the preferred implementation of this application. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principles of this application, and these improvements and refinements should also be regarded as falling within the scope of protection of this application.

Claims (17)

  1. An information recognition method based on an attention module, the method being executed by a computer device and comprising:
    obtaining target media resource features of a target media resource, and inputting the target media resource features into a target information recognition model, wherein the target information recognition model comprises N layers of attention modules and N is a positive integer greater than or equal to 2;
    processing the target media resource features through the N layers of attention modules to obtain a target representation vector, wherein the i-th layer attention module among the N layers of attention modules is used to determine an i-th layer attention weight parameter and an i-th layer input representation vector according to a set of shared parameters and an i-th group of non-shared parameters, and to determine an i-th layer representation vector output by the i-th layer attention module according to the i-th layer attention weight parameter and the i-th layer input representation vector, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, the i-th layer representation vector is used to determine the target representation vector; at least 2 layers of attention modules among the N layers of attention modules share the set of shared parameters, and the at least 2 layers of attention modules comprise the i-th layer attention module;
    determining a target information recognition result according to the target representation vector, wherein the target information recognition result is used to represent the target information recognized from the target media resource.
  2. The method according to claim 1, wherein the i-th layer attention weight parameter and the i-th layer input representation vector are determined as follows:
    determining the i-th layer attention weight parameter according to a first part of the shared parameters and an (i-1)-th intermediate-layer representation parameter, wherein the set of shared parameters comprises the first part of the shared parameters and a second part of the shared parameters, and the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module;
    determining the i-th layer input representation vector according to the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter, wherein the i-th group of non-shared parameters comprises the (i-1)-th intermediate-layer representation parameter.
  3. The method according to claim 2, wherein determining the i-th layer attention weight parameter according to the first part of the shared parameters and the (i-1)-th intermediate-layer representation parameter comprises:
    when the first part of the shared parameters comprises a first shared parameter WQ and a second shared parameter WK and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiplying Hi-1 by WQ and by WK to obtain a first correlation parameter Qi and a second correlation parameter Ki used in the i-th layer attention module;
    normalizing the first correlation parameter Qi and the second correlation parameter Ki to obtain an initial attention weight parameter Ai of the i-th layer attention module;
    determining the i-th layer attention weight parameter according to the initial attention weight parameter Ai and the (i-1)-th layer attention weight parameter A′i-1 used in the (i-1)-th layer attention module.
  4. The method according to claim 3, wherein determining the i-th layer attention weight parameter according to the initial attention weight parameter Ai and the (i-1)-th layer attention weight parameter A′i-1 used in the (i-1)-th layer attention module comprises:
    performing a weighted sum of the initial attention weight parameter Ai and the (i-1)-th layer attention weight parameter A′i-1 to obtain the i-th layer attention weight parameter.
  5. The method according to claim 2, wherein when the at least 2 layers of attention modules further comprise the (i+1)-th layer attention module, the (i+1)-th layer attention weight parameter and the (i+1)-th layer input representation vector of the (i+1)-th layer attention module are determined as follows:
    determining the (i+1)-th layer attention weight parameter according to the first part of the shared parameters and an i-th intermediate-layer representation parameter, wherein the i-th intermediate-layer representation parameter is determined from the i-th layer representation vector output by the i-th layer attention module;
    determining the (i+1)-th layer input representation vector according to the second part of the shared parameters and the i-th intermediate-layer representation parameter, wherein the (i+1)-th group of non-shared parameters comprises the i-th intermediate-layer representation parameter.
  6. The method according to claim 1, wherein the i-th layer attention weight parameter and the i-th layer input representation vector are determined as follows:
    determining the i-th layer attention weight parameter according to a shared attention weight parameter and a weighting parameter used in the i-th layer attention module, wherein the set of shared parameters comprises the shared attention weight parameter and a second part of the shared parameters;
    determining the i-th layer input representation vector according to the second part of the shared parameters and an (i-1)-th intermediate-layer representation parameter, wherein the (i-1)-th intermediate-layer representation parameter is determined from the (i-1)-th layer representation vector output by the (i-1)-th layer attention module, and the i-th group of non-shared parameters comprises the (i-1)-th intermediate-layer representation parameter.
  7. The method according to claim 6, wherein determining the i-th layer attention weight parameter according to the shared attention weight parameter and the weighting parameter used in the i-th layer attention module comprises:
    determining the sum of the shared attention weight parameter and the weighting parameter used in the i-th layer attention module as the i-th layer attention weight parameter.
  8. The method according to claim 6, further comprising:
    obtaining initial characterization features of the target media resource, wherein the initial characterization features are the target media resource features or features converted from the target media resource features;
    when the set of shared parameters further comprises a first part of the shared parameters and the first part comprises a first shared parameter WQ and a second shared parameter WK, multiplying the initial characterization features by WQ and by WK to obtain a first shared correlation parameter Q and a second shared correlation parameter K;
    normalizing the first shared correlation parameter Q and the second shared correlation parameter K to obtain the shared attention weight parameter.
  9. The method according to claim 6, wherein processing the target media resource features through the N layers of attention modules to obtain the target representation vector comprises:
    when the at least 2 layers of attention modules further comprise the (i+1)-th layer attention module, the (i+1)-th layer attention weight parameter and the (i+1)-th layer input representation vector of the (i+1)-th layer attention module are determined as follows:
    determining the (i+1)-th layer attention weight parameter according to the shared attention weight parameter and the weighting parameter used in the (i+1)-th layer attention module;
    determining the (i+1)-th layer input representation vector according to the second part of the shared parameters and the i-th intermediate-layer representation parameter, wherein the i-th intermediate-layer representation parameter is determined from the i-th layer representation vector output by the i-th layer attention module, and the (i+1)-th group of non-shared parameters comprises the i-th intermediate-layer representation parameter.
  10. The method according to claim 2 or 6, wherein determining the i-th layer input representation vector according to the second part of the shared parameters and the (i-1)-th intermediate-layer representation parameter comprises:
    when the second part of the shared parameters comprises a third shared parameter WV and the (i-1)-th intermediate-layer representation parameter is Hi-1, multiplying Hi-1 by WV to obtain the i-th layer input representation vector.
  11. The method according to claim 2 or 6, further comprising:
    when the (i-1)-th layer representation vector output by the (i-1)-th layer attention module has been obtained, obtaining an (i-k)-th intermediate-layer representation parameter, wherein 1<k<i and the (i-k)-th intermediate-layer representation parameter is determined from the (i-k)-th layer representation vector output by the (i-k)-th layer attention module;
    determining the (i-1)-th intermediate-layer representation parameter according to the (i-1)-th layer representation vector and the (i-k)-th intermediate-layer representation parameter.
  12. The method according to any one of claims 1 to 9, wherein processing the target media resource features through the N layers of attention modules to obtain the target representation vector comprises:
    when the at least 2 layers of attention modules are M layers of attention modules and M is less than N, performing the following steps for each p-th layer attention module among the N layers of attention modules other than the M layers of attention modules:
    according to a pre-configured sharing relationship, determining the j-th layer representation vector output by the j-th layer attention module among the M layers of attention modules as the p-th layer representation vector output by the p-th layer attention module, wherein the sharing relationship is used to indicate that the j-th layer representation vector output by the j-th layer attention module is shared with the p-th layer attention module.
  13. The method according to any one of claims 1 to 9, wherein for the i-th layer attention module, processing the target media resource features through the N layers of attention modules to obtain the target representation vector comprises:
    when the i-th layer attention module is a T-head attention module and T is a positive integer greater than or equal to 2, determining, through the T-head attention module, T initial representation vectors of the i-th layer according to T subgroups of shared parameters and the i-th group of non-shared parameters respectively, and performing a weighted sum of the T initial representation vectors to obtain the i-th layer representation vector output by the i-th layer attention module, wherein the set of shared parameters comprises the T subgroups of shared parameters.
  14. An information recognition apparatus based on an attention module, comprising:
    an acquisition module, configured to obtain target media resource features of a target media resource and input the target media resource features into a target information recognition model, wherein the target information recognition model comprises N layers of attention modules and N is a positive integer greater than or equal to 2;
    a processing module, configured to process the target media resource features through the N layers of attention modules to obtain a target representation vector, wherein the i-th layer attention module among the N layers of attention modules is used to determine an i-th layer attention weight parameter and an i-th layer input representation vector according to a set of shared parameters and an i-th group of non-shared parameters, and to determine the i-th layer representation vector output by the i-th layer attention module according to them, 1≤i≤N; when i is less than N, the i-th layer representation vector is used to determine the (i+1)-th group of non-shared parameters used by the (i+1)-th layer attention module; when i equals N, the i-th layer representation vector is used to determine the target representation vector; the target media resource features are used to determine the first group of non-shared parameters used by the first layer attention module among the N layers of attention modules; at least 2 layers of attention modules among the N layers share the set of shared parameters, and the at least 2 layers comprise the i-th layer attention module;
    a determination module, configured to determine a target information recognition result according to the target representation vector, wherein the target information recognition result is used to represent the target information recognized from the target media resource.
  15. A computer-readable storage medium, comprising a stored computer program, wherein, when run by a terminal device or a computer, the computer program executes the method according to any one of claims 1 to 13.
  16. A computer program product, comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
  17. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute the method according to any one of claims 1 to 13 through the computer program.
PCT/CN2023/089375 2022-06-21 2023-04-20 Information recognition method based on attention module and related apparatus WO2023246264A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210705199.2 2022-06-21
CN202210705199.2A CN117312582A (zh) 2022-06-21 2022-06-21 Information recognition method and apparatus based on attention module

Publications (1)

Publication Number Publication Date
WO2023246264A1 true WO2023246264A1 (zh) 2023-12-28

Family

ID=89272429

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/089375 WO2023246264A1 (zh) 2022-06-21 2023-04-20 基于注意力模块的信息识别方法和相关装置

Country Status (2)

Country Link
CN (1) CN117312582A (zh)
WO (1) WO2023246264A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287814A (zh) * 2019-06-04 2019-09-27 北方工业大学 Visual question answering method based on image object features and a multi-layer attention mechanism
CN110765359A (zh) * 2019-10-30 2020-02-07 北京速途网络科技股份有限公司 New media content recommendation method and system
CN111291189A (zh) * 2020-03-10 2020-06-16 北京芯盾时代科技有限公司 Text processing method, device and computer-readable storage medium
US20200356724A1 (en) * 2019-05-06 2020-11-12 University Of Electronic Science And Technology Of China Multi-hop attention and depth model, method, storage medium and terminal for classification of target sentiments
CN113435203A (zh) * 2021-08-30 2021-09-24 华南师范大学 Multi-modal named entity recognition method and apparatus, and electronic device
CN114239599A (zh) * 2021-12-17 2022-03-25 深圳壹账通智能科技有限公司 Implementation method, system, device and medium for machine reading comprehension
CN114329148A (zh) * 2021-10-28 2022-04-12 腾讯科技(深圳)有限公司 Content information recognition method and apparatus, computer device and storage medium
CN114579714A (zh) * 2020-12-01 2022-06-03 广州视源电子科技股份有限公司 Machine reading comprehension method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN117312582A (zh) 2023-12-29

Similar Documents

Publication Publication Date Title
US11765113B2 (en) Assistance during audio and video calls
US11113080B2 (en) Context based adaptive virtual reality (VR) assistant in VR environments
JP6889281B2 (ja) 代替インタフェースでのプレゼンテーションのための電子会話の解析
US11151765B2 (en) Method and apparatus for generating information
US10938725B2 (en) Load balancing multimedia conferencing system, device, and methods
US11741949B2 (en) Real-time video conference chat filtering using machine learning models
US11228683B2 (en) Supporting conversations between customers and customer service agents
  • WO2022156655A1 (zh) Voice call control method and apparatus, computer-readable medium and electronic device
  • WO2023246264A1 (zh) Information recognition method based on attention module and related apparatus
  • CN107783650A (zh) Human-computer interaction method and apparatus based on a virtual robot
US12028302B2 (en) Assistance during audio and video calls
US20240007817A1 (en) Real-time low-complexity stereo speech enhancement with spatial cue preservation
US20240161764A1 (en) Accent personalization for speakers and listeners
US20230245658A1 (en) Asynchronous pipeline for artificial intelligence service requests
US20230005202A1 (en) Speech image providing method and computing device for performing the same
  • CN118233588A (zh) Method, device and storage medium for generating emoticons in a video call
  • CN116245974A (zh) Speech-based drawing method, apparatus, device and storage medium
  • WO2022214616A1 (en) Personalizing audio-visual content based on user's interest
  • CA3143953A1 (en) Systems and methods for automating voice commands
  • CN117041223A (zh) Document sharing method and apparatus, electronic device and storage medium
  • CN118116384A (zh) Speech recognition method, device and storage medium
  • EP4150440A1 (en) Delivery of compatible supplementary content via a digital assistant
  • CN113868399A (zh) Server overselling implementation method and apparatus, storage medium and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23825917

Country of ref document: EP

Kind code of ref document: A1