WO2024082891A1 - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
WO2024082891A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
recognition result
modal
probability
data
Prior art date
Application number
PCT/CN2023/119082
Other languages
English (en)
French (fr)
Inventor
傅奕飞
胡海林
朱铭健
陈醒濠
王云鹤
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2024082891A1

Links

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/0464 Convolutional networks [CNN, ConvNet]
                • G06N 3/048 Activation functions
              • G06N 3/08 Learning methods
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/60 Type of objects
              • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
          • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
            • G06V 30/10 Character recognition
              • G06V 30/18 Extraction of features or characteristics of the image
              • G06V 30/19 Recognition using electronic means

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a data processing method and related equipment.
  • Artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a manner similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • The application of using OCR (optical character recognition) technology to replace human labor in identifying and processing text information in images has become more and more widespread.
  • OCR technology is widely used in real-life scenarios such as document recognition, license plate recognition, advertising image text recognition, and bill recognition.
  • a language model is often used to correct the character information recognized by the visual model, and the correction result is used as the final recognition result of the character.
  • the correction result is highly dependent on the semantic information learned by the language model, which may cause the correct recognition result to be modified into an incorrect recognition result, that is, the above recognition method will have an over-correction problem.
  • the embodiments of the present application provide a data processing method and related devices for improving the accuracy of data character recognition.
  • the first aspect of the embodiment of the present application provides a data processing method, which is applied to text recognition/character recognition scenarios, and the method includes: obtaining input data, where the input data is image data or audio data; extracting the first modal feature of the input data; obtaining the second modal feature based on the first modal feature, where the first modal feature and the second modal feature are features of different modalities; the first modal feature is the visual feature of the image data or the audio feature of the audio data, and the second modal feature is the character feature; fusing the first modal feature and the second modal feature to obtain a target feature.
  • the target feature takes into account both the first modal feature and the second modal feature, so that the target feature has richer multi-modal information.
  • a first recognition result of the input data is obtained based on the target feature, and the first recognition result is used to indicate the characters contained in the input data.
  • the second modal feature is obtained according to the first modal feature of the input data, and the first modal feature and the second modal feature are fused to obtain the target feature, which can efficiently fuse the information of different modal data, so that the obtained target feature has the characteristics of multimodal data, and improves the expressive power of the target feature.
  • the accuracy of the first recognition result obtained according to the target feature is higher.
  • by reintroducing the first modal feature from before correction, the problem of over-correction of the second modal feature can be reduced.
  • the above step: obtaining the second modal feature based on the first modal feature includes: obtaining a second recognition result based on the first modal feature, the second recognition result being a character recognition result of the image data or a character recognition result of the audio data; obtaining the second modal feature based on the second recognition result.
  • the second modal feature is obtained by using a second recognition result related to the first modal feature, so that a partial correction of the first modal feature can be achieved.
  • The above step of extracting the first modal feature of the input data includes: inputting the input data into a first feature extraction module to obtain the first modal feature, where the first feature extraction module is used to extract visual features or audio features. Obtaining the second modal feature based on the second recognition result includes: inputting the second recognition result into a second feature extraction module to obtain the second modal feature, where the second feature extraction module is used to extract character features.
  • the extracted second modal features can be used to correct the first modal features recognized by the visual module.
  • the above steps further include: obtaining a target recognition result of the input data based on the second recognition result and the first recognition result, and using the target recognition result as the recognition result of the character in the input data.
  • the target recognition result is used as the final recognition result of the character in the input data.
  • the original result (i.e., the second recognition result) obtained from the first modal feature and the corrected result (i.e., the first recognition result) obtained from the second modal feature are considered simultaneously, thereby taking into account both the strong correction ability of the language module (i.e., the module for obtaining the second modal feature) and the strong recognition ability of the visual module (i.e., the module for obtaining the first modal feature).
  • the above step: obtaining a target recognition result of the input data based on the second recognition result and the first recognition result includes: obtaining a first probability and a second probability, the first probability being the probability of each character in the first recognition result, and the second probability being the probability of each character in the second recognition result; determining the target recognition result based on the first probability and the second probability.
  • the accuracy of recognizing each character is improved by integrating the first probability of each character in the first recognition result and the second probability of each character in the second recognition result, taking into account both the probability of each character in the result corresponding to the initial modality and the probability of each character in the corrected result.
  • the above step of: determining the target recognition result based on the first probability and the second probability includes: adding the first probability and the second probability corresponding to the characters at the same position in the first recognition result and the second recognition result; and determining the target recognition result based on the added probability.
  • the addition may be direct addition or weighted addition, which is not specifically limited here.
  • the probability of each character in the initial modality correspondence result and the probability of each character in the correction result are added, and the target recognition result is obtained based on the added probability, thereby improving the accuracy of the target recognition result.
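The probability fusion described above can be sketched as follows (a hypothetical minimal example, not the patented implementation; the character set, the probability values and the fuse_probabilities helper are invented for illustration): the per-character distributions of the first and second recognition results are added, optionally with weights, and the target recognition result is decoded by taking the most probable character at each position.

```python
import numpy as np

CHARSET = list("ABCDEFG")  # toy character set (illustrative)

def fuse_probabilities(p_first, p_second, w_first=1.0, w_second=1.0):
    """Add (or weighted-add) the per-character probabilities of the first and
    second recognition results, then decode the target recognition result."""
    fused = w_first * np.asarray(p_first) + w_second * np.asarray(p_second)
    return "".join(CHARSET[i] for i in fused.argmax(axis=-1))

# Per-position distributions over CHARSET for a 4-character string (made up).
p_second = np.array([[0.10, 0.10, 0.30, 0.05, 0.05, 0.05, 0.35],   # leans to "G"
                     [0.70, 0.10, 0.05, 0.05, 0.05, 0.03, 0.02],   # "A"
                     [0.05, 0.05, 0.05, 0.05, 0.05, 0.70, 0.05],   # "F"
                     [0.05, 0.05, 0.05, 0.05, 0.70, 0.05, 0.05]])  # "E"
p_first = p_second.copy()
p_first[0] = [0.10, 0.10, 0.60, 0.05, 0.05, 0.05, 0.05]            # leans to "C"

print(fuse_probabilities(p_first, p_second))  # fused evidence gives "CAFE"
```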
  • the above step: fusing the first modal feature with the second modal feature to obtain the target feature includes: fusing the first modal feature with the second modal feature of characters at the same position to obtain the target feature.
  • the target feature has information of different modalities, thereby improving the expressiveness of the target feature.
  • the above-mentioned steps: obtaining a first recognition result of the input data based on the target feature include: determining the correspondence between the target feature and multiple characters; obtaining a set of arrangement methods of multiple characters, the arrangement method set including multiple arrangement methods; based on each arrangement method in the arrangement method set, performing maximum likelihood estimation on the last character under each arrangement method to obtain a first recognition result.
  • the input data is image data containing characters
  • the first modal feature is a visual feature
  • the second modal feature is a character feature
  • the method can be applied to character recognition or text recognition scenarios in images, such as identification/automatic entry of certificate information and bill information, auxiliary reading scenarios for the disabled, and filtering of banned words.
  • the input data is audio data
  • the first modal feature is an audio feature
  • the second modal feature is a character feature
  • the method can be applied to character recognition or text recognition scenarios in audio, such as auxiliary learning scenarios for the deaf and mute.
  • a second aspect of an embodiment of the present application provides a data processing device, which is applied to a text recognition/character recognition scenario, and the data processing device includes: an acquisition unit, used to acquire input data, where the input data is image data or audio data; an extraction unit, used to extract a first modal feature of the input data; the acquisition unit, further used to acquire a second modal feature based on the first modal feature, where the first modal feature and the second modal feature are features of different modalities; the first modal feature is a visual feature of the image data or an audio feature of the audio data, and the second modal feature is a character feature; a fusion unit, used to fuse the first modal feature and the second modal feature to obtain a target feature; the acquisition unit, further used to acquire a first recognition result of the input data based on the target feature, where the first recognition result is used to indicate the characters contained in the input data.
  • the above-mentioned acquisition unit is specifically used to acquire a second recognition result based on the first modal feature, and the second recognition result is a character recognition result of the image data or a character recognition result of the audio data; the acquisition unit is specifically used to acquire the second modal feature based on the second recognition result.
  • the above-mentioned extraction unit is specifically used to input the input data into a first feature extraction module to obtain a first modal feature, and the first feature extraction module is used to extract visual features or audio features; the acquisition unit is specifically used to input the second recognition result into a second feature extraction module to obtain a second modal feature, and the second feature extraction module is used to extract character features.
  • the acquisition unit is further used to acquire a target recognition result of the input data based on the second recognition result and the first recognition result, and the target recognition result is used as the recognition result of the character in the input data.
  • the target recognition result is used as the final recognition result of the character in the input data.
  • the above-mentioned acquisition unit is specifically used to obtain a first probability and a second probability, the first probability being the probability of each character in the first recognition result, and the second probability being the probability of each character in the second recognition result; the acquisition unit is specifically used to determine the target recognition result based on the first probability and the second probability.
  • the above-mentioned acquisition unit is specifically used to add the first probability and the second probability corresponding to the characters in the same position in the first recognition result and the second recognition result; the acquisition unit is specifically used to determine the target recognition result based on the added probability.
  • the above-mentioned fusion unit is specifically used to fuse the first modal features and the second modal features of characters at the same position to obtain the target features.
  • the above-mentioned acquisition unit is specifically used to determine the correspondence between the target feature and multiple characters; the acquisition unit is specifically used to obtain a set of arrangement methods of multiple characters, and the arrangement method set includes multiple arrangement methods; the acquisition unit is specifically used to perform maximum likelihood estimation on the last character under each arrangement method based on each arrangement method in the arrangement method set to obtain a first recognition result.
  • the input data is image data containing characters
  • the first modal feature is a visual feature
  • the second modal feature is a character feature
  • the input data is audio data
  • the first modal feature is an audio feature
  • the second modal feature is a character feature
  • a third aspect of an embodiment of the present application provides a data processing device, including: a processor, the processor is coupled to a memory, the memory is used to store programs or instructions, when the program or instructions are executed by the processor, the data processing device implements the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • a fourth aspect of an embodiment of the present application provides a computer-readable medium having a computer program or instructions stored thereon.
  • When the computer program or instructions are executed on a computer, the computer is caused to execute the method in the aforementioned first aspect or any possible implementation of the first aspect.
  • a fifth aspect of an embodiment of the present application provides a computer program product.
  • When the computer program product is executed on a computer, it enables the computer to execute the method in the aforementioned first aspect or any possible implementation manner of the first aspect.
  • For the technical effects brought about by the second, third, fourth, and fifth aspects or any of their possible implementations, reference may be made to the technical effects brought about by the first aspect or the different possible implementations of the first aspect, and they will not be repeated here.
  • the second modal feature is obtained according to the first modal feature of the input data (which can be understood as the correction process of the first modal feature), and the first modal feature and the second modal feature are fused to obtain the target feature, so that the accuracy of the first recognition result obtained according to the target feature is higher.
  • In the embodiments of the present application, the features of the two modalities (i.e., the first modal feature and the second modal feature) are taken into account simultaneously in the process of character recognition of the input data. Since different modalities have different ways of expression and different perspectives on things, there are some crossover/complementary phenomena, and there may even be a variety of different information interactions between modalities.
  • FIG1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG2A is a schematic diagram of a bill recognition scenario provided in an embodiment of the present application.
  • FIG2B is a schematic diagram of a document recognition scenario provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the structure of the system architecture provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a chip hardware structure provided in an embodiment of the present application.
  • FIG5 is a flow chart of a data processing method provided in an embodiment of the present application.
  • FIG6 is an example diagram of input data provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of a training method and an inference method of a correction module provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of a neural network provided in an embodiment of the present application.
  • FIG9 is another flow chart of a data processing method provided in an embodiment of the present application.
  • FIG10 is another schematic diagram of a neural network provided in an embodiment of the present application.
  • FIG11 is a schematic diagram of a processing flow of a probability fusion module provided in an embodiment of the present application.
  • FIG12 is a schematic diagram of a structure of a data processing device provided in an embodiment of the present application.
  • FIG13 is another schematic diagram of the structure of the data processing device provided in an embodiment of the present application.
  • a neural network may be composed of neural units, and a neural unit may refer to an operation unit with $x_{s}$ and an intercept $b$ as input, and the output of the operation unit may be: $h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$, where n is a natural number greater than 1, $W_{s}$ is the weight of $x_{s}$, and $b$ is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a Relu function.
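As a minimal illustration of the neural unit described above (the input values, weights and bias below are arbitrary examples), the following sketch computes f(W^T x + b) with ReLU as the activation function f:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def neural_unit(x, W, b):
    """Output of a single neural unit: f(sum_s W_s * x_s + b), with f = ReLU."""
    return relu(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_s (example values)
W = np.array([0.2, 0.4, -0.1])   # weights W_s (example values)
b = 0.3                          # bias / intercept b

# 0.2*0.5 + 0.4*(-1.0) + (-0.1)*2.0 + 0.3 = -0.2, and ReLU(-0.2) = 0.0
print(neural_unit(x, W, b))
```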
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • space is used here because the classified object is not a single thing, but a class of things, and space refers to the collection of all individuals of this class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • the vector W determines the spatial transformation from the input space to the output space mentioned above, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training a neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by many layers of vectors W). Therefore, the training process of a neural network is essentially about learning how to control spatial transformations, or more specifically, learning the weight matrix.
  • A convolutional neural network (CNN) is a deep neural network with a convolutional structure.
  • Convolutional neural networks contain a feature extractor consisting of a convolution layer and a subsampling layer.
  • the feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving the same trainable filter with an input image or convolution feature plane (feature map).
  • a convolution layer refers to a neuron layer in a convolutional neural network that performs convolution processing on the input signal.
  • a neuron can only be connected to some neurons in the adjacent layer.
  • a convolution layer usually contains several feature planes, each of which can be composed of some rectangularly arranged neural units.
  • the neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as meaning that the way image information is extracted is independent of position.
  • the implicit principle is that the statistical information of a part of the image is the same as that of other parts. This means that the image information learned in a part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally speaking, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of shared weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
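The shared-weight idea can be made concrete with a small sketch (illustrative only; the image and kernel values are random placeholders): one 3x3 convolution kernel is slid over every position of the input, so the same learned weights extract the same kind of local information everywhere on the image.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same kernel weights are reused at every spatial position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6))             # toy single-channel image
kernel = rng.standard_normal((3, 3))   # randomly initialized, learned during training
print(conv2d_valid(image, kernel).shape)  # (4, 4) feature map
```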
  • the transformer structure is a feature extraction network that includes an encoder and a decoder (similar to a convolutional neural network).
  • The encoder performs feature learning in the global receptive field through self-attention, for example on pixel features.
  • The decoder learns the features of the required modules, such as the features of the output box, through self-attention and cross-attention.
  • Attention (also called the attention mechanism) can quickly extract important features of sparse data.
  • the attention mechanism occurs between the encoder and the decoder, or between the input sentence and the generated sentence.
  • the self-attention mechanism in the self-attention model occurs within the input sequence or the output sequence, and can extract the connection between words that are far apart in the same sentence, such as syntactic features (phrase structure).
  • the self-attention mechanism provides an effective modeling method to capture global context information through QKV. Assume that the input is a query Q, and the context is stored in the form of key-value pairs (K, V). The essence of the attention function can then be described as a mapping from a query to a series of key-value pairs.
  • Attention essentially assigns a weight coefficient to each element in the sequence, which can also be understood as soft addressing. If each element in the sequence is stored in the form of (K, V), then attention completes the addressing by calculating the similarity between Q and K. The similarity calculated between Q and K reflects the importance of the extracted V value, that is, the weight, and then the weighted sum is used to obtain the final feature value.
  • the calculation of attention is mainly divided into three steps.
  • the first step is to calculate the similarity between the query and each key to obtain the weight.
  • Commonly used similarity functions include dot product, concatenation, perceptron, etc.
  • the second step is generally to use a softmax function to normalize these weights (on the one hand, this yields a probability distribution in which all weight coefficients sum to 1; on the other hand, the characteristics of the softmax function can be used to highlight the weights of important elements); finally, the weights and the corresponding values are weighted and summed to obtain the final feature value.
  • the specific calculation formula can be as follows: $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$, where $d$ is the dimension of the Q and K matrices.
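A minimal sketch of the attention computation just described (scaled dot-product form; the NumPy arrays below stand in for Q, K and V): similarities between the query and each key are scaled by the square root of d, normalized with softmax, and used to weight the values. Calling it with Q, K and V taken from the same input corresponds to self-attention; cross-attention would draw Q from a different source.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))    # steps 1-2: similarity + softmax
    return weights @ V                         # step 3: weighted sum of values

rng = np.random.default_rng(0)
X = rng.random((5, 8))        # a sequence of 5 feature vectors of dimension 8
out = attention(X, X, X)      # self-attention: Q, K, V share the same input
print(out.shape)              # (5, 8)
```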
  • attention includes self-attention and cross-attention.
  • Self-attention can be understood as a special attention, that is, the input of QKV is consistent.
  • the input of QKV in cross-attention is inconsistent.
  • Attention uses the similarity between features (such as inner product) as weight to integrate the queried features as the update value of the current feature.
  • Self-attention is attention computed on the feature map itself.
  • the setting of the convolution kernel limits the size of the receptive field, resulting in the network often needing multiple layers of stacking to focus on the entire feature map.
  • the advantage of self-attention is that its attention is global, and it can obtain the global spatial information of the feature map through simple query and assignment.
  • the special point of self-attention in the query key value (QKV) model is that the input corresponding to QKV is consistent. The QKV model will be described later.
  • A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps inputs to a single output.
  • modality refers to the way things happen or exist.
  • each source or form of information can be called a modality.
  • the current research field mainly deals with the processing of modalities such as images, texts, and voices.
  • the modalities mentioned above can also be understood as “senses”, that is, the channels through which organisms receive information through sensory organs and experience.
  • humans have vision, hearing, touch, taste, smell and other modalities.
  • Multimodality can be understood as the fusion of multiple senses.
  • humans can communicate with smart devices through multiple channels such as sound, body language, information carriers (such as text, pictures, audio, video, etc.), and environment. Smart devices make judgments about human intentions after integrating multimodal information, and provide feedback to humans through text, sound, light strips and other methods.
  • the application scenario is shown in FIG1 , and the scenario includes: a terminal device 101 and a server 102.
  • the terminal device 101 and the server 102 can be connected to each other through a communication network, and the network can be a local area network or a wide area network transferred through a relay device.
  • Various clients can be installed in the terminal device 101.
  • the client of the terminal device 101 and the server 102 establish a communication connection through the communication network, the client of the terminal device 101 can send the data to be processed to the server 102, and the server 102 performs AI processing (for example: recognition, classification, etc.) on the data to be processed to obtain the processing result, and then sends the processing result to the client of the terminal device 101.
  • the communication network for communication connection between the terminal device 101 and the server 102 is a local area network
  • the communication network can be a short-range communication network such as a wireless fidelity (wifi) hotspot network, a Bluetooth (BT) network, or a near field communication (NFC) network.
  • the communication network for communication connection between the terminal device 101 and the server 102 is a wide area network
  • the communication network can be a third-generation mobile communication technology (3rd-generation mobile communication technology, 3G) network, a fourth-generation mobile communication technology (4G) network, a fifth-generation mobile communication technology (5G) network, a future evolved public land mobile network (public land mobile network, PLMN) or the Internet, etc.
  • the above-mentioned terminal device 101 can be a mobile phone, a tablet computer (pad), a portable game console, a personal digital assistant (PDA), a laptop computer, an ultra mobile personal computer (UMPC), a handheld computer, a netbook, a vehicle-mounted media player, a wearable electronic device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a vehicle, a vehicle-mounted terminal, an aircraft terminal, an intelligent robot and other terminal devices.
  • the server 102 may be a cloud server, a network server, an application server, a management server, or other device or server capable of processing computer vision tasks.
  • the computer vision tasks include at least one or more of the following: identification, classification, and the like.
  • the scenario shown in FIG. 1 above can be understood as a cloud interaction scenario, and the data processing method in this scenario can be provided to users in the form of cloud services such as software as a service (SaaS) or function as a service (FaaS).
  • a server for processing computer vision tasks can be deployed to a public cloud to provide an externally published cloud service that is used to classify images and then perform character recognition on the images.
  • the uploaded data such as images can also be protected, for example, the images can be encrypted.
  • the server for processing computer vision tasks can also be deployed to a private cloud, thereby providing a cloud service for internal use.
  • the server for processing computer vision tasks can also be deployed to a hybrid cloud.
  • a hybrid cloud refers to an architecture that includes at least one public cloud and at least one private cloud.
  • the cloud service may provide an application programming interface (API) and/or a user interface.
  • the user interface may be a graphical user interface (GUI) or a command user interface (CUI).
  • the service caller may directly call the API provided by the cloud service to process data, such as classifying images.
  • the cloud service may also receive images submitted by users through the GUI or CUI, classify the images, and return the classification results.
  • the data processing method provided in the embodiment of the present application can be provided to the user in a packaged software package. Specifically, after the user purchases the software package, the user can install and use it in the user's operating environment.
  • the above software package can also be pre-installed on a computing device for data processing.
  • the terminal device can receive instructions from the user.
  • the terminal device can obtain image data input/selected by the user, and then initiate a request to the server so that the server performs data processing applications (for example, computer vision tasks such as classification, segmentation, detection, and image generation) on the image data obtained by the terminal device, thereby obtaining processing results corresponding to the image data.
  • the terminal device can obtain an image input by the user, and then initiate a character (or text) recognition request to the server so that the server performs character recognition on the image, thereby obtaining a character recognition result of the image and sending the character recognition result to the terminal device. Then the terminal device can display the character recognition result of the image for the user to view and use.
  • the steps performed by the server in Figure 1 can also be migrated to the terminal device for implementation. That is, the terminal device can receive instructions from the user. For example, the terminal device can obtain image data input/selected by the user, and then perform data processing applications on the image data (for example, computer vision tasks such as classification, segmentation, detection, and image generation), thereby obtaining processing results corresponding to the image data.
  • the terminal device can obtain an image input by the user, and then perform character recognition on the image to obtain a character recognition result of the image. The character recognition result of the image is displayed for the user to view and use.
  • the above application scenario may specifically be an optical character recognition (OCR) scenario.
  • the scenario includes at least one or more of the following: identification/automatic entry of certificate information (or card information) and bill information, auxiliary reading scenarios for the disabled, or scenarios for filtering banned words.
  • the input data is image data/document
  • the computer vision task is a classification task.
  • the terminal device 101 can send the image data/document to the server 102, and the server 102 classifies and identifies the image data/document to obtain a classification result.
  • the classification result includes a category label of the image data/document, and the category label is used to characterize the category of the image data/document.
  • the category may include categories such as cards, tickets, labels, mails, or files.
  • the category of the image data/document can be further divided into subcategories, such as cards can be divided into subcategories such as work cards, bank cards, passes, and driver's licenses, and tickets can include subcategories such as shopping receipts and taxi tickets.
  • the classification result can also include the confidence that the image data/document belongs to the corresponding category.
  • the confidence is a probability value determined based on experience and used to characterize the degree of credibility.
  • the confidence can be a value in the range of [0,1]. The closer the value is to 1, the higher the degree of credibility is, and the closer the value is to 0, the lower the degree of credibility is.
  • Example 1: the bill recognition scenario is shown in Figure 2A.
  • the terminal device obtains the bill image taken or scanned by the user, and the terminal device performs OCR text recognition on the bill image to obtain a recognition result (for example, date, company, amount, etc.). And information statistics/reimbursement and other processing are performed based on the recognition result.
  • the terminal device obtains the bill image taken or scanned by the user, and sends the bill image to the server, and the server performs OCR text recognition on the bill image to obtain a recognition result (for example, date, company, amount, etc.). And the recognition result is sent to the terminal device, so that the user can use the recognition result to perform information statistics/reimbursement and other processing.
  • Example 2: the document recognition scenario is shown in Figure 2B.
  • the terminal device obtains the ID image taken or scanned by the user, and the terminal device performs OCR text recognition on the ID image to obtain a recognition result (for example, name, address, contact number, date, etc.). And identity verification and other processing are performed based on the recognition result.
  • the terminal device obtains the ID image taken or scanned by the user, and sends the ID image to the server.
  • the server performs OCR text recognition on the ID image to obtain a recognition result (for example, name, address, contact number, date), and the recognition result is sent to the terminal device so that the user can use it for identity verification and other processing.
  • OCR technology is widely used in real-life scenarios such as document recognition, license plate recognition, advertising image text recognition, and bill recognition.
  • a language model is often used to correct the character information recognized by the visual model, and the correction result is used as the final recognition result of the character.
  • the correction result is highly dependent on the semantic information learned by the language model, which may cause the correct recognition result to be modified into an incorrect recognition result, that is, the above recognition method will have an over-correction problem. Therefore, how to solve the over-correction of the language model in text recognition is a technical problem that needs to be solved urgently.
  • the embodiments of the present application provide a data processing method and related equipment, which takes into account the characteristics of two modes (i.e., the first modal characteristics and the second modal characteristics) at the same time during the character recognition of the input data. Since different modalities are expressed in different ways, the perspectives of looking at things will also be different, so there are some cross-over/complementary phenomena, and there may even be a variety of different information interactions between the modalities. By reasonably processing the characteristics of the two modalities, rich target features can be obtained, thereby improving the recognition accuracy. And compared to the method of determining the recognition result only based on the corrected second modal characteristics, by reintroducing the first modal characteristics before correction, the problem of over-correction of the second modal characteristics can be reduced.
  • an embodiment of the present application provides a system architecture 300.
  • the data acquisition device 360 is used to collect training data.
  • the training data includes: audio samples or image samples containing characters, etc.
  • the training data is stored in the database 330, and the training device 320 obtains the target model/rule 301 based on the training data maintained in the database 330.
  • the target model/rule 301 can be used to implement the computer vision task applied by the data processing method provided in the embodiment of the present application.
  • the computer vision task may include: recognition, classification and other tasks.
  • the target model/rule 301 in the embodiment of the present application may specifically include at least one or more of the following: CNN, transformer, MLP, etc.
  • the training data maintained in the database 330 may not all come from the collection of the data acquisition device 360, but may also be received from other devices.
  • the training device 320 does not necessarily train the target model/rule 301 entirely based on the training data maintained by the database 330, and it is also possible to obtain training data from the cloud or other places for model training.
  • the above description should not be regarded as a limitation on the embodiments of the present application.
  • the target model/rule 301 obtained by training the training device 320 can be applied to different systems or devices, such as the execution device 310 shown in FIG. 3 .
  • the execution device 310 can be a terminal, such as a mobile terminal, a tablet computer, a laptop computer, an augmented reality (AR) device/virtual reality (VR) device, a vehicle terminal, etc.
  • the execution device 310 can also be a server or a cloud.
  • the execution device 310 is configured with an I/O interface 312 for data interaction with an external device.
  • the user can input data to the I/O interface 312 through the client device 340.
  • the input data can include: image data, audio data, etc. in the embodiment of the present application.
  • the input data can be input by the user, or uploaded by the user through a shooting device, and of course it can also come from a database, which is not limited here.
  • the preprocessing module 313 is used to preprocess the input data received by the I/O interface 312.
  • the preprocessing module 313 can be used to split the input data to obtain a sub-data set. For example, if the input data is image data, the preprocessing module 313 is used to split the image to obtain multiple image blocks.
  • When the execution device 310 preprocesses the input data, or when the computing module 311 of the execution device 310 performs calculations and other related processing, the execution device 310 can call the data, code, etc. in the data storage system 350 for corresponding processing, and can also store the data, instructions, etc. obtained from the corresponding processing into the data storage system 350.
  • the I/O interface 312 returns the processing result, such as the result corresponding to the above-mentioned computer vision task, to the client device 340, thereby providing it to the user.
  • the training device 320 can generate corresponding target models/rules 301 based on different training data for different goals or different tasks.
  • the corresponding target models/rules 301 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.
  • the user can manually provide input data, which can be manually provided through the interface provided by the I/O interface 312.
  • the client device 340 can automatically send input data to the I/O interface 312. If the client device 340 needs to obtain the user's authorization to automatically send the input data, the user can set the corresponding authority in the client device 340.
  • the user can view the results output by the execution device 310 on the client device 340.
  • the specific presentation form can be display, sound, action and other specific methods.
  • the client device 340 can also be used as a data acquisition terminal to collect the input data of the input I/O interface 312 and the output results of the output I/O interface 312 as new sample data, and store them in the database 330.
  • the I/O interface 312 directly stores the input data of the input I/O interface 312 and the output results of the output I/O interface 312 as new sample data in the database 330.
  • FIG3 is only a schematic diagram of a system architecture provided in an embodiment of the present application.
  • the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 350 is an external memory relative to the execution device 310. In other cases, the data storage system 350 can also be placed in the execution device 310.
  • a target model/rule 301 is obtained through training with a training device 320 .
  • the target model/rule 301 in the embodiment of the present application may specifically be a target neural network.
  • the terminal device in the scenario shown in FIG. 1 above may specifically be the client device 340 or the execution device 310 in FIG. 3 , wherein the data storage system 350 may store the data to be processed of the execution device 310 , and the data storage system 350 may be integrated on the execution device 310 , or may be set on the cloud or other network servers.
  • FIG4 is a chip hardware structure provided in an embodiment of the present application, and the chip includes a neural network processor 40.
  • the chip can be set in the execution device 310 shown in FIG3 to complete the calculation work of the calculation module 311.
  • the chip can also be set in the training device 320 shown in FIG3 to complete the training work of the training device 320 and output the target model/rule 301.
  • the neural network processor 40 can be a neural network processor (neural-network processing unit, NPU), a tensor processing unit (tensor processing unit, TPU), or a graphics processing unit (graphics processing unit, GPU) and any other processor suitable for large-scale XOR operation processing.
  • the neural network processor 40 is mounted on the main central processing unit (central processing unit, CPU) (host CPU) as a coprocessor, and the main CPU assigns tasks.
  • the core part of the NPU is the operation circuit 403, and the controller 404 controls the operation circuit 403 to extract data from the memory (weight memory or input memory) and perform operations.
  • the operation circuit 403 includes multiple processing units (process engines, PEs) inside.
  • the operation circuit 403 is a two-dimensional systolic array.
  • the operation circuit 403 can also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • the operation circuit 403 is a general-purpose matrix processor.
  • the operation circuit 403 takes the corresponding data of the matrix B from the weight memory 402 and caches it on each PE in the operation circuit.
  • the operation circuit takes the matrix A data from the input memory 401 and performs a matrix operation with the matrix B, and the partial result or the final result of the matrix is stored in the accumulator 408.
  • the vector calculation unit 407 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 407 can be used for network calculations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, etc.
  • the vector calculation unit 407 stores the vector of processed outputs to the unified memory 406.
  • the vector calculation unit 407 can apply a nonlinear function to the output of the operation circuit 403, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 407 generates a normalized value, a merged value, or both.
  • the vector of processed outputs can be used as an activation input to the operation circuit 403, such as for use in a subsequent layer in a neural network.
  • the unified memory 406 is used to store input data and output data.
  • Data is directly transferred from the external memory to the input memory 401 and/or the unified memory 406 through the direct memory access controller (DMAC) 405, the weight data in the external memory is transferred to the weight memory 402, and the data in the unified memory 406 is transferred to the external memory.
  • the bus interface unit (BIU) 410 is used to implement the interaction between the main CPU, DMAC and instruction fetch memory 409 through the bus.
  • An instruction fetch buffer 409 connected to the controller 404 is used to store instructions used by the controller 404.
  • the controller 404 is used to call the instructions cached in the instruction fetch memory 409 to control the working process of the computing accelerator.
  • the unified memory 406, the input memory 401, the weight memory 402 and the instruction fetch memory 409 are all on-chip memories, and the external memory is a memory outside the NPU, which can be a double data rate synchronous dynamic random access memory (DDR SDRAM for short), a high bandwidth memory (HBM) or other readable and writable memory.
  • the data processing method provided by the embodiment of the present application is described below.
  • the method can be performed by a data processing device, or by a component of the data processing device (such as a processor, a chip, or a chip system, etc.).
  • the data processing device can be a server or terminal device in the aforementioned Figures 1 to 2B.
  • the method can also be performed by a system consisting of a server and a terminal device (as shown in the aforementioned Figure 1).
  • the method can be processed by a CPU in a data processing device, or by a CPU and a GPU, or by using other processors suitable for neural network calculations without using a GPU, and the present application is not limited thereto.
  • the data involved in the embodiment of the present application can refer to text, images, audio, video, etc. For the convenience of description, this article only takes image data as an example for exemplary description.
  • Fig. 5 is a flowchart of a data processing method provided in an embodiment of the present application.
  • the method may include steps 501 to 505. Steps 501 to 505 are described in detail below.
  • Step 501 obtaining input data.
  • The data processing device obtains the input data, which can be done through collection/photography, by receiving data sent by other devices, by selecting from a database, etc., which is not limited here.
  • the input data is only described as image data containing characters.
  • the input data may also be audio data, video data, etc., which are not limited here.
  • Characters may also be understood as text (e.g., Chinese, English, etc.).
  • the method can be applied to character recognition or text recognition scenarios in images, such as identification/automatic entry of certificate information and bill information, auxiliary reading scenarios for the disabled, and filtering of banned words.
  • the method can be applied to character recognition or text recognition scenarios in audio, such as auxiliary learning scenarios for the deaf and mute.
  • the input data may be as shown in FIG. 6 .
  • Step 502 extract the first modal feature of the input data.
  • After the data processing device acquires the input data, it can extract the first modal features of the input data.
  • the data processing device inputs the input data into a first feature extraction module to obtain a first modal feature.
  • the first feature extraction module may include a transformer encoder, a convolution layer/pooling layer of a CNN, or an MLP, etc.
  • the specific structure of the first feature extraction module may be set according to actual needs and is not limited here.
  • the first modal feature is related to the modality of the input data. If the input data is image data, the first feature extraction module is used to extract the visual features of the data, that is, the first modal feature is a visual feature (or a visual feature vector). If the input data is audio data, the first feature extraction module is used to extract the audio features of the data, that is, the first modal feature is an audio feature.
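As one possible instantiation of the first feature extraction module (a sketch only, assuming a small CNN backbone followed by a transformer encoder layer; the class name, layer sizes and input shape are illustrative and not taken from the patent):

```python
import torch
from torch import nn

class FirstFeatureExtractor(nn.Module):
    """Sketch of a first feature extraction module for image data:
    a small CNN backbone followed by one transformer encoder layer."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                  batch_first=True)

    def forward(self, images):                    # images: (B, 3, H, W)
        fmap = self.backbone(images)              # (B, dim, H/4, W/4)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, dim)
        return self.encoder(tokens)               # first modal (visual) features

features = FirstFeatureExtractor()(torch.randn(1, 3, 32, 128))
print(features.shape)                             # torch.Size([1, 256, 128])
```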
  • Step 503 Acquire the second modal feature based on the first modal feature.
  • the second modal feature can be obtained based on the first modal feature.
  • the second modal feature is a character feature, and the first modal feature and the second modal feature are features of different modalities.
  • the description of the modality can refer to the explanation in the aforementioned related terms, which will not be repeated here.
  • the data processing device obtains a second recognition result based on the first modal feature.
  • the second recognition result can also be understood as a preliminary recognition result of the characters in the input data.
  • the second recognition result is input into a second extraction module to obtain a second modal feature.
  • the second feature extraction module is used to extract character features (or character feature vectors) of characters.
  • the second recognition result can be understood as a preliminary classification result.
  • the second feature extraction module is similar to the first feature extraction module, and can be a transformer encoder, a convolutional layer/pooling layer, an MLP, etc.
  • the second feature extraction module is often a transformer encoder.
  • the data processing device inputs the first modal feature into a classification module to obtain a second recognition result, and the classification module corresponds to the first feature extraction module.
  • the classification module can be a decoder.
  • the second recognition result is “GAFE”.
  • Step 504 fuse the first modal feature and the second modal feature to obtain the target feature.
  • After the data processing device obtains the second modal feature, it can fuse the first modal feature and the second modal feature to obtain the target feature. This step can efficiently fuse the information of different modal data, so that the obtained target feature has the characteristics of multimodal data, and improves the expressive power of the target feature.
  • the first modal feature and the second modal feature of the characters at the same position are fused to obtain the target feature.
  • the data processing device can input the first modal feature and the second modal feature into a feature fusion module for alignment and fusion to obtain the target feature.
  • the fusion can be vector addition, weighted summation, etc., which is not limited here.
  • the feature fusion module is used to fuse different modal features corresponding to characters at the same position.
  • the fusion layer is a transformer structure, etc.
  • For example, the fusion may be performed as in formula 1: $E_{i}=E_{i}^{v}+E_{i}^{c}$ (formula 1), where $E_{i}$ represents the feature vector of the i-th character after fusion, $E_{i}^{v}$ represents the first modal feature of the i-th character (e.g., the visual feature vector), $E_{i}^{c}$ represents the second modal feature of the i-th character (e.g., the character embedding vector), and i is a positive integer.
  • the above formula 1 is only an example of obtaining the target feature. In practical applications, there may be other forms.
  • the first modal feature and the second modal feature are multiplied by different coefficients and then summed to obtain the target feature, etc. The specifics are not limited here.
  • if the first modal feature and the second modal feature have different dimensions/lengths, they may be transformed first and then added/weighted-summed, so as to improve the accuracy of subsequent character recognition based on the target feature (a minimal sketch of the fusion is given below).
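  • The following NumPy sketch illustrates the fusion in step 504 under stated assumptions: the feature dimensions, the optional projection used to align mismatched dimensions, and the fusion weights are all invented for the example, and this is not the patented implementation itself; it only shows the plain addition of formula 1 and the coefficient-weighted variant mentioned above.

```python
import numpy as np

def fuse_features(visual_feat, char_feat, w_visual=0.5, w_char=0.5, proj=None):
    """Fuse the first modal (e.g. visual) and second modal (character) features of
    one character position into a target feature.

    proj: optional matrix that maps char_feat into the visual feature space when
    the two modalities have different dimensions/lengths.
    """
    if proj is not None:
        char_feat = proj @ char_feat                     # align dimensions first
    # formula-1 style plain vector addition
    added = visual_feat + char_feat
    # variant mentioned above: multiply by different coefficients, then sum
    weighted = w_visual * visual_feat + w_char * char_feat
    return added, weighted

# toy usage: 4 characters with 8-dimensional features per modality
rng = np.random.default_rng(1)
visual = rng.normal(size=(4, 8))                         # first modal features
chars = rng.normal(size=(4, 8))                          # second modal features
targets = np.stack([fuse_features(v, c)[0] for v, c in zip(visual, chars)])
print(targets.shape)                                     # (4, 8) fused target features
```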
  • Step 505 Obtain a first recognition result of the input data based on the target feature.
  • After the data processing device obtains the target feature, it obtains a first recognition result of the input data based on the target feature.
  • the first recognition result may also be referred to as a correction result.
  • the correspondence between the target feature and the plurality of characters is determined, and a set of arrangement modes of the plurality of characters is obtained, the set of arrangement modes including a plurality of arrangement modes, and then based on each arrangement mode in the set of arrangement modes, a maximum likelihood estimation is performed on the last character in each arrangement mode to obtain a first recognition result.
  • the first recognition result is "CAFE”. It can be seen that the first recognition result "CAFE" obtained for the target feature is more accurate than the second recognition result "GAFE”.
  • the above process can be understood as cyclically sorting the arrangement of multiple characters to obtain an arrangement set.
  • the last character is used as the character to be predicted.
  • the last character is predicted by the previous characters. More context information can be used through the arrangement set.
  • the data processing device inputs the target feature into a correction module to obtain a first recognition result.
  • the correction module may be a decoder, a fully connected layer, a convolutional layer, etc.
  • the process of the correction module processing the target feature may be as shown in the following formula 2 and formula 3.
  • Formula 2 (the training objective over character permutations): E_{Z~Z_T} [ Σ_{t=1..T} log p_θ(y_{z_t} | y_{z_<t}, x) ], where E represents the expectation, T is the length of the text/characters, Z_T represents the set of permutations of length T, Z represents a permutation sampled from Z_T, θ represents the model parameters of the correction module, x represents the target feature, z_t represents the t-th character in the permutation Z, and z_<t represents the first t-1 characters in the permutation Z.
  • Formula 3 (the prediction probability of each character): P_i(y) = exp(e(y)^T g(x)) / Σ_{y'} exp(e(y')^T g(x)), where P_i(y) represents the predicted probability that the i-th character is y, exp represents the exponential with base e, e(y) represents the embedding vector of the i-th character, g(x) is used to identify the arrangement mode, exp(e(y)^T g(x)) represents the weight of the i-th character being y, y is any character in the character set, y' ranges over all characters in the character set, and Σ_{y'} exp(e(y')^T g(x)) represents the sum of the weights of the characters in the character set (a small numerical sketch of formula 3 follows below).
  • the character set can be understood as a preset character set or an offline character set.
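  • As a small numerical illustration of formula 3 above, the following sketch computes the softmax over a preset character set; the embedding table, its dimensions, and the name g_x are assumptions made for the example and are not taken from the patent.

```python
import numpy as np

def char_probabilities(g_x, embedding_table):
    """Formula-3 style softmax:
    P_i(y) = exp(e(y)^T g(x)) / sum_{y'} exp(e(y')^T g(x))."""
    logits = embedding_table @ g_x            # e(y)^T g(x) for every y in the set
    logits -= logits.max()                    # numerical stability only
    weights = np.exp(logits)                  # exp(e(y)^T g(x))
    return weights / weights.sum()            # normalize over the character set

# toy usage: a 5-character set with 8-dimensional embeddings
rng = np.random.default_rng(2)
table = rng.normal(size=(5, 8))               # rows are e(y) for each character y
probs = char_probabilities(rng.normal(size=8), table)
print(probs, probs.sum())                     # a distribution summing to 1
```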
  • the correction module can randomly sort the training text during training and use the autoregressive method to predict the context characters.
  • the correction module regards the current predicted character as the last character in the sorting when predicting each character.
  • Different context information (for example, left-to-right and right-to-left) can be learned through the different arrangement modes, thereby improving the accuracy of the first recognition result.
  • Figure 7 takes a training process with four rows as an example, where one circle represents one character.
  • the first row is “white circle, gray circle, gray circle, gray circle”
  • the second row is "white circle, white circle, white circle, white circle"
  • the third row is “white circle, gray circle, white circle, white circle”
  • the fourth row is "white circle, gray circle, gray circle, white circle”.
  • the white circle represents the information that the character cannot see
  • the gray circle represents the information that the character can see.
  • the first row indicates that the first character can see the information of the second character to the fourth character.
  • the arrangement set includes four arrangements, namely: "1-2-3-4", “2-3-4-1", “3-4-1-2", and "4-1-2-3".
  • using "1-2-3-4", the fourth character is inferred to be E; using "2-3-4-1", the first character is inferred to be C; using "3-4-1-2", the second character is inferred to be A; and using "4-1-2-3", the third character is inferred to be F (a minimal sketch of this cyclic-permutation inference is given below).
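  • The cyclic-permutation inference described above can be sketched as follows; this is only an illustration under assumptions, with predict_last_char standing in for whatever scoring the correction module actually applies (e.g., formula 3) to the target feature.

```python
def cyclic_permutations(length):
    """Cyclic rotations of positions 1..length, e.g. for length 4:
    [1,2,3,4], [2,3,4,1], [3,4,1,2], [4,1,2,3]."""
    base = list(range(1, length + 1))
    return [base[i:] + base[:i] for i in range(length)]

def decode_with_permutations(length, predict_last_char):
    """For each rotation, the position placed last is predicted from the positions
    that precede it, so every character is decoded with the others as context."""
    result = [None] * length
    for perm in cyclic_permutations(length):
        context, target_pos = perm[:-1], perm[-1]
        result[target_pos - 1] = predict_last_char(context, target_pos)
    return result

# toy usage with a stand-in predictor that simply looks up the answer
truth = {1: "C", 2: "A", 3: "F", 4: "E"}
print(decode_with_permutations(4, lambda ctx, pos: truth[pos]))  # ['C', 'A', 'F', 'E']
```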
  • In the embodiment of the present application, the features of two modalities (i.e., the first modal feature and the second modal feature) are considered at the same time during character recognition of the input data.
  • Since different modalities express information in different ways, their perspectives on the same thing also differ, so there are some cross-over/complementary phenomena, and there may even be a variety of different information interactions between the modalities.
  • By reasonably processing the features of the two modalities, rich target features can be obtained, thereby improving the recognition accuracy.
  • Moreover, compared with a method that determines the recognition result only from the corrected second modal feature, reintroducing the first modal feature before correction can reduce the problem of over-correction of the second modal feature.
  • the neural network includes a first feature extraction module, a classification module, a second feature extraction module, a feature fusion module and a correction module.
  • the input data is input into the first feature extraction module to obtain the first modal feature
  • the first modal feature is input into the classification module to obtain the second recognition result.
  • the second recognition result is input into the second feature extraction module to obtain the second modal feature.
  • the first modal feature and the second modal feature are input into the feature fusion module to obtain the target feature.
  • the target feature is input into the correction module to obtain the first recognition result.
  • For the case where the input data is image data, the first feature extraction module and the classification module shown in Figure 8 can be understood as submodules of a visual model.
  • For the case where the input data is audio data, the first feature extraction module and the classification module shown in Figure 8 can be understood as submodules of an audio model (a minimal sketch of the Figure 8 wiring is given below).
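  • To make the wiring of the Figure 8 modules concrete, the following is a minimal sketch in which every module is just a placeholder callable; the actual modules (transformer encoder, decoder, fusion layer, etc.) are not reproduced here, and all names and the toy stand-ins are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RecognitionPipeline:
    """Wiring of the Figure 8 modules; each field is any callable with that role."""
    first_feature_extractor: Callable[[Any], Any]   # input data -> first modal feature
    classifier: Callable[[Any], Any]                # first modal feature -> second recognition result
    second_feature_extractor: Callable[[Any], Any]  # second recognition result -> second modal feature
    feature_fusion: Callable[[Any, Any], Any]       # (first, second) features -> target feature
    corrector: Callable[[Any], Any]                 # target feature -> first recognition result

    def run(self, input_data):
        first_feat = self.first_feature_extractor(input_data)
        second_result = self.classifier(first_feat)
        second_feat = self.second_feature_extractor(second_result)
        target_feat = self.feature_fusion(first_feat, second_feat)
        return self.corrector(target_feat), second_result

# toy usage with trivial stand-ins for the five modules
pipeline = RecognitionPipeline(
    first_feature_extractor=lambda img: f"feat({img})",
    classifier=lambda feat: "GAFE",
    second_feature_extractor=lambda result: f"char_feat({result})",
    feature_fusion=lambda a, b: (a, b),
    corrector=lambda target: "CAFE",
)
print(pipeline.run("image"))    # ('CAFE', 'GAFE')
```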
  • the embodiment of the present application further provides a data processing method, as shown in Figure 9, the method may include steps 901 to 906. Steps 901 to 906 are described in detail below.
  • Step 901 obtaining input data.
  • Step 902 extract the first modal feature of the input data.
  • Step 903 Acquire the second modal feature based on the first modal feature.
  • Step 904 fuse the first modal feature and the second modal feature to obtain the target feature.
  • Step 905 Obtain a first recognition result of the input data based on the target feature.
  • Steps 901 to 905 in this embodiment are similar to steps 501 to 505 in the embodiment shown in FIG. 5 , and are not described again here.
  • Step 906 obtaining a target recognition result based on the first recognition result and the second recognition result.
  • the target recognition result is used as the final recognition result of the characters in the input data.
  • After the data processing device obtains the first recognition result and the second recognition result, it obtains a target recognition result based on the first recognition result and the second recognition result, and uses the target recognition result as the character recognition result of the input data.
  • the data processing device first obtains a first probability and a second probability, wherein the first probability is the probability of each character in the first recognition result, and the second probability is the probability of each character in the second recognition result, and then determines the target recognition result based on the first probability and the second probability.
  • the data processing device adds the first probability and the second probability corresponding to the characters at the same position in the first recognition result and the second recognition result (for example, directly adding them or adding them after weighting them respectively, etc.), and then determines the target recognition result based on the added probabilities.
  • the characters at the same position can also be understood as characters with the same position index.
  • the first recognition result and the second recognition result are input into a probability fusion module to obtain a target recognition result.
  • the probability fusion module can also be called a probability residual structure.
  • the processing process of the probability fusion module can be shown as the following formula 4.
  • Formula 4 adds, for each position, the first probability and the second probability, and selects as the output the character from the character pool whose summed probability is greater than a threshold or is the highest, where y_i represents the target recognition result of the i-th character, P_i^0 represents the first probability of the i-th character, and P_i represents the second probability of the i-th character (a minimal sketch of this probability addition is given below).
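  • A minimal sketch of the formula-4 probability residual follows, assuming the two results are already aligned position by position; the five-character pool, the probability arrays, and the threshold handling are invented only for this example.

```python
import numpy as np

def fuse_recognition_probs(first_probs, second_probs, charset, threshold=None):
    """Formula-4 style probability residual: add, per aligned position, the
    probability distributions of the corrected result (first) and the original
    result (second), then output the character with the highest (or
    above-threshold) summed score from the character pool."""
    summed = first_probs + second_probs
    chars = []
    for row in summed:
        idx = int(np.argmax(row))
        if threshold is not None and row[idx] <= threshold:
            continue                    # nothing in the pool passes the threshold
        chars.append(charset[idx])
    return "".join(chars)

# toy usage with an invented 5-character pool: corrected "CAFE" + original "GAFE"
charset = ["C", "G", "A", "F", "E"]
first = np.array([[0.6, 0.3, 0.0, 0.0, 0.1],    # corrected result favours C
                  [0.0, 0.0, 0.9, 0.0, 0.1],
                  [0.0, 0.1, 0.0, 0.8, 0.1],
                  [0.1, 0.0, 0.0, 0.0, 0.9]])
second = np.array([[0.4, 0.5, 0.0, 0.0, 0.1],   # original result favours G
                   [0.0, 0.0, 0.9, 0.0, 0.1],
                   [0.0, 0.1, 0.0, 0.8, 0.1],
                   [0.1, 0.0, 0.0, 0.0, 0.9]])
print(fuse_recognition_probs(first, second, charset))   # CAFE
```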
  • the neural network involved in this embodiment can be shown in Figure 10.
  • the neural network also includes the above-mentioned probability fusion module. Among them, the same modules in the neural network shown in Figure 10 and the neural network shown in Figure 8 are not repeated here.
  • the data processing device can input the first recognition result and the second recognition result into the probability fusion module to obtain the target recognition result.
  • the process of step 906 can be shown in FIG. 11. That is, the first recognition result is "CAFE" and the second recognition result is "GAFE".
  • adding the probabilities of the first character in the two recognition results gives C the largest probability for the first position
  • adding the probabilities of the second character in the two recognition results gives A the largest probability for the second position.
  • adding the probabilities of the third character in the two recognition results gives F the largest probability for the third position.
  • adding the probabilities of the fourth character in the two recognition results gives E the largest probability for the fourth position.
  • the target recognition result thus obtained is "CAFE".
  • the characters of the first recognition result and the second recognition result may be aligned, and then the probabilities may be added.
  • the error rate of the first recognition result output by the correction module can be reduced.
  • For the correction module itself, there are multiple possible correction results. For example, taking "caxe" as an example, assuming that the third character needs to be corrected, there are possibilities such as cafe/cake/cage. If the output result of the visual module can be used as a reference, the correction result can be improved.
  • compared with a method that determines the recognition result only from the corrected second modal feature, reintroducing the first modal feature before correction can reduce the over-correction problem of the second modal feature.
  • the probability residual structure can be used to add the probabilities of the original result output by the visual module and of the corrected result output by the language module (or correction module, text module), thereby combining the advantages of the strong correction ability of the language module and the strong recognition ability of the visual module and improving the overall character recognition ability of the neural network.
  • the data sets include: IIIT, SVT, IC13, SVTP, IC15, CUTE, OOV-ST.
  • V+L-√ is equivalent to the method of the embodiment shown in the aforementioned FIG. 9.
  • V+L-× is equivalent to the method of the embodiment shown in the aforementioned FIG. 5.
  • the average accuracy of multiple samples of V+L-√ on each data set is greater than the average accuracy of multiple samples of V+L-× on each data set, that is, probability addition can improve the overall accuracy of character recognition.
  • the data processing method or neural network provided in the embodiments of the present application can improve the recognition accuracy of text/characters.
  • An embodiment of the data processing device in the embodiment of the present application includes:
  • An acquisition unit 1201 is used to acquire input data, where the input data is image data or audio data;
  • An extraction unit 1202 configured to extract a first modal feature of the input data
  • the acquisition unit 1201 is further configured to acquire a second modal feature based on the first modal feature, wherein the first modal feature and the second modal feature are features of different modalities; the first modal feature is a visual feature of image data or an audio feature of audio data, and the second modal feature is a character feature;
  • a fusion unit 1203, configured to fuse the first modal feature with the second modal feature to obtain a target feature
  • the acquisition unit 1201 is further used to acquire a first recognition result of the input data based on the target feature, where the first recognition result is used to indicate characters contained in the input data.
  • the acquisition unit 1201 is further configured to acquire a target recognition result of the input data based on the second recognition result and the first recognition result.
  • the target recognition result is used as a recognition result of the character in the input data.
  • the target recognition result is used as a final recognition result of the character in the input data.
  • In this embodiment, the features of two modalities (i.e., the first modal feature and the second modal feature) are considered at the same time during character recognition of the input data.
  • Since different modalities express information in different ways, their perspectives on the same thing also differ, so there are some cross/complementary phenomena, and there may even be a variety of different information interactions between the modalities.
  • By reasonably processing the features of the two modalities, rich target features can be obtained, thereby improving the recognition accuracy.
  • Moreover, compared with a method that determines the recognition result only from the corrected second modal feature, reintroducing the first modal feature before correction can reduce the problem of over-correction of the second modal feature.
  • the acquisition unit 1201 adds the probabilities of the original result output by the visual module and of the correction result output by the language module (or correction module, text module), thereby combining the advantages of the strong correction ability of the language module and the strong recognition ability of the visual module and improving the overall character recognition ability of the neural network.
  • the data processing device may include a processor 1301, a memory 1302, and a communication port 1303.
  • the processor 1301, the memory 1302, and the communication port 1303 are interconnected via a line.
  • the memory 1302 stores program instructions and data.
  • the memory 1302 stores program instructions and data corresponding to the steps executed by the data processing device in the corresponding implementation modes shown in the aforementioned FIGS. 1 to 11 .
  • the processor 1301 is used to execute the steps performed by the data processing device shown in any of the embodiments shown in Figures 1 to 11 above.
  • the communication port 1303 can be used to receive and send data, and to execute the steps related to acquisition, sending, and receiving in any of the embodiments shown in FIG. 1 to FIG. 11 .
  • the data processing device may include more or fewer components than those in FIG. 13 , and this application is merely an illustrative description and is not intended to be limiting.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or in the form of software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including a number of instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, read-only memory), random access memory (RAM, random access memory), disk or optical disk and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)

Abstract

An embodiment of the present application discloses a data processing method applied to text recognition/character recognition scenarios. The method includes: acquiring input data, the input data being image data or audio data; acquiring a second modal feature on the basis of a first modal feature of the input data, the first modal feature being a visual feature of the image data or an audio feature of the audio data, and the second modal feature being a character feature; and fusing the first modal feature with the second modal feature to obtain a target feature, which can efficiently fuse information of data of different modalities, so that the obtained target feature has the characteristics of multimodal data and has improved expressive ability. A first recognition result acquired on the basis of the target feature is therefore more accurate. Moreover, compared with a method that determines the recognition result only on the basis of the corrected second modal feature, reintroducing the first modal feature before correction can reduce the problem of over-correction of the second modal feature.

Description

一种数据处理方法及相关设备
本申请要求于2022年10月20日提交中国专利局、申请号为202211289351.X、发明名称为“一种数据处理方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种数据处理方法及相关设备。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
随着光学字符识别(optical character recognition,OCR)技术的快速发展,利用OCR技术代替人力进行识别和处理图像中的文字信息的应用变得越来越广泛。OCR技术被广泛应用于证件识别、车牌识别,广告图片文本识别和票据识别等现实场景。为了避免视觉遮挡等不良因素对识别内容造成干扰,常常使用语言模型对视觉模型识别后的字符信息进行纠正,并将纠正结果作为字符的最终识别结果。然而,纠正结果高度依赖于语言模型学习到的语义信息,可能会导致将正确的识别结果修改为错误的识别结果,即上述识别方式会出现过度纠偏问题。
因此,如何解决文字识别中语言模型的过度纠偏是亟待解决的技术问题。
发明内容
本申请实施例提供了一种数据处理方法及相关设备,用于提升数据字符识别的准确性。
本申请实施例第一方面提供了一种数据处理方法,该方法应用于文本识别/字符识别场景,该方法包括:获取输入数据,该输入图像为图像数据或音频数据;提取输入数据的第一模态特征;基于第一模态特征获取第二模态特征,第一模态特征与第二模态特征为不同模态的特征;第一模态特征为图像数据的视觉特征或者音频数据的音频特征,第二模态特征为字符特征;融合第一模态特征与第二模态特征以得到目标特征。该目标特征同时考虑到第一模态特征与第二模态特征,使得目标特征具有更丰富的多种模态信息。基于目标特征获取输入数据的第一识别结果,第一识别结果用于指示输入数据中含有的字符。
本申请实施例中,根据输入数据的第一模态特征获取第二模态特征,并融合第一模态特征与第二模态特征以得到目标特征,可以高效融合不同模态数据的信息,使得获取的目标特征具有多模态数据的特性,提高目标特征的表达能力。从而根据该目标特征获取的第一识别结果的精度更高。且相较于只根据纠正后的第二模态特征确定识别结果的方法,通过再次引入纠正前的第一模态特征,可以减少第二模态特征的过度纠正问题。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一模态特征获取第二模态特征,包括:基于第一模态特征获取第二识别结果,第二识别结果为图像数据的字符识别结果或音频数据的字符识别结果;基于第二识别结果获取第二模态特征。
该种可能的实现方式中,通过与第一模态特征相关的第二识别结果获取第二模态特征,可以实现对第一模态特征的部分纠正。
可选地,在第一方面的一种可能的实现方式中,上述步骤:提取输入数据的第一模态特征,包括:将输入数据输入第一特征提取模块以得到第一模态特征,第一特征提取模块用于提取视觉特征或音频特征;基于第二识别结果获取第二模态特征,包括:将第二识别结果输入第二特征提取模块以得到第二模态特征,第二特征提取模块用于提取字符特征。
该种可能的实现方式中,以第一特征提取模块用于提取视觉特征为例,为了减少视觉遮挡等不良因 素对识别内容造成干扰,可以使用第二提取特征对视觉模块识别到的第一模态特征进行纠正。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:基于第二识别结果与第一识别结果获取输入数据的目标识别结果,该目标识别结果作为输入数据中字符的识别结果。或者理解为,将目标识别结果作为输入数据中字符的最终识别结果。
该种可能的实现方式中,通过同时考虑第一模态特征得到的原始结果(即第二识别结果)与第二模态特征得到的纠正结果(即第一识别结果)。尤其是对于图像识别来说。可以实现结合语言模块(即获取第二模态特征的模块)的纠正能力强以及视觉模块(即获取第一模态特征的模块)识别能力强的优点,从而提高图像中字符的识别能力。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第二识别结果与第一识别结果获取输入数据的目标识别结果,包括:获取第一概率与第二概率,第一概率为第一识别结果中各字符的概率,第二概率为第二识别结果中各字符的概率;基于第一概率与第二概率确定目标识别结果。
该种可能的实现方式中,通过融合各字符在第一识别结果中的第一概率以及各字符在第二识别结果中的第二概率,同时考虑到初始模态对应结果中各字符的概率以及纠正结果中各字符的概率,从而提升识别各字符的准确率。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一概率与第二概率确定目标识别结果,包括:将第一识别结果与第二识别结果中相同位置字符对应的第一概率与第二概率相加;基于相加后的概率确定目标识别结果。其中,相加可以是直接相加,也可以是加权后再相加等,具体此处不做限定。
该种可能的实现方式中,通过初始模态对应结果中各字符的概率以及纠正结果中各字符的概率相加,并基于相加后的概率获取目标识别结果,从而提升目标识别结果的准确率。
可选地,在第一方面的一种可能的实现方式中,上述步骤:融合第一模态特征与第二模态特征以得到目标特征,包括:将相同位置字符的第一模态特征与第二模态特征融合以得到目标特征。
该种可能的实现方式中,通过将相同位置字符的不同模态特征进行融合,使得目标特征具有不同模态的信息,从而提升目标特征的表达能力。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于目标特征获取输入数据的第一识别结果,包括:确定目标特征与多个字符的对应关系;获取多个字符的排列方式集合,排列方式集合包括多种排列方式;基于排列方式集合中的每个排列方式对每个排列方式下的最后一个字符进行最大似然估计,以得到第一识别结果。
该种可能的实现方式中,通过将排列方式集合中每个排列方式下的最后一个字符作为预测字符进行最大似然估计,可以基于不同的排列方式学习到不同的上下文信息(例如,左向右与右向左),从而提升第一识别结果的准确率。
可选地,在第一方面的一种可能的实现方式中,上述的输入数据为含有字符的图像数据,第一模态特征为视觉特征,第二模态特征为字符特征。
该种可能的实现方式中,该方法可以应用于图像中的字符识别或文字识别场景。例如,证件信息、票据信息的识别/自动录入场景、残疾人的辅助阅读场景、违禁词的过滤场景等。
可选地,在第一方面的一种可能的实现方式中,上述的输入数据为音频数据,第一模态特征为音频特征,第二模态特征为字符特征。
该种可能的实现方式中,该方法可以应用于音频中的字符识别或文字识别场景。例如,聋哑人的辅助学习场景等。
本申请实施例第二方面提供了一种数据处理设备,数据处理设备应用于文本识别/字符识别场景,数据处理设备包括:获取单元,用于获取输入数据,该输入图像为图像数据或音频数据;提取单元,用于提取输入数据的第一模态特征;获取单元,还用于基于第一模态特征获取第二模态特征,第一模态特征与第二模态特征为不同模态的特征;第一模态特征为图像数据的视觉特征或者音频数据的音频特征,第二模态特征为字符特征;融合单元,用于融合第一模态特征与第二模态特征以得到目标特征;获取单元,还用于基于目标特征获取输入数据的第一识别结果,第一识别结果用于指示输入数据中含有的字符。
可选地,在第二方面的一种可能的实现方式中,上述的获取单元,具体用于基于第一模态特征获取第二识别结果,第二识别结果为图像数据的字符识别结果或音频数据的字符识别结果;获取单元,具体用于基于第二识别结果获取第二模态特征。
可选地,在第二方面的一种可能的实现方式中,上述的提取单元,具体用于将输入数据输入第一特征提取模块以得到第一模态特征,第一特征提取模块用于提取视觉特征或音频特征;获取单元,具体用于将第二识别结果输入第二特征提取模块以得到第二模态特征,第二特征提取模块用于提取字符特征。
可选地,在第二方面的一种可能的实现方式中,上述的获取单元,还用于基于第二识别结果与第一识别结果获取输入数据的目标识别结果,该目标识别结果作为输入数据中字符的识别结果。或者理解为,将目标识别结果作为输入数据中字符的最终识别结果。
可选地,在第二方面的一种可能的实现方式中,上述的获取单元,具体用于获取第一概率与第二概率,第一概率为第一识别结果中各字符的概率,第二概率为第二识别结果中各字符的概率;获取单元,具体用于基于第一概率与第二概率确定目标识别结果。
可选地,在第二方面的一种可能的实现方式中,上述的获取单元,具体用于将第一识别结果与第二识别结果中相同位置字符对应的第一概率与第二概率相加;获取单元,具体用于基于相加后的概率确定目标识别结果。
可选地,在第二方面的一种可能的实现方式中,上述的融合单元,具体用于将相同位置字符的第一模态特征与第二模态特征融合以得到目标特征。
可选地,在第二方面的一种可能的实现方式中,上述的获取单元,具体用于确定目标特征与多个字符的对应关系;获取单元,具体用于获取多个字符的排列方式集合,排列方式集合包括多种排列方式;获取单元,具体用于基于排列方式集合中的每个排列方式对每个排列方式下的最后一个字符进行最大似然估计,以得到第一识别结果。
可选地,在第二方面的一种可能的实现方式中,上述的输入数据为含有字符的图像数据,第一模态特征为视觉特征,第二模态特征为字符特征。
可选地,在第二方面的一种可能的实现方式中,上述的输入数据为音频数据,第一模态特征为音频特征,第二模态特征为字符特征。
本申请实施例第三方面提供了一种数据处理设备,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该数据处理设备实现上述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第四方面提供了一种计算机可读介质,其上存储有计算机程序或指令,当计算机程序或指令在计算机上运行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第五方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法。
其中,第二、第三、第四、第五方面或者其中任一种可能实现方式所带来的技术效果可参见第一方面或第一方面不同可能实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请具有以下优点:根据输入数据的第一模态特征获取第二模态特征(可以理解为第一模态特征的纠正过程),并融合第一模态特征与第二模态特征以得到目标特征,从而根据该目标特征获取的第一识别结果的精度更高。通过在对输入数据进行字符识别的过程中,同时考虑到两个模态的特征(即第一模态特征与第二模态特征)。由于不同模态的表现方式不一样,看待事物的角度也会不一样,所以存在一些交叉/互补的现象,甚至模态间可能还存在多种不同的信息交互,通过合理的处理两个模态的特征,可以得到丰富的目标特征,从而可以提高识别精度。且相较于只根据纠正后的第二模态特征确定识别结果的方法,通过再次引入纠正前的第一模态特征,可以减少第二模态特征的过度纠正问题。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需 要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种应用场景示意图;
图2A为本申请实施例提供的票据识别场景示意图;
图2B为本申请实施例提供的证件识别场景示意图;
图3为本申请实施例提供的系统架构的结构示意图;
图4为本申请实施例提供的一种芯片硬件结构示意图;
图5为本申请实施例提供的数据处理方法一个流程示意图;
图6为本申请实施例提供的输入数据的一种示例图;
图7为本申请实施例提供的纠正模块的训练方式与推理方式的示意图;
图8为本申请实施例提供的神经网络的一种示意图;
图9为本申请实施例提供的数据处理方法另一个流程示意图;
图10为本申请实施例提供的神经网络的另一种示意图;
图11为本申请实施例提供的概率融合模块的处理流程示意图;
图12为本申请实施例提供的数据处理设备的一个结构示意图;
图13为本申请实施例提供的数据处理设备的另一个结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
为了便于理解,下面先对本申请实施例主要涉及的相关术语和概念进行介绍。
1、神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以Xs和截距b为输入的运算单元,该运算单元的输出可以为:
其中,s=1、2、……n,n为大于1的自然数,Ws为Xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是Relu函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
神经网络中的每一层的工作可以用数学表达式y=a(Wx+b)来描述:从物理层面神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中1、2、3的操作由Wx完成,4的操作由+b完成,5的操作则由a()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文的输入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
2、卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷 积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使同一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
3、transformer
transformer结构是一种包含编码器与解码器的特征提取网络(类别于卷积神经网络)。
编码器:通过自注意力的方式在全局感受野下进行特征学习,例如像素点的特征。
解码器:通过自注意力与交叉注意力来学习所需模块的特征,例如输出框的特征。
下面对注意力(也可以称为注意力机制)进行描述:
注意力机制可以快速提取稀疏数据的重要特征。注意力机制是发生在编码器和解码器之间,也可以说是发生在输入句子和生成句子之间。而自注意力模型中的自注意力机制则发生在输入序列内部,或者输出序列内部,可以抽取到同一个句子内间隔较远的单词之间的联系,比如句法特征(短语结构)。自注意力机制通过QKV提供了一种有效的捕捉全局上下文信息的建模方式。假定输入为Q(query),以键值对(K,V)形式存储上下文。那么注意力机制其实是query到一系列键值对(key,value)上的映射函数。attention函数的本质可以被描述为一个查询(query)到一系列(键key-值value)对的映射。attention本质上是为序列中每个元素都分配一个权重系数,这也可以理解为软寻址。如果序列中每一个元素都以(K,V)形式存储,那么attention则通过计算Q和K的相似度来完成寻址。Q和K计算出来的相似度反映了取出来的V值的重要程度,即权重,然后加权求和就得到最后的特征值。
注意力的计算主要分为三步,第一步是将query和每个key进行相似度计算得到权重,常用的相似度函数有点积,拼接,感知机等;然后第二步一般是使用一个softmax函数(一方面可以进行归一化,得到所有权重系数之和为1的概率分布。另一方面可以用softmax函数的特性突出重要元素的权重)对这些权重进行归一化;最后将权重和相应的键值value进行加权求和得到最后的特征值。具体计算公式可以如下:
其中,d为QK矩阵的维度。
另外,注意力包括自注意力与交叉注意力,自注意可以理解为是特殊的注意力,即QKV的输入一致。而交叉注意力中的QKV的输入不一致。注意力是利用特征之间的相似程度(例如内积)作为权重来集成被查询特征作为当前特征的更新值。自注意力是基于特征图本身的关注而提取的注意力。
对于卷积而言,卷积核的设置限制了感受野的大小,导致网络往往需要多层的堆叠才能关注到整个特征图。而自注意的优势就是它的关注是全局的,它能通过简单的查询与赋值就能获取到特征图的全局空间信息。自注意力在查询、键、值(query key value,QKV)模型中的特殊点在于QKV对应的输入是一致的。后续会对QKV模型进行描述。
4、多层感知器(multilayer perceptron,MLP)
多层感知器,也可以称为多层感知机,是一种前馈人工神经网络模型,其将输入映射到单一的输出的上。
5、损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
6、模态
一般来说,模态是指事物发生或存在的方式。或者说,对每一种信息的来源或者形式,都可以称为一种模态(Modality),目前研究领域中主要是对图像,文本,语音等模态的处理。
上述中的模态也可以理解为是“感官”,即生物凭借感知器官与经验来接收信息的通道,例如:人类有视觉、听觉、触觉、味觉和嗅觉等等模态。多模态可以理解为是多种感官进行融合,例如,人类可以通过声音、肢体语言、信息载体(例如文字、图片、音频、视频等)、环境等多个通道与智能设备进行交流,智能设备融合多模态信息后作出对人类的意图判断,并通过文字、声音、灯带等多种方式反馈给人类。
因为不同模态的表现方式不一样,看待事物的角度也会不一样,所以存在一些交叉(所以存在信息冗余),互补(所以比单特征更优秀)的现象,甚至模态间可能还存在多种不同的信息交互,如果能合理的处理多模态信息,就能得到丰富特征信息。
接下来对本申请实施例提供的数据处理方法的应用场景进行描述。
该应用场景如图1所示,该场景包括:终端设备101与服务器102。其中,终端设备101和服务器102可通过通信网络进行通信连接,该网络可以为局域网、也可以是通过中继(relay)设备转接的广域网等。终端设备101中可安装有各种客户端。终端设备101的客户端和服务器102之间通过通信网络建立通信连接后,终端设备101的客户端可以将待处理数据发送给服务器102,由服务器102对待处理数据进行AI处理(例如:识别、分类等)得到处理结果,再将处理结果发送给终端设备101的客户端。
当终端设备101和服务器102之间进行通信连接的通信网络为局域网时,示例性的,该通信网络可以是无线保真(wireless fidelity,wifi)热点网络、蓝牙(bluetooth,BT)网络或近距离无线通信技术(near field communication,NFC)网络等近距离通信网络。
当终端设备101和服务器102之间进行通信连接的通信网络为广域网时,示例性的,该通信网络可以是第三代移动通信技术(3rd-g ene ra tion wi reless tele phone technology,3G)网络、第四代移动通信技术(the 4th generation mobile communication technology,4G)网络、第五代移动通信技术(5th-generation mobile communication technology,5G)网络、未来演进的公共陆地移动网络(public land mobile network,PLMN)或因特网等。
上述的终端设备101可以是手机、平板电脑(pad)、便携式游戏机、掌上电脑(personal digital assistant,PDA)、笔记本电脑、超级移动个人计算机(ultra mobile personal computer,UMPC)、手持计算机、上网本、车载媒体播放设备、可穿戴电子设备、虚拟现实(virtual reality,VR)终端设备、增强现实(augmented reality,AR)、车辆、车载终端、飞机终端、智能机器人等终端设备。
上述的服务器102可以是云服务器、网络服务器、应用服务器以及管理服务器等具有处理计算机视觉任务的设备或服务器。该计算机视觉任务包括以下至少一项或多项:识别、分类等。
可选地,上述图1所示的场景可以理解为是云端交互场景,该场景下的数据处理方法可以以云服务如软件即服务(software as a service,SaaS)或者功能即服务(function as a service,FaaS)的形式提供给用户使用。例如,用于处理计算机视觉任务的服务器可以部署到公有云,从而提供一项对外发布的云服务,该云服务用于对图像进行分类,然后将图像进行字符识别。当数据处理方法作为一项服务对 外发布时,考虑到安全性,还可以对上传数据如图像进行保护,例如可以对图像进行加密处理。在一些实施例中,用于处理计算机视觉任务的服务器也可以部署到私有云,从而提供一项云服务对内使用。当然,用于处理计算机视觉任务的服务器也可以部署到混合云。其中,混合云是指包括至少一个公有云和至少一个私有云的架构。
在一种可能实现的方式中,当数据处理方法以云服务的形式提供给用户使用时,该云服务可以提供应用程序编程接口(application programming interface,API)和/或用户界面(也称作用户接口)。其中,用户界面可以是图形用户界面(graphical user interface)或者是命令用户界面(command user interface,CUI)。如此,服务调用方可以直接调用该云服务提供的API进行数据处理,例如对图像进行分类,当然,云服务也可以接收用户通过GUI或CUI提交的图像,并对图像进行分类,返回分类结果。
在另一种可能实现的方式中,本申请实施例提供的数据处理方法可以以封装好的软件包提供给用户使用。具体地,用户购买软件包后可以在该用户的运行环境下安装使用。当然,上述软件包也可以预安装在计算设备,以用于数据处理。
可以理解的是,上述图1所示的场景为云端交互的场景。即终端设备可以接收用户的指令,例如终端设备可以获取用户输入/选择的图像数据,然后向服务器发起请求,使得服务器针对终端设备得到的图像数据执行数据处理应用(例如,分类、分割、检测、图像生成等的计算机视觉任务),从而得到针对图像数据对应的处理结果。示例性的,终端设备可以获取用户输入的图像,然后向服务器发起字符(或称为文本)识别请求,使得服务器对该图像进行字符识别,从而得到图像的字符识别结果。并向终端设备发送该字符识别结果。进而终端设备可以显示图像的字符识别结果,以供用户观看和使用。
在实际应用中,若终端设备的算力足够处理计算机视觉任务,也可以将图1中服务器执行的步骤迁移到终端设备中实现。即终端设备可以接收用户的指令,例如终端设备可以获取用户输入/选择的图像数据,然后对图像数据执行数据处理应用(例如,分类、分割、检测、图像生成等的计算机视觉任务),从而得到针对图像数据对应的处理结果。示例性的,终端设备可以获取用户输入的图像,然后对该图像进行字符识别,从而得到图像的字符识别结果。并显示图像的字符识别结果,以供用户观看和使用。
可选地,上述的应用场景具体可以是光学字符识别(optical character recognition,OCR)场景。例如该场景包括以下至少一项或多项:证件信息(或称为卡证信息)、票据信息的识别/自动录入场景,残疾人的辅助阅读场景,或者是应用于违禁词的过滤场景等。
示例性的,以输入数据为图像数据/文档,计算机视觉任务是分类任务为例。终端设备101可以向服务器102发送图像数据/文档,服务器102对图像数据/文档进行分类识别得到分类结果。该分类结果包括图像数据/文档的类别标签,该类别标签用于表征图像数据/文档的类别。具体地,该类别可以包括卡证、票据、标签、邮件或者文件等类别。在一些可能的实现方式中,图像数据/文档的类别还可以进一步分为子类别,如卡证可以分为工卡、银行卡、通行证、驾驶证等子类别,票据可以包括购物小票、打车票等子类别。在一些实施例中,分类结果还可以包括图像数据/文档属于对应类别的置信度。其中,置信度是根据经验确定的、用于表征可信程度的概率值。置信度可以是取值范围为[0,1]的数值,该数值越接近1,表明可信程度越高,该数值越接近0,表明可信程度越低。
示例1,票据识别场景如图2A所示。在一种可能实现的方式中,终端设备获取用户拍照或扫描后的票据图像,由终端设备对该票据图像进行OCR文字识别得到识别结果(例如,日期、公司、金额等)。并根据该识别结果进行信息统计/报销等处理。在另一种可能实现的方式中,终端设备获取用户拍照或扫描后的票据图像之后,终端设备获取用户拍照或扫描后的票据图像,并向服务器发送该票据图像,服务器对该票据图像进行OCR文字识别得到识别结果(例如,日期、公司、金额等)。并将该识别结果发给终端设备,从而用户可以使用该识别结果进行信息统计/报销等处理。
示例2,票据识别场景如图2B所示。在一种可能实现的方式中,终端设备获取用户拍照或扫描后的证件图像,由终端设备对该证件图像进行OCR文字识别得到识别结果(例如,姓名、住址、联系电话、日期等)。并根据该识别结果进行身份核验等处理。在另一种可能实现的方式中,终端设备获取用户拍照或扫描后的证件图像之后,终端设备获取用户拍照或扫描后的证件图像,并向服务器发送该证件图像,服务器对该证件图像进行OCR文字识别得到识别结果(例如,姓名、住址、联系电话、日期)。并将该 识别结果发给终端设备,从而用户可以使用该识别结果进行身份核验等处理。
随着OCR技术的快速发展,利用OCR技术代替人力进行识别和处理图像中的文字信息的应用变得越来越广泛。OCR技术被广泛应用于证件识别、车牌识别,广告图片文本识别和票据识别等现实场景。为了避免视觉遮挡等不良因素对识别内容造成干扰,常常使用语言模型对视觉模型识别后的字符信息进行纠正,并将纠正结果作为字符的最终识别结果。然而,纠正结果高度依赖于语言模型学习到的语义信息,可能会导致将正确的识别结果修改为错误的识别结果,即上述识别方式会出现过度纠偏问题。因此,如何解决文字识别中语言模型的过度纠偏是亟待解决的技术问题。
为了解决上述问题,本申请实施例提供一种数据处理方法及相关设备,通过在对输入数据进行字符识别的过程中,同时考虑到两个模态的特征(即第一模态特征与第二模态特征)。由于不同模态的表现方式不一样,看待事物的角度也会不一样,所以存在一些交叉/互补的现象,甚至模态间可能还存在多种不同的信息交互,通过合理的处理两个模态的特征,可以得到丰富的目标特征,从而可以提高识别精度。且相较于只根据纠正后的第二模态特征确定识别结果的方法,通过再次引入纠正前的第一模态特征,可以减少第二模态特征的过度纠正问题。
下面介绍本申请实施例提供的系统架构。
参见附图3,本申请实施例提供了一种系统架构300。如系统架构300所示,数据采集设备360用于采集训练数据,本申请实施例中训练数据包括:音频样本或含有字符的图像样本等。并将训练数据存入数据库330,训练设备320基于数据库330中维护的训练数据训练得到目标模型/规则301。下面将更详细地描述训练设备320如何基于训练数据得到目标模型/规则301,该目标模型/规则301能够用于实现本申请实施例提供的数据处理方法所应用的计算机视觉任务。该计算机视觉任务可以包括:识别、分类等任务。本申请实施例中的目标模型/规则301具体可以包括以下至少一项或多项:CNN、transformer、MLP等。需要说明的是,在实际的应用中,数据库330中维护的训练数据不一定都来自于数据采集设备360的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备320也不一定完全基于数据库330维护的训练数据进行目标模型/规则301的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备320训练得到的目标模型/规则301可以应用于不同的系统或设备中,如应用于图3所示的执行设备310,执行设备310可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)设备/虚拟现实(virtual reality,VR)设备,车载终端等。当然,执行设备310还可以是服务器或者云端等。在附图3中,执行设备310配置有I/O接口312,用于与外部设备进行数据交互,用户可以通过客户设备340向I/O接口312输入数据,输入数据在本申请实施例中可以包括:图像数据、音频数据等。另外该输入数据可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库,具体此处不做限定。
预处理模块313用于根据I/O接口312接收到的输入数据进行预处理,在本申请实施例中,预处理模块313可以用于对输入数据进行拆分得到子数据集合。例如:输入数据为图像数据,预处理模块313用于对图像进行拆分得到多个图像块。
在执行设备310对输入数据进行预处理,或者在执行设备310的计算模块311执行计算等相关的处理过程中,执行设备310可以调用数据存储系统350中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统350中。
最后,I/O接口312将处理结果,如得到的上述计算机视觉任务对应的结果返回给客户设备340,从而提供给用户。
值得说明的是,训练设备320可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则301,该相应的目标模型/规则301即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图3中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口312提供的界 面进行操作。另一种情况下,客户设备340可以自动地向I/O接口312发送输入数据,如果要求客户设备340自动发送输入数据需要获得用户的授权,则用户可以在客户设备340中设置相应权限。用户可以在客户设备340查看执行设备310输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备340也可以作为数据采集端,采集如图所示输入I/O接口312的输入数据及输出I/O接口312的输出结果作为新的样本数据,并存入数据库330。当然,也可以不经过客户设备340进行采集,而是由I/O接口312直接将如图所示输入I/O接口312的输入数据及输出I/O接口312的输出结果,作为新的样本数据存入数据库330。
值得注意的是,附图3仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图3中,数据存储系统350相对执行设备310是外部存储器,在其它情况下,也可以将数据存储系统350置于执行设备310中。
如图3所示,根据训练设备320训练得到目标模型/规则301,本申请实施例中的目标模型/规则301具体可以为目标神经网络。
上述图1所示场景中的终端设备具体可以是图3中的客户设备340或执行设备310,其中,数据存储系统350可以存储执行设备310的待处理数据,数据存储系统350可以集成在执行设备310上,也可以设置在云上或其它网络服务器上。
下面介绍本申请实施例提供的一种芯片硬件结构。
图4为本申请实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器40。该芯片可以被设置在如图3所示的执行设备310中,用以完成计算模块311的计算工作。该芯片也可以被设置在如图3所示的训练设备320中,用以完成训练设备320的训练工作并输出目标模型/规则301。
神经网络处理器40可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器40作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路403,控制器404控制运算电路403提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路403内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路403是二维脉动阵列。运算电路403还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路403是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路403从权重存储器402中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器401中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器408中。
向量计算单元407可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元407可以用于神经网络中非卷积/非FC层的网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
在一些实现中,向量计算单元407将经处理的输出的向量存储到统一存储器406。例如,向量计算单元407可以将非线性函数应用到运算电路403的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元407生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路403的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器406用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器405(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器401和/或统一存储器406、将外部存储器中的权重数据存入权重存储器402,以及将统一存储器406中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)410,用于通过总线实现主CPU、DMAC和取指存储器409之间进行交互。
与控制器404连接的取指存储器(instruction fetch buffer)409,用于存储控制器404使用的指令。
控制器404,用于调用取指存储器409中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器406,输入存储器401,权重存储器402以及取指存储器409均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
下面对本申请实施例提供的数据处理方法进行描述。该方法可以由数据处理设备执行,也可以由数据处理设备的部件(例如处理器、芯片、或芯片系统等)执行。该数据处理设备可以是前述图1至图2B中的服务器或终端设备。当然,该方法也可以是由服务器和终端设备构成的系统执行(如前述图1所示)。可选地,该方法可以由数据处理设备中的CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。另外,本申请实施例所涉及的数据可以是指文本、图像、音频、视频等,为了方便描述,本文仅以数据是图像为例进行示例性说明。
请参阅图5,本申请实施例提供的数据处理方法的一个流程示意图,该方法可以包括步骤501至步骤505。下面对步骤501至步骤505进行详细说明。
步骤501,获取输入数据。
本申请实施例中,数据处理设备获取输入数据的方式有多种方式,可以是通过采集/拍摄的方式,也可以是通过接收其他设备发送的方式,还可以是从数据库中选取的方式等,具体此处不做限定。
本申请实施例中仅以输入数据是含有字符的图像数据为例进行示例性描述,在实际应用中,该输入数据还可以是音频数据、视频数据等,具体此处不做限定。其中,字符也可以理解为是文字(例如,中文、英文等)。
例如,在输入数据是图像数据的情况下,该方法可以应用于图像中的字符识别或文字识别场景。例如,证件信息、票据信息的识别/自动录入场景、残疾人的辅助阅读场景、违禁词的过滤场景等。
又例如,在输入数据是音频数据的情况下,该方法可以应用于音频中的字符识别或文字识别场景。例如,聋哑人的辅助学习场景等。
示例性的,以输入数据为含有字符的图像数据为例,该输入数据可以如图6所示。
步骤502,提取输入数据的第一模态特征。
数据处理设备获取输入数据之后,可以提取输入数据的第一模态特征。
可选地,数据处理设备将输入数据输入第一特征提取模块以得到第一模态特征。该第一特征提取模块可以包括transformer的编码器,也可以包括CNN的卷积层/池化层,还可以是MLP等,第一特征提取模块的具体结构可以根据实际需要设置,此处不做限定。
另外,第一模态特征与输入数据的模态相关。若输入数据为图像数据,则第一特征提取模块用于提取数据的视觉特征,即第一模态特征为视觉特征(或者称为视觉特征向量)。若输入数据为音频数据,则第一特征提取模块用于提取数据的音频特征,即则第一模态特征为音频特征。
步骤503,基于第一模态特征获取第二模态特征。
数据处理设备获取第一模态特征之后,可以基于第一模态特征获取第二模态特征。该第二模态特征为字符特征,且第一模态特征与第二模态特征为不同模态的特征。其中,关于模态的描述可以参考前述相关术语中的解释,此处不再赘述。
可选地,数据处理设备基于第一模态特征获取第二识别结果。该第二识别结果也可以理解为是输入数据中字符的初步识别结果。并将第二识别结果输入第二提取模块以得到第二模态特征。该第二特征提取模块用于提取字符的字符特征(或者称为字符特征向量)。对于分类任务来说,第二识别结果可以理解为是初步的分类结果。其中,第二特征提取模块与第一特征提取模块类似,可以是transformer的编码器、卷积层/池化层、MLP等。对于文字识别(或称为字符识别)的场景,第二特征提取模块常常为transformer的编码器。
具体的,对于分类任务来说,数据处理设备将第一模态特征输入分类模块以得到第二识别结果,该分类模块与第一特征提取模块对应。例如,在第一特征提取模块为编码器的情况下,分类模块可以是解码器。
示例性的,延续前述图6的举例,第二识别结果为“GAFE”。
步骤504,融合第一模态特征与第二模态特征以得到目标特征。
数据处理设备获取第二模态特征之后,可以融合第一模态特征与第二模态特征以得到目标特征。该步骤可以高效融合不同模态数据的信息,使得获取的目标特征具有多模态数据的特性,提高目标特征的表达能力。
可选地,将相同位置字符的第一模态特征与第二模态特征融合以得到目标特征。具体的,数据处理设备可以将第一模态特征与第二模态特征输入特征融合模块进行对齐融合,以得到目标特征。其中,该融合可以是向量相加、加权求和等,具体此处不做限定。该特征融合模块用于将相同位置字符对应的不同模态特征进行融合。例如,该融合层是transformer结构等。
示例性的,上述过程如公式一所示:
公式一:
其中,Ei表示融合后第i个字符的特征向量,表示第i个字符的第一模态特征(例如视觉特征向量),表示第i个字符的第二模态特征(例如字符嵌入向量),i为正整数。
可以理解的是,上述公式一只是获取目标特征的一种举例,在实际应用中,还可以有其他形式,例如,第一模态特征与第二模态特征分别乘以不同的系数后进行求和以得到目标特征等,具体此处不做限定。
需要说明的是,若第一模态特征与第二模态特征的维度/长度不同时,也可以对第一模态特征和第二模态特征先进行特征变换再相加/加权求和等,从而提高后续基于目标特征进行字符识别的精度。
步骤505,基于目标特征获取输入数据的第一识别结果。
数据处理设备获取目标特征之后,基于该目标特征获取输入数据的第一识别结果。该第一识别结果也可以称为纠正结果。
可选地,确定目标特征与多个字符的对应关系。并获取多个字符的排列方式集合,该排列方式集合包括多种排列方式。再基于排列方式集合中的每个排列方式对每个排列方式下的最后一个字符进行最大似然估计,以得到第一识别结果。
示例性的,延续上述举例。第一识别结果为“CAFE”。可以看出,针对于目标特征获取的第一识别结果“CAFE”相较于第二识别结果“GAFE”更加准确。
上述过程可以理解为,将多个字符的排列方式进行循环排序,以得到排列方式集合。针对排列方式集合中的每一个排列组合,将最后一个字符作为待预测字符。通过前面的字符预测最后一个字符。通过排列方式集合可以利用更多的上下文信息。
具体的,对于分类任务来说,数据处理设备将目标特征输入纠正模块以得到第一识别结果,该纠正模块可以是解码器、全连接层、卷积层等。
示例性的,上述纠正模块处理目标特征的过程可以如下述公式二与公式三所示。
公式二:
其中,E表示期望,T为文本/字符的长度,ZT表示长度为T的排列方式集合,Z表示从ZT中采样得到的一种排列方式,θ表示纠正模块的模型参数,x表示目标特征,Zt表示Z排列方式 中的第t个字符,Z<t表示Z排列方式中的前t-1个字符。
公式三:
其中,Pi(y)表示第i个字符是y的预测概率,exp表示以e为底的指数,e(y)表示第i个字符的嵌入向量(embedding),g(x)用于标识排列方式。exp(e(y)Tg(x))表示第i个字符为y的权重,y为字符集中任意一个字符,y′为字符集中的所有字符,∑y'exp(e(y')Tg(x))表示字符集中每个字符的权重和。其中,字符集可以理解为预设的字符集或离线字符集。
可以理解的是,上述公式二与公式三只是获取第一识别结果的一种举例,在实际应用中,还可以有其他形式,具体此处不做限定。
进一步的,为了提升推理过程中预测字符的准确性,上述纠正模块在训练中可以对训练文本进行随机排序,采用自回归方法预测上下文字符。在推理过程中,纠正模块在预测每个字符时将当前预测字符视为排序中的最后一个字符。通过不同的排列方式学习到不同的上下文信息(例如,左向右与右向左),从而提升第一识别结果的准确率。具体流程可以如图7所示。以训练过程中有四行为例,以一圈表示一字符。第一行是“白圈、灰圈、灰圈、灰圈”,第二行是“白圈、白圈、白圈、白圈”,第三行是“白圈、灰圈、白圈、白圈”,第四行是“白圈、灰圈、灰圈、白圈”。其中,白圈表示该字符看不到的信息,灰圈表示该字符能看到的信息。例如,第一行表示表示第一个字符能看到第二个字符到第四个字符的信息。在推理过程中,排列方式集合包括4个排列方式,分别为:“1-2-3-4”、“2-3-4-1”、“3-4-1-2”、“4-1-2-3”。利用“1-2-3-4”推测出第4个字符为E,利用“2-3-4-1”推测出第1个字符为C,利用“3-4-1-2”推测出第2个字符为A,利用“4-1-2-3”推测出第3个字符为F。
本申请实施例中,通过在对输入数据进行字符识别的过程中,同时考虑到两个模态的特征(即第一模态特征与第二模态特征)。由于不同模态的表现方式不一样,看待事物的角度也会不一样,所以存在一些交叉/互补的现象,甚至模态间可能还存在多种不同的信息交互,通过合理的处理两个模态的特征,可以得到丰富的目标特征,从而可以提高识别精度。且相较于只根据纠正后的第二模态特征确定识别结果的方法,通过再次引入纠正前的第一模态特征,可以减少第二模态特征的过度纠正问题。
为了更直观理解图5所示实施例中各模块之间的关系,下面结合图8对本申请实施例涉及的神经网络进行描述。该神经网络包括第一特征提取模块、分类模块、第二特征提取模块、特征融合模块以及纠正模块。将输入数据输入第一特征提取模块以得到第一模态特征,将第一模态特征输入分类模块以得到第二识别结果。将第二识别结果输入第二特征提取模块以得到第二模态特征。将第一模态特征与第二模态特征输入特征融合模块以得到目标特征。将目标特征输入纠正模块以得到第一识别结果。关于各模块的结构可以参考前述描述,此处不再赘述。
例如,对于输入数据的图像数据的情况,图8所示的第一特征提取模块与分类模块可以理解为是视觉模型的子模块。对于输入数据的音频数据的情况,图8所示的第一特征提取模块与分类模块可以理解为是音频模型的子模块。
另外,为了充分利用两个模态的信息,本申请实施例还提供一种数据处理方法,如图9所示,该方法可以包括步骤901至步骤906。下面对步骤901至步骤906进行详细说明。
步骤901,获取输入数据。
步骤902,提取输入数据的第一模态特征。
步骤903,基于第一模态特征获取第二模态特征。
步骤904,融合第一模态特征与第二模态特征以得到目标特征。
步骤905,基于目标特征获取输入数据的第一识别结果。
本实施例中的步骤901至步骤905与前述图5所示实施例中的步骤501至步骤505类似,此处不再赘述。
步骤906,基于第一识别结果与第二识别结果获取目标识别结果。或者理解为,将目标识别结果作为输入数据中字符的最终识别结果。
数据处理设备获取第一识别结果与第二识别结果之后,基于第一识别结果与第二识别结果获取目标识别结果,将目标识别结果作为输入数据的字符识别结果。
可选地,数据处理设备先获取第一概率与第二概率,该第一概率为第一识别结果中各字符的概率,第二概率为第二识别结果中各字符的概率。再基于第一概率与第二概率确定目标识别结果。
具体的,数据处理设备将第一识别结果与第二识别结果中相同位置字符对应的第一概率与第二概率相加(例如,直接相加或各自加权后再相加等)。再基于相加后的概率确定目标识别结果。其中,该相同位置字符也可以理解为是相同位置索引的字符。
例如,将第一识别结果与第二识别结果输入概率融合模块以得到目标识别结果。该概率融合模块也可以称为概率残差结构。
示例性的,概率融合模块的处理过程可以如下述公式四所示。
公式四:
其中,yi表示第i个字符的目标识别结果,Pi 0表示第i字符的第一概率,Pi表示第i字符的第二概率,表示从字符池中选择概率大于阈值或概率最大的字符作为输出。
可以理解的是,上述公式四只是获取第一识别结果的一种举例,在实际应用中,还可以有其他形式,具体此处不做限定。
本实施例中涉及的神经网络可以如图10所示,该神经网络除了包括前述图8所示的神经网络的各个模块,还包括上述的概率融合模块。其中,图10所示神经网络与图8所示神经网络中相同模块此处不再赘述。与图8所示神经网络不同的是,数据处理设备可以将第一识别结果与第二识别结果输入概率融合模块以得到目标识别结果。
示例性的,以输入数据为前述图6所示的举例,步骤906的过程可以如图11所示。即第一识别结果为“CAFE”,第二识别结果为“GAFE”。两个识别结果中第1个字符的概率相加得到第1个字符为C的概率最大,两个识别结果中第2个字符的概率相加得到第2个字符为A的概率最大,两个识别结果中第3个字符的概率相加得到第3个字符为F的概率最大,两个识别结果中第4个字符的概率相加得到第4个字符为E的概率最大。从而得到的目标识别结果为“CAFE”。
可选地,在做概率相加之前,还可以将第一识别结果与第二识别结果的字符进行对齐,再对概率进行相加。
通过两个识别结果的概率相加,可以减少纠正模块输出第一识别结果的错误率。对于纠正模块来说,本身有多种可能的纠正结果。例如,以caxe为例,假设要纠正第三个字符,有cafe/cake/cage等可能。若可以借鉴视觉模块的输出结果,则可以提升上述的纠正结果。
本实施例中,一方面,通过在对输入数据进行字符识别的过程中,同时考虑到两个模态的特征(即第一模态特征与第二模态特征)。由于不同模态的表现方式不一样,看待事物的角度也会不一样,所以存在一些交叉/互补的现象,甚至模态间可能还存在多种不同的信息交互,通过合理的处理两个模态的特征,可以得到丰富的目标特征,从而可以提高识别精度。且相较于只根据纠正后的第二模态特征确定 识别结果的方法,通过再次引入纠正前的第一模态特征,可以减少第二模态特征的过度纠正问题。另一方面,通过概率残差结构可以将视觉模块输出的原始结果与语言模块(或称为纠正模块、文本模块)输出的纠正结果概率进行相加,实现结合语言模块的纠正能力强以及视觉模块识别能力强的优点,从而提高神经网络对于字符的总体识别能力。
为了直观看出本申请实施例提供的数据处理方法的有益效果,或者理解为本申请实施例提供的神经网络的有益效果。下面对比现有技术在不同数据集上的测试结果进行描述。例如,数据集包括:IIIT、SVT、IC13、SVTP、IC15、CUTE、OOV-ST。
测试结果如表1至表3所示:
表1
其中,先对上述表1中的英文缩写进行解释:概率相加(residual probability,RP),在词典内的数据精度(In Vocabulary,IV),在词典外的数据精度(Out of Vocabulary,OOV),Gap表示IV和OOV的差值,All表示总精度。V+L表示两个模态特征(例如视觉特征与字符特征)融合。
可以看出,模态融合+概率相加的方法(即V+L-√)的总精度大于模态融合但不进行概率相加的方法(即V+L-×),即概率相加可以提升识别字符的总体精度。其中,V+L-√相当于前述图9所示实施例的方法。V+L-×相当于前述图5所示实施例的方法。
表2
其中,先对上述表2中的英文缩写进行解释:regular表示正常文本,irregular表示弯曲文本。Fusion Module表示概率融合模块与纠正模块。Avg表示平均精度。
可以看出本申请实施例提供的神经网络在各个数据集上多个样本的平均精度相较于其他方法较高。
表3
可以看出,V+L-√在各个数据集上多个样本的平均精度大于V+L-×在各个数据集上多个样本的平均精度,即概率相加可以提升识别字符的总体精度。
综上可以看出本申请实施例提供的数据处理方法或神经网络可以提升文本/字符的识别精度。
上面对本申请实施例中的数据处理方法进行了描述,下面对本申请实施例中的数据处理设备进行描述,请参阅图12,本申请实施例中数据处理设备的一个实施例包括:
获取单元1201,用于获取输入数据,该输入图像为图像数据或音频数据;
提取单元1202,用于提取所述输入数据的第一模态特征;
所述获取单元1201,还用于基于所述第一模态特征获取第二模态特征,所述第一模态特征与所述第二模态特征为不同模态的特征;第一模态特征为图像数据的视觉特征或者音频数据的音频特征,第二模态特征为字符特征;
融合单元1203,用于融合所述第一模态特征与所述第二模态特征以得到目标特征;
所述获取单元1201,还用于基于所述目标特征获取所述输入数据的第一识别结果,所述第一识别结果用于指示所述输入数据中含有的字符。
可选地,获取单元1201,还用于基于所述第二识别结果与所述第一识别结果获取所述输入数据的目标识别结果。该目标识别结果作为输入数据中字符的识别结果。或者理解为,将目标识别结果作为输入数据中字符的最终识别结果。
本实施例中,数据处理设备中各单元所执行的操作与前述图1至图11所示实施例中描述的类似,此处不再赘述。
本实施例中,一方面,通过在对输入数据进行字符识别的过程中,同时考虑到两个模态的特征(即第一模态特征与第二模态特征)。由于不同模态的表现方式不一样,看待事物的角度也会不一样,所以存在一些交叉/互补的现象,甚至模态间可能还存在多种不同的信息交互,通过合理的处理两个模态的特征,可以得到丰富的目标特征,从而可以提高识别精度。且相较于只根据纠正后的第二模态特征确定识别结果的方法,通过再次引入纠正前的第一模态特征,可以减少第二模态特征的过度纠正问题。另一方面,获取单元1201将视觉模块输出的原始结果与语言模块(或称为纠正模块、文本模块)输出的纠正结果概率进行相加,实现结合语言模块的纠正能力强以及视觉模块识别能力强的优点,从而提高神经网络对于字符的总体识别能力。
参阅图13,本申请提供的另一种数据处理设备的结构示意图。该数据处理设备可以包括处理器1301、存储器1302和通信端口1303。该处理器1301、存储器1302和通信端口1303通过线路互联。其中,存储器1302中存储有程序指令和数据。
存储器1302中存储了前述图1至图11所示对应的实施方式中,由数据处理设备执行的步骤对应的程序指令以及数据。
处理器1301,用于执行前述图1至图11所示实施例中任一实施例所示的由数据处理设备执行的步骤。
通信端口1303可以用于进行数据的接收和发送,用于执行前述图1至图11所示实施例中任一实施例中与获取、发送、接收相关的步骤。
一种实现方式中,数据处理设备可以包括相对于图13更多或更少的部件,本申请对此仅仅是示例性说明,并不作限定。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物 理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,read-only memory)、随机存取存储器(RAM,random access memory)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (19)

  1. 一种数据处理方法,其特征在于,所述方法包括:
    获取输入数据,所述输入数据为图像数据或音频数据;
    提取所述输入数据的第一模态特征;
    基于所述第一模态特征获取第二模态特征,所述第一模态特征与所述第二模态特征为不同模态的特征;所述第一模态特征为所述图像数据的视觉特征或者所述音频数据的音频特征,所述第二模态特征为字符特征;
    融合所述第一模态特征与所述第二模态特征以得到目标特征;
    基于所述目标特征获取所述输入数据的第一识别结果,所述第一识别结果用于指示所述输入数据中含有的字符。
  2. 根据权利要求1所述的方法,其特征在于,所述基于所述第一模态特征获取第二模态特征,包括:
    基于所述第一模态特征获取第二识别结果,所述第二识别结果为所述图像数据的字符识别结果或所述音频数据的字符识别结果;
    基于所述第二识别结果获取所述第二模态特征。
  3. 根据权利要求2所述的方法,其特征在于,所述提取所述输入数据的第一模态特征,包括:
    将所述输入数据输入第一特征提取模块以得到所述第一模态特征,所述第一特征提取模块用于提取所述视觉特征或所述音频特征;
    所述基于所述第二识别结果获取所述第二模态特征,包括:
    将所述第二识别结果输入第二特征提取模块以得到所述第二模态特征,所述第二特征提取模块用于提取所述字符特征。
  4. 根据权利要求2或3所述的方法,其特征在于,所述方法还包括:
    基于所述第二识别结果与所述第一识别结果获取所述输入数据的目标识别结果,所述目标识别结果作为所述输入数据中字符的识别结果。
  5. 根据权利要求4所述的方法,其特征在于,所述基于所述第二识别结果与所述第一识别结果获取所述输入数据的目标识别结果,包括:
    获取第一概率与第二概率,所述第一概率为所述第一识别结果中各字符的概率,所述第二概率为所述第二识别结果中各字符的概率;
    基于所述第一概率与所述第二概率确定所述目标识别结果。
  6. 根据权利要求5所述的方法,其特征在于,所述基于所述第一概率与所述第二概率确定所述目标识别结果,包括:
    将所述第一识别结果与所述第二识别结果中相同位置字符对应的第一概率与第二概率相加;
    基于相加后的概率确定所述目标识别结果。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述融合所述第一模态特征与所述第二模态特征以得到目标特征,包括:
    将相同位置字符的所述第一模态特征与所述第二模态特征融合以得到所述目标特征。
  8. 根据权利要求1至7中任一项所述的方法,其特征在于,所述基于所述目标特征获取所述输入数据的第一识别结果,包括:
    确定目标特征与多个字符的对应关系;
    获取多个字符的排列方式集合,所述排列方式集合包括多种排列方式;
    基于所述排列方式集合中的每个排列方式对所述每个排列方式下的最后一个字符进行最大似然估计,以得到所述第一识别结果。
  9. 一种数据处理设备,其特征在于,所述数据处理设备包括:
    获取单元,用于获取输入数据,所述输入数据为图像数据或音频数据;
    提取单元,用于提取所述输入数据的第一模态特征;
    所述获取单元,还用于基于所述第一模态特征获取第二模态特征,所述第一模态特征与所述第二模 态特征为不同模态的特征;所述第一模态特征为所述图像数据的视觉特征或者所述音频数据的音频特征,所述第二模态特征为字符特征;
    融合单元,用于融合所述第一模态特征与所述第二模态特征以得到目标特征;
    所述获取单元,还用于基于所述目标特征获取所述输入数据的第一识别结果,所述第一识别结果用于指示所述输入数据中含有的字符。
  10. 根据权利要求9所述的数据处理设备,其特征在于,所述获取单元,具体用于基于所述第一模态特征获取第二识别结果,所述第二识别结果为所述图像数据的字符识别结果或所述音频数据的字符识别结果;
    所述获取单元,具体用于基于所述第二识别结果获取所述第二模态特征。
  11. 根据权利要求10所述的数据处理设备,其特征在于,所述提取单元,具体用于将所述输入数据输入第一特征提取模块以得到所述第一模态特征,所述第一特征提取模块用于提取所述视觉特征或所述音频特征;
    所述获取单元,具体用于将所述第二识别结果输入第二特征提取模块以得到所述第二模态特征,所述第二特征提取模块用于提取所述字符特征。
  12. 根据权利要求10或11所述的数据处理设备,其特征在于,所述获取单元,还用于基于所述第二识别结果与所述第一识别结果获取所述输入数据的目标识别结果,所述目标识别结果作为所述输入数据中字符的识别结果。
  13. 根据权利要求12所述的数据处理设备,其特征在于,所述获取单元,具体用于获取第一概率与第二概率,所述第一概率为所述第一识别结果中各字符的概率,所述第二概率为所述第二识别结果中各字符的概率;
    所述获取单元,具体用于基于所述第一概率与所述第二概率确定所述目标识别结果。
  14. 根据权利要求13所述的数据处理设备,其特征在于,所述获取单元,具体用于将所述第一识别结果与所述第二识别结果中相同位置字符对应的第一概率与第二概率相加;
    所述获取单元,具体用于基于相加后的概率确定所述目标识别结果。
  15. 根据权利要求9至14中任一项所述的数据处理设备,其特征在于,所述融合单元,具体用于将相同位置字符的所述第一模态特征与所述第二模态特征融合以得到所述目标特征。
  16. 根据权利要求9至15中任一项所述的数据处理设备,其特征在于,所述获取单元,具体用于确定目标特征与多个字符的对应关系;
    所述获取单元,具体用于获取多个字符的排列方式集合,所述排列方式集合包括多种排列方式;
    所述获取单元,具体用于基于所述排列方式集合中的每个排列方式对所述每个排列方式下的最后一个字符进行最大似然估计,以得到所述第一识别结果。
  17. 一种数据处理设备,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述数据处理设备执行如权利要求1至8中任一项所述的方法。
  18. 一种计算机存储介质,其特征在于,包括计算机指令,当所述计算机指令在终端设备上运行时,使得所述终端设备执行如权利要求1至8中任一项所述的方法。
  19. 一种计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至8中任一项所述的方法。
PCT/CN2023/119082 2022-10-20 2023-09-15 一种数据处理方法及相关设备 WO2024082891A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211289351.XA CN117917702A (zh) 2022-10-20 2022-10-20 一种数据处理方法及相关设备
CN202211289351.X 2022-10-20

Publications (1)

Publication Number Publication Date
WO2024082891A1 true WO2024082891A1 (zh) 2024-04-25

Family

ID=90729619

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/119082 WO2024082891A1 (zh) 2022-10-20 2023-09-15 一种数据处理方法及相关设备

Country Status (2)

Country Link
CN (1) CN117917702A (zh)
WO (1) WO2024082891A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019074807A (ja) * 2017-10-12 2019-05-16 富士ゼロックス株式会社 情報処理装置及びプログラム
CN111738251A (zh) * 2020-08-26 2020-10-02 北京智源人工智能研究院 一种融合语言模型的光学字符识别方法、装置和电子设备
CN112257426A (zh) * 2020-10-14 2021-01-22 北京一览群智数据科技有限责任公司 一种文字识别方法、系统、训练方法、存储介质及设备
CN112687296A (zh) * 2021-03-10 2021-04-20 北京世纪好未来教育科技有限公司 音频不流利的识别方法、装置、设备及可读存储介质
CN113822340A (zh) * 2021-08-27 2021-12-21 北京工业大学 一种基于注意力机制的图文情感识别方法
CN115116444A (zh) * 2022-05-31 2022-09-27 腾讯科技(深圳)有限公司 一种语音识别文本的处理方法、装置、设备及存储介质
CN116434752A (zh) * 2023-05-11 2023-07-14 京东科技信息技术有限公司 语音识别纠错方法和装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019074807A (ja) * 2017-10-12 2019-05-16 富士ゼロックス株式会社 情報処理装置及びプログラム
CN111738251A (zh) * 2020-08-26 2020-10-02 北京智源人工智能研究院 一种融合语言模型的光学字符识别方法、装置和电子设备
CN112257426A (zh) * 2020-10-14 2021-01-22 北京一览群智数据科技有限责任公司 一种文字识别方法、系统、训练方法、存储介质及设备
CN112687296A (zh) * 2021-03-10 2021-04-20 北京世纪好未来教育科技有限公司 音频不流利的识别方法、装置、设备及可读存储介质
CN113822340A (zh) * 2021-08-27 2021-12-21 北京工业大学 一种基于注意力机制的图文情感识别方法
CN115116444A (zh) * 2022-05-31 2022-09-27 腾讯科技(深圳)有限公司 一种语音识别文本的处理方法、装置、设备及存储介质
CN116434752A (zh) * 2023-05-11 2023-07-14 京东科技信息技术有限公司 语音识别纠错方法和装置

Also Published As

Publication number Publication date
CN117917702A (zh) 2024-04-23

Similar Documents

Publication Publication Date Title
WO2021042828A1 (zh) 神经网络模型压缩的方法、装置、存储介质和芯片
WO2020228376A1 (zh) 文本处理方法、模型训练方法和装置
WO2020238293A1 (zh) 图像分类方法、神经网络的训练方法及装置
US20230082173A1 (en) Data processing method, federated learning training method, and related apparatus and device
WO2019228317A1 (zh) 人脸识别方法、装置及计算机可读介质
WO2019228358A1 (zh) 深度神经网络的训练方法和装置
WO2021022521A1 (zh) 数据处理的方法、训练神经网络模型的方法及设备
US20220375213A1 (en) Processing Apparatus and Method and Storage Medium
CN112395979B (zh) 基于图像的健康状态识别方法、装置、设备及存储介质
WO2021184902A1 (zh) 图像分类方法、装置、及其训练方法、装置、设备、介质
EP4163831A1 (en) Neural network distillation method and device
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN110222718B (zh) 图像处理的方法及装置
WO2021018245A1 (zh) 图像分类方法及装置
WO2021047587A1 (zh) 手势识别方法、电子设备、计算机可读存储介质和芯片
WO2023165361A1 (zh) 一种数据处理方法及相关设备
WO2022179586A1 (zh) 一种模型训练方法及其相关联设备
US20230020965A1 (en) Method and apparatus for updating object recognition model
WO2020192523A1 (zh) 译文质量检测方法、装置、机器翻译系统和存储介质
US20240152770A1 (en) Neural network search method and related device
WO2024067884A1 (zh) 一种数据处理方法及相关装置
US20240046067A1 (en) Data processing method and related device
CN115601820A (zh) 一种人脸伪造图像检测方法、装置、终端及存储介质
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
WO2022063076A1 (zh) 对抗样本的识别方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23878885

Country of ref document: EP

Kind code of ref document: A1