CN113920293A - Information identification method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113920293A
CN113920293A (application CN202111210987.6A)
Authority
CN
China
Prior art keywords
characters
image
line
lines
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111210987.6A
Other languages
Chinese (zh)
Inventor
马龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111210987.6A
Publication of CN113920293A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The disclosure relates to an information identification method and apparatus, an electronic device, and a storage medium. The method includes the following steps: acquiring an image to be recognized; acquiring, for each line of characters in the image to be recognized, the image features, the text content features, and the position features of the text content features; splicing the image features, text content features, and position features of each line of characters to obtain multi-modal features of the multiple lines of characters; and inputting the multi-modal features of the multiple lines of characters into a trained preset model, which outputs a pointer position for each line of characters, where the pointer position of a line represents the output position of that line. In this way, when character recognition is performed on the image to be recognized, the output positions of the multiple lines of characters can be adjusted by jointly considering the image features, text content features, and position features of the text content features of each line, so that the output positions of the multiple lines of characters are more accurate and the obtained character recognition result preserves semantic coherence.

Description

Information identification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to an information identification method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of science and technology, OCR (Optical Character Recognition) technology has also advanced greatly, and users can conveniently recognize characters in an image through related applications. However, for images rich in text or background information, such as posters or some video frames, the results produced by existing techniques often fail to meet users' needs.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an information identification method, apparatus, electronic device, and storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided an information identification method, including:
acquiring an image to be recognized, wherein the image to be recognized comprises a plurality of lines of characters;
respectively acquiring image characteristics, character content characteristics and position characteristics of the character content characteristics in the image to be identified of each line of characters;
splicing the image features, the text content features, and the position features of each line of characters to obtain multi-modal features of the multiple lines of characters;
and inputting the multi-modal characteristics of the plurality of lines of characters into a trained preset model, and outputting pointer positions corresponding to the characters in each line, wherein the pointer positions corresponding to the characters in each line are used for representing the output positions corresponding to the characters in the line.
Optionally, the training process of the preset model includes:
acquiring a training sample image, wherein the training sample image comprises a plurality of lines of characters;
extracting sample features of the training sample image, wherein the sample features comprise image features of a plurality of lines of characters, character content features and position features of the character content features in the corresponding sample image;
inputting the sample features into a preset model to obtain sample pointer positions corresponding to each line of characters of the training sample image; calculating, through a target loss function, a loss value between the sample pointer positions and pre-labeled annotation pointer positions corresponding to each line of characters of the training sample image; and obtaining the trained preset model when the loss value is smaller than a threshold.
Optionally, the method further includes:
sequencing the multiple lines of characters in the image to be recognized based on the pointer position corresponding to each line of characters in the image to be recognized respectively to obtain a sequencing result;
and displaying the lines of characters included by the image to be recognized according to the sorting result.
Optionally, the acquiring the image feature of each line of text includes:
extracting image features of each line of characters through a convolutional neural network, wherein the image features comprise: one or more of character size characteristic, character color characteristic, texture characteristic and background characteristic.
Optionally, the image feature, the text content feature and the position feature are 256-dimensional feature vectors respectively, and the multi-modal feature is a 256 × 3-dimensional feature vector obtained by splicing the image feature, the text content feature and the position feature.
According to a second aspect of the embodiments of the present disclosure, there is provided an information identifying apparatus including:
the image acquisition module is configured to acquire an image to be recognized, and the image to be recognized comprises a plurality of lines of characters;
the characteristic acquisition module is configured to respectively acquire the image characteristic, the text content characteristic and the position characteristic of the text content characteristic of each line of text in the image to be identified;
the characteristic splicing module is configured to splice the image characteristics, the text content characteristics and the position characteristics of each line of texts to obtain multi-modal characteristics of the lines of texts;
and the position acquisition module is configured to input the multi-modal characteristics of the lines of characters into the trained preset model and output the pointer positions corresponding to the lines of characters respectively, wherein the pointer positions corresponding to the lines of characters are used for representing the output positions corresponding to the lines of characters.
Optionally, the system further includes a feature training module, where the feature training module is specifically configured to perform:
acquiring a training sample image, wherein the training sample image comprises a plurality of lines of characters;
extracting sample features of the training sample image, wherein the sample features comprise image features of a plurality of lines of characters, character content features and position features of the character content features in the corresponding sample image;
inputting the sample features into a preset model to obtain sample pointer positions corresponding to each line of characters of the training sample image; calculating, through a target loss function, a loss value between the sample pointer positions and pre-labeled annotation pointer positions corresponding to each line of characters of the training sample image; and obtaining the trained preset model when the loss value is smaller than a threshold.
Optionally, the apparatus further comprises:
the character sorting module is configured to execute sorting of multiple lines of characters in the image to be recognized based on the pointer position corresponding to each line of characters in the image to be recognized respectively, so that a sorting result is obtained;
and the character display module is configured to display the plurality of lines of characters included in the image to be recognized according to the sorting result.
Optionally, the feature obtaining module is specifically configured to perform:
extracting image features of each line of characters through a convolutional neural network, wherein the image features comprise: one or more of character size characteristic, character color characteristic, texture characteristic and background characteristic.
Optionally, the image feature, the text content feature and the position feature are 256-dimensional feature vectors respectively, and the multi-modal feature is a 256 × 3-dimensional feature vector obtained by splicing the image feature, the text content feature and the position feature.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the information identification method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the information identification method of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, the computer program product comprising computer instructions that, when run on an electronic device, enable the electronic device to perform the steps of the information identification method of the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the technical scheme provided by the embodiment of the disclosure, an image to be identified is obtained; respectively acquiring image characteristics, character content characteristics and position characteristics of the character content characteristics of each line of characters in the image to be recognized; splicing the image characteristics, the character content characteristics and the position characteristics of each line of characters to obtain multi-mode characteristics of a plurality of lines of characters; inputting the multi-modal characteristics of a plurality of lines of characters into a trained preset model, and outputting pointer positions corresponding to the characters in each line, wherein the pointer positions corresponding to the characters in each line are used for representing the output positions corresponding to the characters in the line. Therefore, according to the technical scheme provided by the embodiment of the disclosure, when the image to be recognized is subjected to character recognition, the output positions of the multiple lines of characters can be adjusted by simultaneously combining the image characteristics, the character content characteristics and the position characteristics of the character content characteristics in the image to be recognized, so that the output positions of the multiple lines of characters are more accurate, and the obtained character recognition result can keep semantic consistency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow chart illustrating a method of information identification according to an exemplary embodiment;
FIG. 2 is another flow chart illustrating a method of information identification according to an exemplary embodiment;
FIG. 3 is a block diagram illustrating an information recognition apparatus according to an example embodiment;
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an information recognition method according to an exemplary embodiment. The method is used in a terminal and, as shown in Fig. 1, may include the following steps:
in step S110, an image to be recognized is acquired.
Wherein the image to be recognized comprises a plurality of lines of text.
Specifically, OCR technology detects and recognizes characters in an image as text, and in most cases, OCR recognizes characters in units of lines. In practical applications, an image typically includes multiple lines of text. For example, in a short video application scenario, multiple lines of text are often included in a frame of image, with different lines of text representing different meanings.
In step S120, the image feature, the text content feature, and the position feature of the text content feature in the image to be recognized of each line of text are respectively obtained.
Specifically, an image to be recognized usually contains rich information: besides text, there may be background images and the like, and the characters on the image usually appear in fonts of different sizes. Characters of the same size, even when displayed in different lines, often belong together semantically, and characters sharing the same background and texture are often semantically coherent as well. Therefore, when recognizing the multiple lines of characters in an image to be recognized, the image features, text content features, and position features of the text content features of each line need to be acquired, so that the output positions of the multiple lines of characters can be accurately determined in the subsequent steps.
In order to accurately obtain the image features of each line of characters, in one embodiment, the method for obtaining the image features of each line of characters may include the following steps:
extracting image features of each line of characters through a convolutional neural network, wherein the image features comprise: one or more of character size characteristic, character color characteristic, texture characteristic and background characteristic.
In this embodiment, the image features of the text region in the image to be recognized may be extracted by a convolutional neural network (CNN). The image features include one or more of: character size features, character color features, texture features, and background features. Thus the image features of each line of characters are not a single feature but a combination of features such as character size, character color, texture, and background, which makes the obtained image features more accurate.
When obtaining the text content features of each line of characters, the multiple lines of characters in the image to be recognized may be input into a trained BERT model, which recognizes them to produce the text content features of each line. The position features of the text content features in the image to be recognized may be extracted by applying multi-layer cascaded fully-connected (FC) layers to the coordinates of each text line.
In step S130, the image features, the text content features, and the position features of each line of text are spliced to obtain multi-modal features of multiple lines of text.
Specifically, in order to accurately determine the output position of each line of text in the subsequent step, the image features, text content features, and position features of each line of text may be spliced to obtain the multi-modal features of each line of text. Because the multi-modal characteristics of each line of characters are spliced based on the three characteristics of the image characteristics, the character content characteristics and the position characteristics, the output positions of the lines of characters can be more accurately determined according to the multi-modal characteristics of the lines of characters in the subsequent steps.
In one embodiment, the image feature, the text feature and the position feature are 256-dimensional feature vectors, and the multi-modal feature is a 256 × 3-dimensional feature vector obtained by splicing the image feature, the text feature and the position feature.
In this embodiment, the image feature, the text content feature, and the position feature of each line of text can be represented by 256-dimensional feature vectors, and the three 256-dimensional feature vectors are spliced to obtain 256 × 3-dimensional multi-modal feature vectors, that is, the multi-modal feature can be 768-dimensional feature vectors.
As can be seen from the above description, the image features, text content features, and position features are relatively high-dimensional feature vectors, so each of them is more accurate; concatenating them therefore yields multi-modal features of high accuracy. That is, the multi-modal features of each line of characters can accurately represent that line, and the output position of the line can be accurately determined based on them.
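The splicing step above is a plain concatenation of the three per-line feature vectors. The following is a minimal illustrative sketch: random arrays stand in for the real CNN, BERT, and FC outputs, and the function name is an assumption, not from the patent.

```python
import numpy as np

def build_multimodal_features(img_feats, text_feats, pos_feats):
    # Each argument: (num_lines, 256) array of per-line features.
    # Returns: (num_lines, 768) array, i.e. 256 x 3 dimensions per line.
    assert img_feats.shape == text_feats.shape == pos_feats.shape
    return np.concatenate([img_feats, text_feats, pos_feats], axis=-1)

rng = np.random.default_rng(0)
num_lines = 3
img = rng.random((num_lines, 256))   # placeholder for CNN image features
txt = rng.random((num_lines, 256))   # placeholder for BERT text-content features
pos = rng.random((num_lines, 256))   # placeholder for FC position features

mm = build_multimodal_features(img, txt, pos)
print(mm.shape)  # (3, 768)
```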
In step S140, the multi-modal features of the lines of text are input into the trained preset model, and the pointer positions corresponding to the lines of text are output.
And the pointer position corresponding to each line of characters is used for representing the output position corresponding to the line of characters.
In the embodiments provided by the disclosure, the preset model may be a pointer network: the pointer network is trained on a number of labeled samples to obtain the trained preset model. The multi-modal features obtained by splicing the image features, text content features, and position features are then input into the trained preset model to obtain the pointer position corresponding to each line of characters, where the pointer position of each line represents the output position of that line.
Illustratively, the image to be recognized includes 3 lines of characters, and the pointer positions of the output result corresponding to each line of characters are 2, 1, and 1, respectively, which represent the output line numbers corresponding to the 3 lines of characters, that is, the output position of the original first line of characters in the image to be recognized is the second line, and the output positions of the original second line of characters and the original third line of characters are both the first line. Of course, this is only a simple example, and the purpose of the embodiment of the present disclosure is to output characters with the same character size, color, and the like on the same line, and by adjusting the output position of outputting multiple lines of characters, the output characters can be semantically consistent.
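A single decoding step of a pointer network can be sketched with additive attention in plain NumPy. This is a heavily simplified illustration under assumed toy dimensions and random weights, not the patent's actual network: the decoder state is scored against every encoded text line, and the argmax of the resulting distribution is the line the pointer selects.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_step(encoded_lines, dec_state, W1, W2, v):
    # One decoding step: score every encoded text line against the
    # decoder state with additive attention; the argmax is the line
    # the pointer selects for the current output position.
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state)
                       for e in encoded_lines])
    probs = softmax(scores)
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(1)
d = 8                                  # toy feature dimension (assumed)
lines = rng.standard_normal((3, d))    # 3 encoded text lines
state = rng.standard_normal(d)         # decoder state
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
v = rng.standard_normal(d)

idx, probs = pointer_step(lines, state, W1, W2, v)
print(idx)  # index of the selected line; probs sums to 1
```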
According to the technical scheme provided by the embodiments of the disclosure, an image to be recognized is acquired; the image features, text content features, and position features of the text content features of each line of characters in the image to be recognized are respectively acquired; the image features, text content features, and position features of each line of characters are spliced to obtain multi-modal features of the multiple lines of characters; and the multi-modal features of the multiple lines of characters are input into a trained preset model, which outputs a pointer position for each line of characters, where the pointer position of a line represents the output position of that line. Therefore, when character recognition is performed on the image to be recognized, the output positions of the multiple lines of characters can be adjusted by simultaneously combining the image features, text content features, and position features of the text content features of each line, so that the output positions of the multiple lines of characters are more accurate and the obtained character recognition result preserves semantic coherence.
In combination with the above embodiment, in a further embodiment provided by the present disclosure, as shown in fig. 2, the training process of the preset model may include the following steps:
step S210, a training sample image is obtained.
Wherein the training sample image contains a plurality of lines of text.
Specifically, in order to enable the trained model to accurately output the positions corresponding to multiple lines of characters in an image, a large number of images containing multiple lines of characters need to be obtained as training samples, and the sample images should contain varied backgrounds and textures as well as text lines with characters of different sizes.
Moreover, after the training sample images are acquired, the output position of each line of characters can be labeled in advance. Specifically, the lines of characters included in each training sample image are known, so their output positions can be labeled such that the lines, when output at the labeled positions, read with semantic continuity.
Step S220, extracting sample features of the training sample image.
The sample features comprise image features of multiple lines of characters of the training sample image, character content features and position features of the character content features in the corresponding sample image.
Specifically, after the training sample image is obtained, its sample features may be extracted, and three network models may be used for this, one per feature type: the trained BERT model encodes the recognition result of each text line into a 256-dimensional feature vector; the trained convolutional neural network (CNN) extracts the image features of the text region as a 256-dimensional feature vector; and multi-layer cascaded fully-connected (FC) layers encode the text-line coordinates into a 256-dimensional feature vector.
Step S230, inputting the sample characteristics into a preset model, obtaining sample pointer positions corresponding to each line of text of the training sample image, calculating a loss value between the sample pointer position and a labeled pointer position corresponding to each line of text of the pre-labeled training sample image through a target loss function, and obtaining the trained preset model when the loss value is smaller than a threshold value.
Specifically, after the sample features of the training sample image are obtained, they may be input into the preset model for training, and the model outputs a sample pointer position for each line of characters. Because the annotation pointer position of each line is labeled in advance and represents the output position of that line when the semantics are coherent, the annotation pointer position serves as the ground truth. During training, the target loss function computes the loss between the sample pointer positions and the pre-labeled annotation pointer positions; when the loss falls below the threshold, the sample pointer positions output by the model are close to the pre-labeled positions, meaning the model is sufficiently accurate, and the trained preset model is obtained.
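The patent does not specify which target loss function is used; a mean cross-entropy between each line's predicted pointer distribution and its pre-labeled pointer position is one plausible choice and can be sketched as follows (function name and toy numbers are assumptions):

```python
import numpy as np

def pointer_loss(pred_probs, gold_positions):
    # Mean cross-entropy between each line's predicted pointer
    # distribution and its pre-labeled (ground-truth) pointer position.
    eps = 1e-9
    return -float(np.mean([np.log(p[g] + eps)
                           for p, g in zip(pred_probs, gold_positions)]))

# Toy example: 3 lines, each with a distribution over 3 output positions.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.2, 0.2, 0.6]])
gold = [1, 0, 2]          # pre-labeled pointer positions (ground truth)
loss = pointer_loss(probs, gold)
threshold = 0.5
print(loss < threshold)   # True: the loss is about 0.36, below the threshold
```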
From the above description, when the preset model is trained, the sample features of the training sample image are also multi-modal features with relatively high dimensionality, that is, the multi-modal features are more accurate. Moreover, the sample pointer positions output by the trained preset model are close to the pre-labeled annotation pointer positions, that is, the trained preset model has high accuracy. Consequently, after an image to be recognized is input into the trained preset model, the output positions of its multiple lines of characters are highly accurate, and the obtained character recognition result preserves semantic coherence.
With reference to the foregoing embodiment, in a further embodiment provided by the present disclosure, the information identification method may further include the following steps:
step a1, based on the pointer position corresponding to each line of characters in the image to be recognized, sorting the lines of characters in the image to be recognized to obtain a sorting result.
Step a2, displaying a plurality of lines of characters included in the image to be recognized according to the sorting result.
Specifically, since the pointer position corresponding to each line of characters represents the output position of that line, once the pointer position of each line is obtained, the multiple lines of characters can be sorted according to those pointer positions to obtain a sorting result, and the lines of characters are then displayed according to that result.
For example, the image to be recognized includes 3 lines of characters, and the pointer positions output for each line are 3, 1, and 2, respectively, representing the output positions of the 3 lines: the output position of the original first line in the image to be recognized is the third line, that of the original second line is the first line, and that of the original third line is the second line. The three lines are sorted by pointer position and then displayed, i.e., the second line, the third line, and the first line of the original image are displayed in sequence.
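The reordering in the example above amounts to a plain sort on the pointer positions. A minimal sketch (function name assumed):

```python
def reorder_lines(lines, pointer_positions):
    # Sort recognized text lines by their predicted output position.
    return [line for _, line in sorted(zip(pointer_positions, lines))]

lines = ["first", "second", "third"]    # original OCR line order
positions = [3, 1, 2]                   # pointer positions from the model
print(reorder_lines(lines, positions))  # ['second', 'third', 'first']
```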
Therefore, by the technical scheme provided by the embodiment of the disclosure, the output positions of the multiple lines of characters can be adjusted, so that the output positions of the multiple lines of characters are more accurate, and the semantic consistency of the obtained character recognition result can be kept.
Fig. 3 is a block diagram illustrating an information recognition apparatus according to an exemplary embodiment. Referring to Fig. 3, the apparatus includes an image acquisition module 310 configured to acquire an image to be recognized, where the image to be recognized includes a plurality of lines of characters;
the feature obtaining module 320 is configured to perform obtaining of an image feature, a text content feature, and a position feature of the text content feature in the image to be recognized of each line of text, respectively;
the feature splicing module 330 is configured to splice the image features, the text content features and the position features of each line of text to obtain multi-modal features of the lines of text;
the position obtaining module 340 is configured to input the multi-modal features of the lines of characters into the trained preset model, and output pointer positions corresponding to the lines of characters, where the pointer positions corresponding to the lines of characters are used to represent output positions corresponding to the lines of characters.
Optionally, the system further includes a feature training module, where the feature training module is specifically configured to perform:
acquiring a training sample image, wherein the training sample image comprises a plurality of lines of characters;
extracting sample features of the training sample image, wherein the sample features comprise image features of a plurality of lines of characters, character content features and position features of the character content features in the corresponding sample image;
inputting the sample features into a preset model to obtain a sample pointer position corresponding to each line of characters of the training sample image; calculating, through a target loss function, a loss value between the sample pointer positions and the pre-labeled pointer positions corresponding to each line of characters of the training sample image; and obtaining the trained preset model when the loss value is smaller than a threshold.
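The loss computation described above can be sketched numerically. This is a hypothetical illustration: the patent does not specify the target loss function, so a per-line cross-entropy over candidate output positions is assumed here, with the training stop condition being the loss falling below a threshold.

```python
import numpy as np

# Assumed loss for pointer-position prediction: the model scores each
# candidate output position per line; cross-entropy against the labeled
# pointer position is averaged over lines.
def pointer_loss(logits, labeled_positions):
    # logits: (num_lines, num_positions); labeled_positions: 0-based indices
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(labeled_positions)), labeled_positions]
    return float(-picked.mean())

logits = np.array([[0.1, 0.2, 3.0],   # line 1 -> labeled position 3
                   [4.0, 0.3, 0.1],   # line 2 -> labeled position 1
                   [0.2, 5.0, 0.4]])  # line 3 -> labeled position 2
loss = pointer_loss(logits, np.array([2, 0, 1]))
threshold = 0.5
print(loss < threshold)  # training would stop once the loss drops below the threshold
```

With confident logits like these the loss is small, so the stop condition is met; early in training the logits would be diffuse and the loss would exceed the threshold.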
Optionally, the apparatus further comprises:
a character sorting module configured to sort the multiple lines of characters in the image to be recognized based on the pointer position corresponding to each line of characters, thereby obtaining a sorting result;
and a character display module configured to display the plurality of lines of characters included in the image to be recognized according to the sorting result.
Optionally, the feature obtaining module is specifically configured to perform:
extracting the image features of each line of characters through a convolutional neural network, where the image features include one or more of: a character size feature, a character color feature, a texture feature, and a background feature.
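A toy sketch of this extraction step (not the patent's actual network, which is unspecified): a single convolution with ReLU and global average pooling standing in for the CNN that maps a cropped text line to a fixed-length image-feature vector.

```python
import numpy as np

# Toy stand-in for the convolutional feature extractor: K kernels,
# ReLU, and global average pooling yield a K-dimensional vector per
# cropped line of text. A real model would stack many such layers.
def conv_features(line_crop, kernels):
    # line_crop: (H, W) grayscale crop; kernels: (K, kh, kw)
    K, kh, kw = kernels.shape
    H, W = line_crop.shape
    feats = np.empty(K)
    for k in range(K):
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(line_crop[i:i+kh, j:j+kw] * kernels[k])
        feats[k] = np.maximum(out, 0).mean()  # ReLU + global average pool
    return feats

rng = np.random.default_rng(0)
crop = rng.random((8, 32))              # one cropped line of text
kernels = rng.standard_normal((4, 3, 3))
print(conv_features(crop, kernels).shape)  # (4,)
```

In the patent's setting the output would be a 256-dimensional vector per line rather than the 4 dimensions used here for brevity.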
Optionally, the image feature, the character content feature and the position feature are each 256-dimensional feature vectors, and the multi-modal feature is a 256 × 3-dimensional (i.e., 768-dimensional) feature vector obtained by splicing the image feature, the character content feature and the position feature.
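The splicing step above amounts to concatenating the three per-line vectors (the placeholder values here are illustrative only):

```python
import numpy as np

# Per line of text: three 256-dim feature vectors are spliced into one
# 256 * 3 = 768-dim multi-modal feature vector.
image_feat = np.zeros(256)        # image feature of the line
text_feat = np.ones(256)          # character content feature
pos_feat = np.full(256, 0.5)      # position feature of the content feature
multimodal = np.concatenate([image_feat, text_feat, pos_feat])
print(multimodal.shape)  # (768,)
```

Stacking one such vector per line then gives the model input of shape (num_lines, 768).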
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the technical scheme provided by the embodiment of the disclosure, an image to be recognized is acquired; for each line of characters in the image, an image feature, a character content feature, and a position feature of the character content feature are acquired; the image features, character content features, and position features of each line of characters are spliced to obtain multi-modal features of the plurality of lines; and the multi-modal features of the plurality of lines are input into a trained preset model, which outputs a pointer position for each line, where the pointer position represents that line's output position. Therefore, when performing character recognition on the image to be recognized, the output positions of the multiple lines of characters can be adjusted by jointly considering the image features, the character content features, and the positions of the character content features in the image, so that the output positions are more accurate and the obtained character recognition result remains semantically coherent.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the information identification method of the first aspect.
Fig. 4 is a block diagram illustrating an apparatus 800 for information recognition according to an example embodiment. For example, the apparatus 800 may be an electronic device such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, or a personal digital assistant.
Referring to fig. 4, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the apparatus 800. For example, the sensor assembly 814 may detect the open/closed state of the apparatus 800 and the relative positioning of components, such as a display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the method described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the information identification method of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, the computer program product comprising computer instructions that, when run on an electronic device, enable the electronic device to perform the steps of the information identification method of the first aspect.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the disclosure are generated, in whole or in part, when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber, DSL (Digital Subscriber Line)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD (Digital Versatile Disk)), or a semiconductor medium (e.g., an SSD (Solid State Disk)), etc.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An information identification method, comprising:
acquiring an image to be recognized, wherein the image to be recognized comprises a plurality of lines of characters;
respectively acquiring image characteristics, character content characteristics and position characteristics of the character content characteristics in the image to be identified of each line of characters;
splicing the image features, the character content features and the position features of each line of characters to obtain multi-modal features of the lines of characters;
and inputting the multi-modal characteristics of the plurality of lines of characters into a trained preset model, and outputting pointer positions corresponding to the characters in each line, wherein the pointer positions corresponding to the characters in each line are used for representing the output positions corresponding to the characters in the line.
2. The method of claim 1, wherein the training process of the preset model comprises:
acquiring a training sample image, wherein the training sample image comprises a plurality of lines of characters;
extracting sample features of the training sample image, wherein the sample features comprise image features of a plurality of lines of characters, character content features and position features of the character content features in the corresponding sample image;
inputting the sample characteristics into a preset model to obtain sample pointer positions corresponding to each line of characters of the training sample image, calculating a loss value between the sample pointer positions and labeling pointer positions corresponding to each line of characters of the training sample image labeled in advance through a target loss function, and obtaining the trained preset model when the loss value is smaller than a threshold value.
3. The method of claim 1, further comprising:
sequencing the multiple lines of characters in the image to be recognized based on the pointer position corresponding to each line of characters in the image to be recognized respectively to obtain a sequencing result;
and displaying the lines of characters included by the image to be recognized according to the sorting result.
4. The method according to any one of claims 1 to 3, wherein the obtaining of the image feature of each line of characters comprises:
extracting image features of each line of characters through a convolutional neural network, wherein the image features comprise: one or more of character size characteristic, character color characteristic, texture characteristic and background characteristic.
5. The method according to any one of claims 1 to 3, wherein the image feature, the text feature and the position feature are 256-dimensional feature vectors, respectively, and the multi-modal feature is a 256 x 3-dimensional feature vector obtained by concatenating the image feature, the text feature and the position feature.
6. An information identifying apparatus, comprising:
the image acquisition module is configured to acquire an image to be recognized, and the image to be recognized comprises a plurality of lines of characters;
the characteristic acquisition module is configured to respectively acquire the image characteristic, the text content characteristic and the position characteristic of the text content characteristic of each line of text in the image to be identified;
the characteristic splicing module is configured to splice the image characteristics, the text content characteristics and the position characteristics of each line of texts to obtain multi-modal characteristics of the lines of texts;
and the position acquisition module is configured to input the multi-modal characteristics of the lines of characters into the trained preset model and output the pointer positions corresponding to the lines of characters respectively, wherein the pointer positions corresponding to the lines of characters are used for representing the output positions corresponding to the lines of characters.
7. The apparatus of claim 6, further comprising a feature training module, the feature training module specifically configured to perform:
acquiring a training sample image, wherein the training sample image comprises a plurality of lines of characters;
extracting sample features of the training sample image, wherein the sample features comprise image features of a plurality of lines of characters, character content features and position features of the character content features in the corresponding sample image;
inputting the sample characteristics into a preset model to obtain sample pointer positions corresponding to each line of characters of the training sample image, calculating a loss value between the sample pointer positions and labeling pointer positions corresponding to each line of characters of the training sample image labeled in advance through a target loss function, and obtaining the trained preset model when the loss value is smaller than a threshold value.
8. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the information identification method of any one of claims 1-5.
9. A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the information identification method of any one of claims 1-5.
10. A computer program product, characterized in that it comprises computer instructions which, when run on an electronic device, cause the electronic device to carry out the steps of the information identification method according to any one of claims 1 to 5.
CN202111210987.6A 2021-10-18 2021-10-18 Information identification method and device, electronic equipment and storage medium Pending CN113920293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111210987.6A CN113920293A (en) 2021-10-18 2021-10-18 Information identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111210987.6A CN113920293A (en) 2021-10-18 2021-10-18 Information identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113920293A 2022-01-11

Family

ID=79241698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111210987.6A Pending CN113920293A (en) 2021-10-18 2021-10-18 Information identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113920293A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202647A (en) * 2022-02-16 2022-03-18 阿里巴巴达摩院(杭州)科技有限公司 Method, device and equipment for recognizing text in image and storage medium
CN114202647B (en) * 2022-02-16 2022-07-05 阿里巴巴达摩院(杭州)科技有限公司 Method, device and equipment for recognizing text in image and storage medium
CN114239760A (en) * 2022-02-25 2022-03-25 苏州浪潮智能科技有限公司 Multi-modal model training and image recognition method and device, and electronic equipment
WO2023159945A1 (en) * 2022-02-25 2023-08-31 苏州浪潮智能科技有限公司 Multi-modal model training method and apparatus, image recognition method and apparatus, and electronic device

Similar Documents

Publication Publication Date Title
CN108038102B (en) Method and device for recommending expression image, terminal and storage medium
CN111539443A (en) Image recognition model training method and device and storage medium
CN110781813B (en) Image recognition method and device, electronic equipment and storage medium
CN111461304B (en) Training method of classified neural network, text classification method, device and equipment
CN110781323A (en) Method and device for determining label of multimedia resource, electronic equipment and storage medium
CN107229403B (en) Information content selection method and device
US11335348B2 (en) Input method, device, apparatus, and storage medium
CN109886211B (en) Data labeling method and device, electronic equipment and storage medium
CN110764627B (en) Input method and device and electronic equipment
CN113920293A (en) Information identification method and device, electronic equipment and storage medium
CN111797262A (en) Poetry generation method and device, electronic equipment and storage medium
CN111046927A (en) Method and device for processing labeled data, electronic equipment and storage medium
CN107943317B (en) Input method and device
CN112381091A (en) Video content identification method and device, electronic equipment and storage medium
CN113312967A (en) Detection method, device and device for detection
CN111079421B (en) Text information word segmentation processing method, device, terminal and storage medium
CN110738267B (en) Image classification method, device, electronic equipment and storage medium
CN110213062B (en) Method and device for processing message
CN112328809A (en) Entity classification method, device and computer readable storage medium
CN112035691A (en) Method, device, equipment and medium for displaying cell labeling data of slice image
CN112035651A (en) Sentence completion method and device and computer-readable storage medium
CN114466204B (en) Video bullet screen display method and device, electronic equipment and storage medium
CN112036247A (en) Expression package character generation method and device and storage medium
CN111428806B (en) Image tag determining method and device, electronic equipment and storage medium
CN115484471B (en) Method and device for recommending anchor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination