Detailed Description
The embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application.
If media data of an object (such as a user) needs to be collected in the application, a prompt interface or a popup window is displayed before and during collection, the prompt interface or the popup window being used to prompt the user that XXXX data is currently being collected; the related data acquisition step is started only after a confirmation operation from the user on the prompt interface or the popup window is obtained, and otherwise the step is ended. Moreover, the acquired user media data is used only in reasonable and legal scenarios and usages. Optionally, in some scenarios where user media data is required but has not been authorized by the user, authorization may be requested from the user, and the user media data is reused only after the authorization passes.
The application relates to deep learning technology in the field of artificial intelligence, which is used to realize prediction of position information of a first type object and a second type object, training of a position recognition model, and the like.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level technologies and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, automatic driving, and intelligent traffic.
Machine Learning (ML) is a discipline that studies how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continually improve their own performance.
Deep Learning (DL) is a new research direction in the field of Machine Learning (ML); it was introduced into machine learning to bring machine learning closer to its original goal: Artificial Intelligence (AI). Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning is helpful in interpreting data such as text, images, and sounds. Its final goal is to give machines analytical learning capabilities like a person, so that they can recognize text, image, and sound data. Deep learning is a complex machine learning algorithm that achieves results in speech and image recognition far exceeding previous techniques. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization techniques, and other related fields. Deep learning enables machines to imitate human activities such as audio-visual perception and thinking, solves many complex pattern recognition problems, and has greatly advanced the related technologies of artificial intelligence.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an embodiment of the application. As shown in fig. 1, the system may include a computer device 100 and a terminal cluster, and the terminal cluster may include a terminal device 200a, a terminal device 200b, a terminal device 200c, ..., and a terminal device 200n. It should be understood that the above system may include one or more terminal devices, and the present application does not limit the number of terminal devices. The above-mentioned terminal device may be an electronic device, including but not limited to a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a vehicle-mounted device, an augmented reality/virtual reality (AR/VR) device, a head-mounted display, a smart television, a wearable device, a smart speaker, a digital camera, a camera, or another mobile internet device (MID) with network access capability, or a terminal device in a train, a ship, an aircraft, or the like.
The computer device mentioned in the present application may be a server or a terminal device, or may be a system composed of a server and a terminal device.
Communication connections may exist within the terminal cluster; for example, a communication connection exists between the terminal device 200a and the terminal device 200b, and a communication connection exists between the terminal device 200a and the terminal device 200c. Meanwhile, any terminal device in the terminal cluster may have a communication connection with the computer device 100; for example, a communication connection exists between the terminal device 200a and the computer device 100. The connection manner of the communication connection is not limited: it may be a direct or indirect connection through wired communication, a direct or indirect connection through wireless communication, or another manner, which is not limited in the application.
It should be understood that the function of identifying object media data may be integrated in any one of the terminal devices in the terminal cluster shown in fig. 1, or may be integrated in the computer device.
The object media data refers to data in which multiple object types are contained in the same page, and may specifically be data related to a user, for example, invoice information of the user (e.g., taxi invoice information), medical bill data, an information storage address of the user, and the like. The computer device may acquire, from any one of the terminal devices or from the computer device itself, object media data to be separated and separate the position information in the object media data; or it may acquire, from any one of the terminal devices or from the computer device itself, sample object media data for model training, and perform model training based on the acquired sample object media data to obtain a position recognition model.
For ease of subsequent understanding and description, refer to fig. 2, which is a schematic diagram of a scenario for separating position information according to an embodiment of the present application. In fig. 2, a computer device 300 may obtain the media data feature corresponding to object media data, perform feature extraction on the media data feature using a first channel attention parameter to obtain a first type object feature corresponding to the media data feature, perform feature extraction on the media data feature using a second channel attention parameter to obtain a second type object feature corresponding to the media data feature, perform feature recognition on the first type object feature to obtain position information of the first type object in the object media data, and perform feature recognition on the second type object feature to obtain position information of the second type object in the object media data. Specifically, the position recognition model may be obtained through model training. The computer device 300 may obtain initial media data and sample media data, and obtain initial media features of the initial media data and initial sample features of the sample media data. The computer device 300 may perform convolution processing and upsampling processing on the initial media features in the initial position recognition model to obtain the media data feature. The computer device 300 may perform feature extraction on the media data feature using the first channel attention parameter to obtain the first type object feature corresponding to the media data feature, and perform feature extraction on the media data feature using the second channel attention parameter to obtain the second type object feature corresponding to the media data feature. The computer device 300 may perform feature preprocessing on the first type object feature to obtain a first probability distribution map and a first threshold map corresponding to the first type object in the object media data, perform binarization processing on the first probability distribution map and the first threshold map to obtain a first approximate binary map corresponding to the first type object in the object media data, and determine the position information of the first type object in the object media data according to the first approximate binary map. Similarly, the computer device 300 may perform feature preprocessing on the second type object feature to obtain a second probability distribution map and a second threshold map corresponding to the second type object in the object media data, perform binarization processing on the second probability distribution map and the second threshold map to obtain a second approximate binary map corresponding to the second type object in the object media data, and determine the position information of the second type object in the object media data according to the second approximate binary map.
It may be understood that the method provided by the embodiment of the present application may be performed by a computer device. The computer device may be a server; the server may be an independent physical server, a server cluster or a distributed system formed by multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud databases, cloud services, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The terminal device includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, and the like. The terminal device and the computer device may be directly or indirectly connected through a wired or wireless manner, which is not limited in the embodiment of the present application.
It is to be understood that the system architecture described above may be applicable to a text detection system, a medical system, and scenarios for identifying and separating multiple types of objects (for example, a scenario of identifying and separating needle-printed characters from template-printed characters); specific business scenarios are not listed here.
Further, referring to fig. 3a, fig. 3a is a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 3a, the data processing method may at least include the following steps S101 to S104.
Step S101, the media data feature corresponding to the object media data is acquired.
In particular, the object media data may refer to data that includes multiple object types in the same page, for example, picture data such as invoice information of the user, medical bill data, an acquisition form, or an information storage address of the user. The object media data may include a first type object and a second type object. Taking invoice information or medical bill data as an example, the first type object may optionally be a template-printed character and the second type object may be a needle-printed character; taking an acquisition form (such as a questionnaire) as an example, the first type object may optionally be a printed character and the second type object may be a handwritten character. The media data feature corresponding to the object media data can be obtained by performing feature extraction processing on the object media data.
It should be appreciated that the object media data may be obtained after direction adjustment, or may be used directly. Specifically, initial media data may be acquired and determined directly as the object media data; or semantic analysis may be performed on the initial media data to obtain content direction information in the initial media data, and direction adjustment may be performed on the initial media data based on the content direction information to obtain the object media data.
Optionally, in one possible manner of acquiring the content direction information, convolution and activation processing may be performed on the initial media data to obtain a preliminary activation feature corresponding to the initial media data. Feature conversion processing is performed on the preliminary activation feature based on an hourglass layer to obtain a media adjustment feature. The media adjustment feature may be used to represent preliminary detection information of the content (such as characters) in the initial media data, that is, preliminarily extracted features related to the position and direction of the characters in the initial media data. Pooling processing is performed on the media adjustment feature to obtain a pooled feature, and full-connection prediction processing is performed on the pooled feature to obtain the content direction information in the initial media data. That is, the media adjustment feature is further extracted and converted to obtain the content direction information, which represents the direction of the content in the initial media data, such as its deviation from the standard content direction. For example, if the content direction information of medical bill data is 0 degrees, the direction of the content in the medical bill data is the same as the standard content direction; if the content direction information is 90 degrees clockwise, the angle between the direction of the content and the standard content direction is 90 degrees clockwise; if the content direction information is 180 degrees clockwise, the angle is 180 degrees clockwise; and if the content direction information is 270 degrees clockwise, the angle is 270 degrees clockwise. Direction adjustment is then performed on the initial media data based on the content direction information to obtain the object media data.
For example, when the angle between the bill character direction of a piece of medical bill data and the standard character direction is 90 degrees clockwise, that is, the content direction information is 90 degrees clockwise, the direction of the initial media data can be adjusted according to the content direction information, that is, the initial media data is rotated 90 degrees counterclockwise, so as to obtain object media data whose direction is the same as the standard content direction (that is, whose content direction information is 0 degrees). By adjusting the direction of the initial media data, object separation can be performed on object media data at the same angle, and the direction generalization processing of the features in the object media data is reduced, thereby improving the efficiency of object separation and, to a certain extent, the accuracy of object separation.
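As an illustration only, the following is a minimal PyTorch-style sketch of the direction adjustment described above, assuming the content direction is predicted as one of four classes (0, 90, 180, and 270 degrees clockwise). The module and function names, channel sizes, and class-to-angle mapping are assumptions, and the hourglass layer is reduced to a placeholder; this is a sketch, not the definitive implementation of the embodiment.

```python
import torch
import torch.nn as nn

class DirectionClassifier(nn.Module):
    def __init__(self, in_channels=3, hidden=32):
        super().__init__()
        # convolution + activation to obtain the preliminary activation feature
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, stride=2, padding=1),
            nn.ReLU(inplace=True))
        # simplified stand-in for the hourglass layer producing the media adjustment feature
        self.hourglass = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True))
        self.pool = nn.AdaptiveAvgPool2d(1)   # pooling processing
        self.fc = nn.Linear(hidden, 4)        # full-connection prediction: 0/90/180/270 degrees

    def forward(self, x):
        feat = self.hourglass(self.stem(x))
        return self.fc(self.pool(feat).flatten(1))

def adjust_direction(image, direction_class):
    # direction_class (assumed encoding): 0 -> 0 deg, 1 -> 90 deg clockwise,
    # 2 -> 180 deg, 3 -> 270 deg clockwise; rotating counterclockwise by the same
    # amount restores the standard content direction.
    return torch.rot90(image, k=direction_class, dims=(-2, -1))

# usage: predict the content direction information and rotate back to the standard direction
model = DirectionClassifier()
initial_media_data = torch.randn(1, 3, 256, 256)
pred = model(initial_media_data).argmax(dim=1).item()
object_media_data = adjust_direction(initial_media_data, pred)
```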
Alternatively, the hourglass layer may include a channel convolution layer, a depth convolution layer, a normalization layer, and an activation layer. The preliminary activation feature is input into the channel convolution layer and subjected to at least two channel convolution processes according to the channel magnitude to obtain a first object feature. The first object feature is input into the depth convolution layer and subjected to depth convolution processing to obtain a second object feature. The second object feature is input into the normalization layer and subjected to normalization processing to obtain a third object feature. The third object feature is input into the activation layer and subjected to activation processing to obtain the media adjustment feature. Alternatively, the hourglass layer may also have other structures, such as a convolution layer and an activation layer, where the preliminary activation feature is input into the convolution layer for convolution processing to obtain an object convolution feature, and the object convolution feature is input into the activation layer for feature adjustment to obtain the media adjustment feature, which is not limited herein.
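For illustration, the following is a minimal sketch of the hourglass layer variant just described (channel convolutions, a depthwise convolution, normalization, and activation), assuming PyTorch-style layers. The channel counts and the shrink-then-restore layout of the two channel (1x1) convolutions are assumptions.

```python
import torch.nn as nn

class HourglassLayer(nn.Module):
    def __init__(self, channels=32, squeeze=8):
        super().__init__()
        # at least two channel (pointwise 1x1) convolutions; the shrink/restore split is an assumption
        self.channel_conv = nn.Sequential(
            nn.Conv2d(channels, squeeze, 1),
            nn.Conv2d(squeeze, channels, 1))
        # depthwise convolution -> second object feature
        self.depth_conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.norm = nn.BatchNorm2d(channels)   # normalization -> third object feature
        self.act = nn.ReLU(inplace=True)       # activation -> media adjustment feature

    def forward(self, preliminary_activation_feature):
        first = self.channel_conv(preliminary_activation_feature)   # first object feature
        second = self.depth_conv(first)
        third = self.norm(second)
        return self.act(third)
```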
Referring to fig. 3b, fig. 3b is a schematic structural diagram for acquiring object media data according to an embodiment of the present application. As shown in fig. 3b, convolution and activation processing are performed on the initial media data to obtain the preliminary activation feature. The preliminary activation feature is input into the hourglass layer and subjected to feature conversion processing to obtain the media adjustment feature. Alternatively, the hourglass layer may comprise any one or more of a channel convolution layer, a depth convolution layer, a normalization layer, an activation layer, and the like. In fig. 3b, the preliminary activation feature is input into the channel convolution layer and subjected to channel convolution processing to obtain a second object feature; the second object feature is input into the normalization layer and subjected to normalization processing to obtain a third object feature; and the third object feature is input into the activation layer and subjected to activation processing to obtain the media adjustment feature.
It should be understood that pooling processing is performed on the media adjustment feature obtained through the hourglass layer to obtain the pooled feature, full-connection prediction processing is performed on the pooled feature to obtain the content direction information in the initial media data, and direction adjustment is performed on the initial media data based on the content direction information to obtain the object media data.
The content direction information can facilitate, when scanning for characters in the medical bill data, confirming the relationship between the direction of the content in the initial media data and the standard content direction; for example, in medical bill data, it may be regarded as confirming the relationship between the character direction in the medical bill data and the standard character direction (that is, the standard content direction). By performing direction adjustment on the initial media data according to the content direction information, the direction of the whole medical bill data can be adjusted to obtain the object media data.
Further, the object media data may be obtained, and feature extraction processing may be performed on the object media data to obtain initial media features of the object media data. The initial media features are a feature mapping result of the object media data; that is, the object media data is mapped into a feature map to obtain features that can represent semantic information of the object media data (such as color value information, texture information, and content meaning). In other words, the initial media features are a feature mapping of the object media data: for example, if the object media data is an image, the image is mapped into features after feature extraction processing, and the initial media features are the resulting feature map of the object media data. Feature conversion is then performed on the initial media features to obtain the media data feature corresponding to the object media data.
Specifically, multi-layer convolution is performed on the initial media features to obtain N convolution media features, where N is a positive integer; up-sampling feature fusion processing is performed on the N convolution media features to obtain convolution fusion features; and feature fusion processing is performed on the N convolution media features and the convolution fusion features to obtain the media data feature corresponding to the object media data.
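As a hedged illustration, the following sketch shows one way the multi-layer convolution could produce N convolution media features at progressively smaller sizes; the channel counts, strides, and layer layout are assumptions. The fusion of these features is sketched separately after the detailed discussion of fig. 4 below.

```python
import torch
import torch.nn as nn

class MultiLayerConv(nn.Module):
    def __init__(self, in_channels=64, out_channels=64, n_layers=4):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels if i == 0 else out_channels, out_channels,
                          3, stride=2, padding=1),
                nn.ReLU(inplace=True))
            for i in range(n_layers)])

    def forward(self, initial_media_features):
        feats, x = [], initial_media_features
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # N convolution media features, ordered fine -> coarse
        return feats
```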
Referring to fig. 4, fig. 4 is a schematic structural diagram for separating position information according to an embodiment of the application. As shown in fig. 4, the initial media features of the object media data 400 are acquired, and N convolution kernels may be used to convolve the initial media features of the object media data 400 to obtain N convolution media features, such as a convolution media feature 401, a convolution media feature 401a, a convolution media feature 401b, and a convolution media feature 401c. Up-sampling processing may be performed on a convolution media feature, and feature fusion may be performed on the up-sampled convolution media feature and an adjacent convolution media feature, so as to obtain a convolution fusion feature. The up-sampling feature fusion processing shown in fig. 4 is performed on the N convolution media features (N takes the value 4 in fig. 4, but other values of N are not limited), so as to obtain convolution fusion features, such as the convolution fusion features 403 and 403a shown in fig. 4. In particular, the N convolution media features may be formed into M feature pairs, where M is a positive integer less than N, and optionally each feature pair includes two different convolution media features. For example, two adjacent convolution media features among the N convolution media features may form one feature pair; as shown in fig. 4, a feature pair is formed of the convolution media feature 401b and the convolution media feature 401c, a feature pair is formed of the convolution media feature 401a and the convolution media feature 401b, and a feature pair is formed of the convolution media feature 401 and the convolution media feature 401a, that is, the convolution media features included in a feature pair are adjacent to each other. Alternatively, a feature pair may be formed of two convolution media features that are spaced apart among the N convolution media features; for example, the convolution media feature 401a and the convolution media feature 401c may form one feature pair, which is not limited here.
Specifically, taking one feature pair as an example, the convolution sizes of the first convolution media feature and the second convolution media feature included in the feature pair can be obtained, where the convolution size of the first convolution media feature is smaller than that of the second convolution media feature. A first sampling parameter of the feature pair is determined based on the convolution sizes of the first convolution media feature and the second convolution media feature; up-sampling processing is performed on the first convolution media feature based on the first sampling parameter to obtain a first sampling feature of the first convolution media feature; and feature fusion processing is performed on the first sampling feature and the second convolution media feature to obtain the convolution fusion feature corresponding to the feature pair. Similarly, a convolution fusion feature corresponding to each of the M feature pairs may be obtained; that is, the number of convolution fusion features may be considered to be M. The symbol labeled 402 in fig. 4 denotes the feature fusion processing.
Optionally, up-sampling fusion processing may be performed based on the Nth convolution media feature and the (N-1)th convolution media feature among the N convolution media features to obtain a first convolution fusion feature. Specifically, the (N-i+1)th convolution fusion feature can be obtained by performing up-sampling fusion processing on the i-th convolution media feature and the (N-i)th convolution fusion feature among the N convolution media features, where i is a positive integer less than or equal to N, until i reaches a default threshold value, so as to obtain M convolution fusion features; the default threshold value can be set as required, such as 2 or 3. Specifically, as shown in fig. 4, up-sampling processing is performed on the convolution media feature 401c to obtain an up-sampled feature; feature fusion is performed on the convolution media feature 401b and the up-sampled feature of the convolution media feature 401c to obtain the convolution fusion feature 403a; up-sampling processing is performed on the convolution fusion feature 403a to obtain an up-sampled feature of the convolution fusion feature 403a; and feature fusion is performed on the up-sampled feature of the convolution fusion feature 403a and the convolution media feature 401a to obtain the convolution fusion feature 403. In this example, the M convolution fusion features include the convolution fusion feature 403 and the convolution fusion feature 403a.
The process of performing up-sampling feature fusion processing on the N convolution media features is not limited to the above; other fusion manners may also be used. For example, fusion may be performed between a convolution fusion feature and a non-adjacent convolution media feature, such as performing feature fusion between the up-sampled convolution fusion feature 403a in fig. 4 and the convolution media feature 401, which is not limited herein.
Feature fusion processing as shown in fig. 4 is performed on the N convolution media features and the convolution fusion features; in fig. 4, the feature fusion module 404 performs feature fusion on the convolution media features and the convolution fusion features, so that the media data feature corresponding to the object media data can be obtained. Second sampling parameters corresponding to each convolution fusion feature are acquired, and up-sampling processing is performed on the convolution fusion features based on their corresponding second sampling parameters to obtain second sampling features, that is, second sampling features corresponding to the M convolution fusion features are obtained. Optionally, if the convolution size of a convolution fusion feature is the largest among the M convolution fusion features, the second sampling parameter of that convolution fusion feature may be determined to be 1, that is, that convolution fusion feature may not be up-sampled. The M second sampling features are fused to obtain the media data feature corresponding to the object media data.
Optionally, a third convolution media feature with the minimum convolution size can be obtained from the N convolution media features, a third sampling parameter of the third convolution media feature is determined based on the convolution sizes corresponding to the M convolution fusion features, and up-sampling processing is performed on the third convolution media feature to obtain a third sampling feature; the second sampling features corresponding to the M convolution fusion features can also be obtained as described above. Feature fusion processing is performed on the M second sampling features and the third sampling feature to obtain the media data feature corresponding to the object media data. For example, assuming that the convolution size of the convolution media feature 401c is 1/16, the convolution size of the convolution fusion feature 403a is 1/8, and the convolution size of the convolution fusion feature 403 is 1/4, the third sampling parameter of the convolution media feature 401c may be determined to be 4, the second sampling parameter of the convolution fusion feature 403a to be 2, and the second sampling parameter of the convolution fusion feature 403 to be 1. The convolution media feature 401c is up-sampled based on its third sampling parameter to obtain an up-sampled feature of the convolution media feature 401c, and the convolution fusion feature 403a is up-sampled based on its second sampling parameter to obtain an up-sampled feature of the convolution fusion feature 403a. The convolution fusion feature 403, the up-sampled feature of the convolution media feature 401c, and the up-sampled feature of the convolution fusion feature 403a may then be fused by the feature fusion module 404 to obtain the media data feature corresponding to the object media data 400, as sketched below. Through multi-layer convolution and up-sampling, higher-level semantic features can be obtained. The convolution kernels of each convolution layer have different sizes, so the obtained implicit features of the object media data are also different, and the implicit features of the object media data can be captured more finely. Through feature extraction and fusion at different layers and in different dimensions, the obtained media data feature represents the features in the object media data more comprehensively and richly; that is, a media data feature containing more information can be obtained, so that the features referenced and used in the subsequent object separation process are richer, which further improves the accuracy of object separation.
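The following is a minimal sketch of one way the up-sampling feature fusion and the final feature fusion could be realized, assuming element-wise addition for the pairwise fusion and channel concatenation in the feature fusion module; the shapes and the choice of fusion operators are assumptions, not the definitive implementation of the embodiment.

```python
import torch
import torch.nn.functional as F

def fuse_media_features(conv_feats):
    """conv_feats: list of N convolution media features ordered fine -> coarse
    (e.g. sizes 1/4, 1/8, 1/16, 1/32 of the input), all with the same channel count."""
    # top-down upsample-and-add produces the convolution fusion features
    fused = [conv_feats[-1]]
    for feat in reversed(conv_feats[:-1]):
        up = F.interpolate(fused[-1], size=feat.shape[-2:], mode="nearest")
        fused.append(feat + up)                       # pairwise feature fusion processing
    fused = fused[::-1]                               # finest resolution first
    # bring every feature to the finest size (second/third sampling parameters)
    target = fused[0].shape[-2:]
    resized = [F.interpolate(f, size=target, mode="nearest") if f.shape[-2:] != target else f
               for f in fused]
    # final feature fusion: concatenate along channels to form the media data feature
    return torch.cat(resized, dim=1)

# usage with dummy features (assumed shapes)
feats = [torch.randn(1, 64, 64, 64), torch.randn(1, 64, 32, 32),
         torch.randn(1, 64, 16, 16), torch.randn(1, 64, 8, 8)]
media_data_feature = fuse_media_features(feats)       # shape (1, 256, 64, 64)
```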
It should be appreciated that the processes of performing multi-layer convolution on the initial media features, performing up-sampling feature fusion processing on the N convolution media features, performing feature fusion processing on the N convolution media features and the convolution fusion features, and the like may be implemented by the MobileNetV and ResNet-18 structures. Optionally, these processes can also be realized through DenseNet, ResNet-34, ResNet-50, and ResNet-101 network structures.
Step S102, extracting features of the media data features by adopting the first channel attention parameter to obtain first type object features corresponding to the media data features.
Specifically, the first channel attention parameter is obtained by training for the first type object. As shown in fig. 4, a first type object feature 405 may be derived from the output of the feature fusion module based on the first channel attention parameter. The first channel attention parameter may be a parameter of a channel attention convolution (SE-Conv) module. In particular, the media data feature may be considered to comprise media channel features under k channels, that is, k media channel features constitute the media data feature, where k is a positive integer. The differences between the first type object and the second type object in characteristics such as color and font can lead to a part of the channels being strongly associated with the first type object and another part of the channels being strongly associated with the second type object, and the channels strongly associated with the first type object are not identical to the channels strongly associated with the second type object. The first channel attention parameter may be used to represent first channel weights of the k media channel features. Since the first channel attention parameter is a parameter trained for the first type object, it may be considered that, in the first channel attention parameter, the first channel weights of the channels strongly associated with the first type object are greater than the first channel weights of the other channels. By performing feature extraction on the k media channel features through the first channel attention parameter, features related to the first type object can be enhanced, and features unrelated to the first type object or with a small degree of association can be weakened, so as to extract the first type object feature corresponding to the first type object from the media data feature. For example, if the first type object is a template-printed character and the second type object is a needle-printed character, the first type object feature associated with the template-printed characters may be extracted from the media data feature by the first channel attention parameter.
Optionally, the number of first channel attention parameters may be r, where r is a positive integer; that is, r SE-Conv modules may be added to the first recognition module, each SE-Conv module including one first channel attention parameter, and the media data feature is processed sequentially by the r SE-Conv modules to obtain the first type object feature corresponding to the media data feature. In other words, feature extraction is performed on the media data feature sequentially using the r first channel attention parameters to obtain the first type object feature corresponding to the media data feature. The value of r can be set as required. For example, when r is 1, the speed of feature extraction can be increased, that is, the efficiency of separating the features of the first type object and the second type object in the media data feature is improved; when r is greater than 1, the features of the first type object and the features of the second type object in the media data feature can be separated and adjusted multiple times, thereby improving the accuracy of object feature separation. For example, when r is greater than or equal to 2, feature extraction is performed, using the j-th first channel attention parameter, on the attention feature extracted by the (j-1)th first channel attention parameter, so as to obtain the attention feature extracted by the j-th first channel attention parameter, where j is a positive integer and j is less than or equal to r. When j is r, the attention feature extracted by the r-th first channel attention parameter is obtained and determined as the first type object feature. When j is 1, feature extraction is performed on the media data feature using the first one of the first channel attention parameters to obtain the attention feature extracted by that parameter. As shown in fig. 4, assuming that r is 2, feature extraction is performed on the media data feature using the first one of the first channel attention parameters (that is, j is 1) to obtain the attention feature 417, and feature extraction is performed on the attention feature 417 using the second one of the first channel attention parameters (that is, j is 2) to obtain the attention feature 405; since j is now r, the attention feature 405 is determined as the first type object feature.
Step S103, extracting features of the media data features by adopting second channel attention parameters to obtain second type object features corresponding to the media data features, wherein the first channel attention parameters and the second channel attention parameters are obtained by training under different object types.
Specifically, the second channel attention parameter is obtained by training for the second type object. As shown in fig. 4, a second type object feature 411 may be derived from the output of the feature fusion module based on the second channel attention parameter. The second channel attention parameter may be a parameter of a channel attention convolution (Squeeze-and-Excitation Convolution, SE-Conv) module. In particular, the media data feature may be considered to comprise media channel features under k channels, that is, k media channel features constitute the media data feature, where k is a positive integer. The differences between the first type object and the second type object in characteristics such as color and font can lead to a part of the channels being strongly associated with the first type object and another part of the channels being strongly associated with the second type object, and the channels strongly associated with the first type object are not identical to the channels strongly associated with the second type object. The second channel attention parameter may be used to represent second channel weights of the k media channel features. Since the second channel attention parameter is a parameter trained for the second type object, it may be considered that, in the second channel attention parameter, the second channel weights of the channels strongly associated with the second type object are greater than the second channel weights of the other channels. By performing feature extraction on the k media channel features through the second channel attention parameter, features related to the second type object can be enhanced, and features unrelated to the second type object or with a small degree of association can be weakened, so as to extract the second type object feature corresponding to the second type object from the media data feature. For example, if the first type object is a template-printed character and the second type object is a needle-printed character, the second type object feature associated with the needle-printed characters may be extracted from the media data feature by the second channel attention parameter.
Optionally, the number of second channel attention parameters may be r, where r is a positive integer; that is, r SE-Conv modules may be added to the second recognition module, each SE-Conv module including one second channel attention parameter, and the media data feature is processed sequentially by the r SE-Conv modules to obtain the second type object feature corresponding to the media data feature. In other words, feature extraction is performed on the media data feature sequentially using the r second channel attention parameters to obtain the second type object feature corresponding to the media data feature. The value of r can be set as required. For example, when r is 1, the speed of feature extraction can be increased, that is, the efficiency of separating the features of the first type object and the second type object in the media data feature is improved; when r is greater than 1, the features of the first type object and the features of the second type object in the media data feature can be separated and adjusted multiple times, thereby improving the accuracy of object feature separation. For example, when r is greater than or equal to 2, feature extraction is performed, using the t-th second channel attention parameter, on the attention feature extracted by the (t-1)th second channel attention parameter, so as to obtain the attention feature extracted by the t-th second channel attention parameter, where t is a positive integer and t is less than or equal to r. When t is r, the attention feature extracted by the r-th second channel attention parameter is obtained and determined as the second type object feature. When t is 1, feature extraction is performed on the media data feature using the first one of the second channel attention parameters to obtain the attention feature extracted by that parameter. As shown in fig. 4, assuming that r is 2, feature extraction is performed on the media data feature using the first one of the second channel attention parameters (that is, t is 1) to obtain the attention feature 418, and feature extraction is performed on the attention feature 418 using the second one of the second channel attention parameters (that is, t is 2) to obtain the attention feature 411; since t is now r, the attention feature 411 is determined as the second type object feature.
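For illustration, the following is a minimal sketch of an SE-Conv style module and of the two branches described in steps S102 and S103, assuming a standard squeeze-and-excitation layout; the exact SE-Conv structure, channel counts, and reduction ratio are assumptions rather than the definitive design. Each branch stacks r = 2 such modules with its own trained channel attention parameters, mirroring the examples above: the excitation weights act as the channel attention parameter, enhancing the channels strongly associated with one object type and weakening the rest.

```python
import torch
import torch.nn as nn

class SEConv(nn.Module):
    """Sketch of a channel attention convolution (SE-Conv) module (layout is an assumption)."""
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.squeeze = nn.AdaptiveAvgPool2d(1)                     # summarize the k media channel features
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        x = self.conv(x)
        w = self.excite(self.squeeze(x).flatten(1))                # per-channel weights in [0, 1]
        return x * w.view(x.size(0), -1, 1, 1)                     # re-weight the channel features

# two branches sharing the media data feature but using separately trained parameters
media_data_feature = torch.randn(1, 256, 64, 64)                   # assumed shape, e.g. from the fusion sketch
first_branch = nn.Sequential(SEConv(), SEConv())                   # r = 2 first channel attention parameters
second_branch = nn.Sequential(SEConv(), SEConv())                  # r = 2 second channel attention parameters
first_type_object_feature = first_branch(media_data_feature)       # e.g. template-printed characters
second_type_object_feature = second_branch(media_data_feature)     # e.g. needle-printed characters
```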
By adding the first channel attention parameter and the second channel attention parameter, characteristic information such as the different colors and fonts of the needle-printed characters and the template-printed characters can be distinguished, improving the recall index (Recall) for these characters by 8% and the precision index (Precision) by 7%. The first channel attention parameter and the second channel attention parameter are obtained through training and can be used to distinguish, within the media data feature, the features respectively associated with the first type object and the second type object, thereby achieving the purpose of separating the type objects in the object media data, such as separating the needle-printed characters from the template-printed characters in the medical bill data.
Step S104, the first type object feature is identified to obtain the position information of the first type object in the object media data, and the second type object feature is identified to obtain the position information of the second type object in the object media data.
Specifically, feature recognition can be performed on the features of the first type object through the first recognition module, so that the position information of the first type object in the object media data is obtained. And performing feature recognition on the second type object features through a second recognition module to obtain the position information of the second type object in the object media data.
Feature preprocessing is performed on the first type object feature to obtain a first probability distribution map and a first threshold map corresponding to the first type object in the object media data.
Binarization processing is performed on the first probability distribution map and the first threshold map to obtain a first approximate binary map corresponding to the first type object in the object media data.
The position information of the first type object in the object media data is determined according to the first approximate binary map.
Feature preprocessing is performed on the second type object feature to obtain a second probability distribution map and a second threshold map corresponding to the second type object in the object media data.
Binarization processing is performed on the second probability distribution map and the second threshold map to obtain a second approximate binary map corresponding to the second type object in the object media data.
The position information of the second type object in the object media data is determined according to the second approximate binary map.
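One common way to relate a probability distribution map, a threshold map, and an approximate binary map is a differentiable-binarization-style formulation; the following sketch assumes that formulation and a greatly simplified position extraction step, neither of which is asserted to be the exact method of this embodiment.

```python
import torch

def approximate_binary_map(prob_map, thresh_map, k=50.0):
    # prob_map, thresh_map: (N, 1, H, W) tensors in [0, 1];
    # assumed DB-style formula: B = 1 / (1 + exp(-k * (P - T)))
    return torch.sigmoid(k * (prob_map - thresh_map))

def positions_from_binary_map(binary_map, score=0.3):
    """Very small stand-in for position extraction: returns one bounding box covering
    all foreground pixels; real systems would use connected components or contours."""
    mask = binary_map[0, 0] > score
    if not mask.any():
        return []
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return [(xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())]

# usage with dummy maps (assumed shapes)
prob = torch.rand(1, 1, 64, 64)
thresh = torch.rand(1, 1, 64, 64)
binary = approximate_binary_map(prob, thresh)
boxes = positions_from_binary_map(binary)
```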
Referring to fig. 4, in the first identifying module 410, a first channel attention parameter may be used to perform feature extraction on a media data feature to obtain a first type object feature 405 corresponding to the media data feature, perform feature preprocessing on the first type object feature 405 to obtain a first probability distribution map 406 and a first threshold map 407 corresponding to the first type object in the object media data, perform binarization processing on the first probability distribution map 406 and the first threshold map 407 to obtain a first approximate binary map 408 corresponding to the first type object in the object media data 400, and determine the position information 409 of the first type object in the object media data according to the first approximate binary map 408. The first type object features 405 may be feature identified by the first identification module 410 to obtain location information 409 of the first type object in the object media data.
Referring to fig. 4 again, in the second identifying module 416, feature extraction may be performed on the media data features by using the second channel attention parameter to obtain second type object features 411 corresponding to the media data features, feature preprocessing is performed on the second type object features 411 to obtain a second probability distribution map 412 and a second threshold map 413 corresponding to the second type object in the object media data, binarization processing is performed on the second probability distribution map 412 and the second threshold map 413 to obtain a second approximate binary map 414 corresponding to the second type object in the object media data 400, and location information 415 of the second type object in the object media data is determined according to the second approximate binary map 414. The second type object feature 411 may be feature identified by the second identification module 416 to obtain location information 415 of the second type object in the object media data.
It can be understood that the overall architecture in fig. 4 includes obtaining the object media data 400, convolving the initial media features of the object media data 400 to obtain N convolution media features (such as the convolution media feature 401, the convolution media feature 401a, the convolution media feature 401b, and the convolution media feature 401c), performing up-sampling processing and convolution fusion on the convolution media features, and performing feature fusion through the feature fusion module 404; the first recognition module 410 and the second recognition module 416 form the text detection and character separation module.
In the embodiment of the application, the media data feature corresponding to the object media data is acquired; feature extraction is performed on the media data feature using the first channel attention parameter to obtain the first type object feature corresponding to the media data feature; feature extraction is performed on the media data feature using the second channel attention parameter to obtain the second type object feature corresponding to the media data feature; feature recognition is performed on the first type object feature to obtain the position information of the first type object in the object media data; and feature recognition is performed on the second type object feature to obtain the position information of the second type object in the object media data. The application can reduce the difficulty of identifying the position information, thereby reducing the difficulty of structured extraction of the data content information. Based on media data characteristics such as color and font, the application can separate the characters printed in needle format from the characters printed in template format on the medical bill data, thereby reducing the character recognition difficulty caused by the overlapping and occlusion of the needle-printed characters and the template-printed characters, further reducing the difficulty of structured extraction of the medical bill data content information, and obtaining more complete medical bill data information. By adopting the method and the device, the different position information of the first type object and the second type object can be identified more accurately, and the accuracy of separating the position information in the object media data (such as medical bill data) is improved.
Further, referring to fig. 5, fig. 5 is a flow chart of a data processing method according to an embodiment of the application. As shown in fig. 5, the data processing method may include at least the following steps S201 to S206.
In step S201, a first sample tag corresponding to the first type object in the sample media data is obtained, and a second sample tag corresponding to the second type object in the sample media data is obtained.
Specifically, the first sample tag may be manually annotated labeling information, in the sample media data, corresponding to the first type object. The second sample tag may be manually annotated labeling information, in the sample media data, corresponding to the second type object. The sample media data may be media data that includes both the first type object and the second type object, for example, invoices, courier waybills, and the like that carry both template-printed and needle-printed content. Referring to fig. 6, fig. 6 is a schematic diagram of sample media data according to an embodiment of the application. As shown in fig. 6, the first type object may be the template-printed font portion shown in fig. 6, and the second type object may be the needle-printed font portion shown in fig. 6.
Step S202, inputting the sample media data into an initial position identification model, and acquiring sample data characteristics corresponding to the sample media data in the initial position identification model.
Specifically, referring to fig. 7, fig. 7 is a schematic structural diagram of an initial position recognition model according to an embodiment of the present application. As shown in fig. 7, the initial sample features of the sample media data 503 are obtained. The sample media data may refer to data in which the same page contains multiple object types, such as invoice information of the user, medical bill data, an acquisition form, or an information storage address of the user, and the sample media data carries color or font labels. The initial sample features may be a feature mapping result of the sample media data; that is, the sample media data is mapped into a feature map to obtain features that can represent semantic information of the sample media data (such as color value information, texture information, and content meaning). In other words, the initial sample features are a feature mapping of the sample media data: for example, if the sample media data is an image, the image is mapped into features after feature extraction processing, and the initial sample features are the resulting feature map of the sample media data. The initial sample features of the sample media data 503 may be convolved with N convolution kernels to yield N sample convolution media features, such as a sample convolution media feature 504, a sample convolution media feature 504a, a sample convolution media feature 504b, and a sample convolution media feature 504c. The up-sampling feature fusion processing shown in fig. 7 is performed on the N sample convolution media features (N takes the value 4 in fig. 7, but other values of N are not limited), so as to obtain sample fusion features, such as the sample fusion features 506 and 506a shown in fig. 7. In particular, the N sample convolution media features may be combined into M sample feature pairs, where M is a positive integer less than N, and optionally each sample feature pair includes two different sample convolution media features. For example, two adjacent sample convolution media features among the N sample convolution media features may be combined into one sample feature pair; as shown in fig. 7, a sample feature pair consists of the sample convolution media feature 504b and the sample convolution media feature 504c, a sample feature pair consists of the sample convolution media feature 504a and the sample convolution media feature 504b, and a sample feature pair consists of the sample convolution media feature 504 and the sample convolution media feature 504a, that is, the sample convolution media features included in a sample feature pair are adjacent to each other. Alternatively, a sample feature pair may consist of two sample convolution media features that are spaced apart among the N sample convolution media features, such as combining the sample convolution media feature 504a and the sample convolution media feature 504c into one sample feature pair, which is not limited here. The symbol labeled 505 in fig. 7 denotes the feature fusion processing.
In fig. 7, for the specific acquisition processes of the sample convolution media features 504, 504a, 504b, and 504c, the sample feature fusion processing 505, and the sample fusion features 506 and 506a, reference may be made to the corresponding descriptions of the convolution media features 401, 401a, 401b, and 401c, the feature fusion processing 402, and the convolution fusion features 403 and 403a in fig. 4, which are not repeated here.
In step S203, in the initial position recognition model, feature extraction is performed on the sample data features by using the first initial channel attention parameter, so as to obtain first type sample features corresponding to the sample data features.
The specific process of this step may be referred to the specific description of step S102 in the embodiment corresponding to fig. 3a, and will not be described herein.
In step S204, in the initial position recognition model, feature extraction is performed on the sample data features by using the second initial channel attention parameter, so as to obtain second type sample features corresponding to the sample data features.
The specific process of this step may be referred to the specific description of step S103 in the embodiment corresponding to fig. 3a, and will not be described herein.
In step S205, in the first type recognition network of the initial position recognition model, the first type sample feature is subjected to feature recognition to obtain the position information of the first type object in the sample media data, and in the second type recognition network of the initial position recognition model, the second type sample feature is subjected to feature recognition to obtain the position information of the second type object in the sample media data.
The specific process of this step may be referred to the specific description of step S104 in the embodiment corresponding to fig. 3a, and will not be described herein.
Referring to fig. 7 together, in the first sample recognition module 500, assuming that the number r of first initial channel attention parameters is 2, feature extraction is performed on the sample data features using the first one of the first initial channel attention parameters to obtain the sample attention feature 518, and feature extraction is performed on the sample attention feature 518 using the second one of the first initial channel attention parameters to obtain the sample attention feature 508, which is determined as the first type sample feature. Feature preprocessing is performed on the first type sample feature 508 to obtain a first sample probability distribution map 509 and a first sample threshold map 510 corresponding to the first type object in the sample media data; binarization processing is performed on the first sample probability distribution map 509 and the first sample threshold map 510 to obtain a first sample approximate binary map 511 corresponding to the first type object in the sample media data; and position information 512 of the first type object in the sample media data 503 is determined according to the first sample approximate binary map 511. That is, the first type sample feature 508 may be feature-recognized by the first sample recognition module 500 to obtain the position information 512 of the first type object in the sample media data.
Referring to fig. 7 again, in the second sample recognition module 501, assuming that the number r of second initial channel attention parameters is 2, feature extraction is performed on the sample data features by using the first of the second initial channel attention parameters to obtain the sample attention feature 519, and feature extraction is then performed by using the second of the second initial channel attention parameters to obtain the sample attention feature 513, which at this time is determined as the second type sample feature. Feature preprocessing is performed on the second type sample feature 513 to obtain a second sample probability distribution map 514 and a second sample threshold map 515 corresponding to the second type object in the sample media data, binarization processing is performed on the second sample probability distribution map 514 and the second sample threshold map 515 to obtain a second sample approximate binary map 516 corresponding to the second type object in the sample media data, and position information 517 of the second type object in the sample media data 503 is determined according to the second sample approximate binary map 516. In other words, the second type sample feature 513 is subjected to feature recognition by the second sample recognition module 501 to obtain the position information 517 of the second type object in the sample media data.
In fig. 7, the specific acquisition processes of the first type sample feature 508, the first sample probability distribution map 509, the first sample threshold map 510, the first sample approximate binary map 511, the position information 512 of the first type object in the sample media data, the second type sample feature 513, the second sample probability distribution map 514, the second sample threshold map 515, the second sample approximate binary map 516, and the position information 517 of the second type object in the sample media data may refer to the specific acquisition processes of the first type object feature 405, the first probability distribution map 406, the first threshold map 407, the first approximate binary map 408, the position information 409 of the first type object in the object media data, the second type object feature 411, the second probability distribution map 412, the second threshold map 413, the second approximate binary map 414, and the position information 415 of the second type object in the object media data in the embodiment corresponding to fig. 4, and will not be repeated herein.
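For illustration only, the following is a minimal sketch of how two independent groups of channel attention parameters (with r = 2 blocks per group) might extract the type-specific sample features from the shared sample data features, taking the output of the last block of each group as the first type sample feature and the second type sample feature, respectively. The SE-style squeeze-excitation form, the module names, and the channel sizes are assumptions rather than the exact structure of the application.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention block (assumed form of one 'channel attention parameter')."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                      # re-weight the channels of the shared feature

class TypeBranch(nn.Module):
    """r stacked channel attention blocks; the last block's output is the type-specific feature."""
    def __init__(self, channels: int, r: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(ChannelAttention(channels) for _ in range(r))

    def forward(self, x):
        for block in self.blocks:        # e.g. block 1 -> feature 518/519, block 2 -> feature 508/513
            x = block(x)
        return x

sample_data_features = torch.randn(1, 256, 160, 160)      # shared sample data features (assumed shape)
first_branch = TypeBranch(256, r=2)                       # first initial channel attention parameters
second_branch = TypeBranch(256, r=2)                      # second initial channel attention parameters
first_type_sample_features = first_branch(sample_data_features)
second_type_sample_features = second_branch(sample_data_features)
```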
In step S206, according to the position information of the first type object in the sample media data and the first sample tag, and the position information of the second type object in the sample media data and the second sample tag, parameter adjustment is performed on the initial position recognition model to obtain a position recognition model comprising the first channel attention parameter and the second channel attention parameter.
Referring to fig. 7 again, a first sample loss function is obtained according to the position information of the first type object in the sample media data and the first sample tag. A second sample loss function is obtained according to the position information of the second type object in the sample media data and the second sample tag. A sample equalization loss function is obtained according to the correlation between the position information of the first type object in the sample media data and the position information of the second type object in the sample media data; the equalization module 502 is configured to perform equalization processing between the first type object and the second type object. Parameter adjustment is then performed on the initial position recognition model according to the first sample loss function, the second sample loss function, and the sample equalization loss function, so as to obtain a position recognition model comprising a first channel attention parameter and a second channel attention parameter. Through this training process, the first channel attention parameter and the second channel attention parameter obtained by training achieve the effect shown in fig. 4.
It should be appreciated that the first sample recognition module, the second sample recognition module, and the sample equalization module may be trained in an alternating iterative manner. The alternating iterative manner may include a single-module dynamic-parameter training manner: when the first sample recognition module is trained, the parameters of the second sample recognition module and the parameters of the sample equalization module are kept unchanged; when the second sample recognition module is trained, the parameters of the first sample recognition module and the parameters of the sample equalization module are kept unchanged; and when the sample equalization module is trained, the parameters of the first sample recognition module and the parameters of the second sample recognition module are kept unchanged. In this single-module dynamic-parameter manner, the trained first sample recognition module, second sample recognition module, and sample equalization module can be obtained quickly. Alternatively, the alternating iterative manner may include a two-module dynamic-parameter training manner: the first sample recognition module and the second sample recognition module may be trained together while the parameters of the sample equalization module are kept unchanged; the first sample recognition module and the sample equalization module may be trained together while the parameters of the second sample recognition module are kept unchanged; or the second sample recognition module and the sample equalization module may be trained together while the parameters of the first sample recognition module are kept unchanged. The first sample recognition module and the second sample recognition module may be trained together in the manner of an adversarial network; when the first sample recognition module and the sample equalization module, or the second sample recognition module and the sample equalization module, are trained together, the weight distribution may be equal, and the initial weight distribution may also be equal. Through the two-module dynamic-parameter training, a first sample recognition module, a second sample recognition module, and a sample equalization module with a stronger degree of association can be obtained.
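As a non-limiting sketch of the single-module dynamic-parameter manner described above, the parameters of the modules that are not currently being trained can be frozen while the active module is updated. The module objects, the optimizer, and the dummy loss below are placeholders, not the application's actual modules or loss functions.

```python
import torch
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of one module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Placeholder modules standing in for the first sample recognition module (500),
# the second sample recognition module (501) and the sample equalization module (502).
modules = {"first": nn.Linear(8, 8), "second": nn.Linear(8, 8), "equal": nn.Linear(8, 8)}
optimizer = torch.optim.SGD([p for m in modules.values() for p in m.parameters()], lr=1e-3)

def dummy_loss() -> torch.Tensor:
    """Stand-in for the first/second/equalization sample loss functions."""
    x = torch.randn(4, 8)
    return sum(m(x).pow(2).mean() for m in modules.values())

# Single-module dynamic-parameter manner: in each phase only one module is updated,
# and the parameters of the other two modules are kept unchanged.
for active_name in ["first", "second", "equal"]:             # alternate over the three modules
    for name, m in modules.items():
        set_trainable(m, name == active_name)
    optimizer.zero_grad()
    loss = dummy_loss()
    loss.backward()
    optimizer.step()                                          # only the active module's parameters move
```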
According to the position information of the first type object in the sample media data and the first sample tag, a first sample loss function is obtained, which may be specifically shown in formula ①:
L1 = Ls + α × Lb + β × Lt ①
Wherein, in formula ①, Ls represents the loss function of the first probability distribution map, Lt represents the loss function of the first threshold map, Lb represents the loss function of the first approximate binary map, α represents the weight for the first approximate binary map, for example, the value range of α may be [0.8, 1.2], and β represents the weight for the first threshold map, for example, the value range of β may be [0.8, 1.2].
The loss function of the first probability distribution map and the loss function of the first approximate binary map may be binary cross entropy, which in its general form may be specifically shown in formula ②:
Ls = Lb = -Σi∈Si ( yi × log(xi) + (1 - yi) × log(1 - xi) ) ②
Wherein, in formula ②, Si may be the sample media data, xi may be the first type sample feature, and yi may be the position information of the first type object in the sample media data.
The loss function of the first threshold map may be an L1 distance, which may be represented by formula ③:
Lt = Σi∈Rd | yi* - xi* | ③
Wherein, in formula ③, Rd may represent the thresholded offset region, xi* may be the corresponding feature of the sample data feature in the first threshold map, and yi* may represent the first threshold annotation result.
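For illustration, the following sketch computes the first sample loss function of formula ①, using the binary cross entropy form of formula ② for the probability distribution map and the approximate binary map, and the assumed L1 form of formula ③ for the threshold map; the function name, the mask tensor, and the default weights are assumptions.

```python
import torch
import torch.nn.functional as F

def first_sample_loss(prob_map, binary_map, thresh_map,
                      prob_gt, thresh_gt, thresh_mask,
                      alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """L1 = Ls + alpha * Lb + beta * Lt (formula ①), with Ls/Lb as binary cross entropy
    (formula ②) and Lt as an L1 distance over the region Rd (assumed form of formula ③)."""
    ls = F.binary_cross_entropy(prob_map, prob_gt)           # loss of the probability distribution map
    lb = F.binary_cross_entropy(binary_map, prob_gt)         # loss of the approximate binary map
    lt = (torch.abs(thresh_map - thresh_gt) * thresh_mask).sum() / thresh_mask.sum().clamp(min=1.0)
    return ls + alpha * lb + beta * lt

# Toy usage with random maps in [0, 1]; alpha and beta lie in the stated range [0.8, 1.2].
p = torch.rand(1, 1, 64, 64)
b = torch.rand(1, 1, 64, 64)
t = torch.rand(1, 1, 64, 64)
gt = (torch.rand(1, 1, 64, 64) > 0.5).float()
t_gt = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.3).float()
loss = first_sample_loss(p, b, t, gt, t_gt, mask, alpha=1.0, beta=1.0)
```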
According to the position information of the second type object in the sample media data and the second sample tag, a second sample loss function is obtained, which may be specifically shown in formula ④:
L2 = Ls + α × Lb + β × Lt ④
Wherein, in formula ④, Ls represents the loss function of the second probability distribution map, Lt represents the loss function of the second threshold map, Lb represents the loss function of the second approximate binary map, α represents the weight for the second approximate binary map, the value range of α may be [0.8, 1.2], β represents the weight for the second threshold map, and the value range of β may be [0.8, 1.2].
The loss function of the second probability distribution map and the loss function of the second approximate binary map may also be binary cross entropy, which in its general form may be specifically shown in formula ⑤:
Ls = Lb = -Σi∈Si ( yi × log(xi) + (1 - yi) × log(1 - xi) ) ⑤
Wherein, in formula ⑤, Si may be the sample media data, xi may be the second type sample feature, and yi may be the position information of the second type object in the sample media data.
According to the correlation between the position information of the first type object in the sample media data and the position information of the second type object in the sample media data, a sample equalization loss function is obtained, which may be specifically shown in formula ⑥:
L3 = γ × L1 + (1 - γ) × L2 + Lf ⑥
Wherein, in formula ⑥, L1 may represent the first sample loss function, L2 may represent the second sample loss function, γ may represent the weight of the first sample loss function, (1 - γ) may represent the weight of the second sample loss function, and Lf may represent the basic equalization loss function.
The basic equalization loss function may be as shown in formula ⑦:
Lf = -αt × (1 - pt)^δ × log(pt) ⑦
Wherein, in formula ⑦, pt represents the probability that the sample media data is the first type object, αt represents the equalization coefficient for different samples, the value range of αt may be [0.1, 0.5], δ represents the exponent, and the value range of δ may be [1, 2].
Referring to fig. 8a, fig. 8a is a schematic diagram of a basic equalization loss function according to an embodiment of the present application. As shown in fig. 8a, the abscissa represents the probability that the sample media data is the first type object, and the ordinate represents the loss value of the basic equalization loss function; when the exponent δ takes a value in [1, 2], the basic equalization loss function behaves in a relatively balanced manner for simple samples whose predicted probability is greater than 0.5 and difficult samples whose predicted probability is less than or equal to 0.5.
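The following is a minimal sketch of the basic equalization loss function of formula ⑦ and the sample equalization loss function of formula ⑥; the function names and the default values of αt, δ, and γ are assumptions chosen within the ranges stated above.

```python
import torch

def basic_equalization_loss(p_t: torch.Tensor, alpha_t: float = 0.25, delta: float = 2.0) -> torch.Tensor:
    """Lf = -alpha_t * (1 - p_t)^delta * log(p_t)   (formula ⑦).
    p_t is the predicted probability that the sample media data is the first type object."""
    return (-alpha_t * (1.0 - p_t).pow(delta) * torch.log(p_t.clamp(min=1e-6))).mean()

def sample_equalization_loss(l1: torch.Tensor, l2: torch.Tensor, p_t: torch.Tensor,
                             gamma: float = 0.5) -> torch.Tensor:
    """L3 = gamma * L1 + (1 - gamma) * L2 + Lf      (formula ⑥)."""
    return gamma * l1 + (1.0 - gamma) * l2 + basic_equalization_loss(p_t)

# Toy usage: l1 and l2 stand in for the first and second sample loss functions.
l1 = torch.tensor(0.8)
l2 = torch.tensor(0.6)
p_t = torch.rand(16).clamp(0.05, 0.95)
l3 = sample_equalization_loss(l1, l2, p_t)
```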
And according to the first sample loss function, the second sample loss function and the sample equalization loss function, carrying out parameter adjustment on the first initial channel attention parameter and the second initial channel attention parameter to obtain a first channel attention parameter, a second channel attention parameter and a position identification model comprising the first channel attention parameter and the second channel attention parameter.
The sample training test results may be compared as shown in Table 1:
TABLE 1
Wherein, Table 1 may indicate that both the channel attention convolution module and the equalization module help to improve the separation effect of the first type object and the second type object. The application adopts two independent channel attention convolution modules, which are respectively used for acquiring the media convolution characteristics of the printed characters and the media convolution characteristics of the machine-input characters, and in this respect differs from the single convolution module of other deep learning neural networks. The first channel attention parameter may be a parameter configured to acquire printed characters, the first channel attention parameter having a weight of 95% for acquiring printed characters and a weight of 5% for acquiring machine-input characters. The second channel attention parameter may be a parameter configured to acquire machine-input characters, the second channel attention parameter having a weight of 95% for acquiring machine-input characters and a weight of 5% for acquiring printed characters. By distinguishing the media data characteristics of printed characters and machine-input characters in terms of different colors, fonts, and the like, the Recall index (Recall) is improved by 4% and the Precision index (Precision) is improved by 8% for machine-input and printed characters. The application has strong generalization and can perform character separation processing on different medical bill data; by detecting the text of the medical bill data, the character separation model network can well separate the category and position information of printed characters and machine-input characters, which reduces the subsequent recognition difficulty of the different characters, improves the correctness of identifying the related content information of the medical bill data, further reduces the difficulty of extracting the structured information of the medical bill data, accelerates the structured output of field information formed by characters, and greatly reduces the input time of manual examination. The application also adds a sample balancing module (Focal Loss) to balance samples, which further improves the Recall index (Recall) by 3% and the Precision index (Precision) by 2% for the characters.
The application differs from the traditional approach of adding a layer separation module or other modularization to the system and then performing structured extraction of the initial media data for the bill content within the system.
In addition, the application adopts the feature layers of a shared backbone network, which reduces the time consumption of the whole module and limits the network prediction time to within 500 ms, so that the printed characters and the machine-input characters in the bill pictures can be separated in real time.
Referring again to fig. 7, where 500 may represent a training portion for a first sample loss function, 501 may represent a training portion for a second sample loss function, and 502 may represent a training portion for a sample equalization loss function.
In the embodiment of the application, a first sample loss function, a second sample loss function, and a sample equalization loss function are introduced, and position recognition training can be performed according to the first sample loss function, the second sample loss function, and the sample equalization loss function. In actual system testing, a character separation model that has been trained on other preset scenes can be fine-tuned so that it is suitable for application in a new scene, which saves the early preparation time of the character separation system and allows the position area of the printed characters and the position area of the machine-input characters to be identified more efficiently and conveniently, so that the printed characters and the machine-input characters are separated. According to different media data characteristics, the application can combine the media data characteristics, such as color and font, of the characters printed onto the medical bill data in needle or template format with those of the pre-printed characters, and separate the machine-input characters on the medical bill data from the printed characters, thereby reducing the character recognition difficulty caused by overlapping and occlusion between the machine-input characters and the printed characters, further reducing the difficulty of structured extraction of the medical bill data content information, and perfecting the complete medical bill data information. By adopting the application, the different position information of the first type object and the second type object can be identified more accurately, and the accuracy of the position information of the separated object media data (such as medical bill data) is improved.
Referring to fig. 8b, fig. 8b is a schematic flow chart for separating position information according to an embodiment of the application. As shown in fig. 8b, initial media data may be input, and target media data may be obtained through direction adjustment. And extracting the characteristics of different channels according to the first channel attention parameter and the second channel attention parameter to obtain the first type object characteristics and the second type object characteristics. And inputting the characteristics of the first type object and the characteristics of the second type object into a balance module for characteristic balance correction, so that the position information of the first type object in the object media data and the position information of the second type object in the object media data can be obtained.
Referring to fig. 8c, fig. 8c is a schematic flow chart for separating test position information according to an embodiment of the application. As shown in fig. 8c, initial media data for testing may be input, and the target media data for testing may be obtained through direction adjustment. And performing feature detection of different channels according to the first channel attention parameter and the second channel attention parameter to obtain a first type object feature for testing and a second type object feature for testing. From the first type of object feature for testing and the second type of object feature for testing, position information of the first type of object for testing in the object media data and position information of the second type of object for testing in the object media data can be obtained.
Further, referring to fig. 9a, fig. 9a is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (comprising program code) running in a computer device, for example, the data processing apparatus being an application software, the apparatus being adapted to perform the corresponding steps of the method provided by the embodiments of the application. As shown in fig. 9a, the data processing apparatus 1 may include an object feature acquisition module 11, a first feature extraction module 12, a second feature extraction module 13, and a first feature recognition module 14.
An object feature obtaining module 11, configured to obtain a media data feature corresponding to the object media data;
A first feature extraction module 12, configured to perform feature extraction on the media data feature by using a first channel attention parameter, so as to obtain a first type object feature corresponding to the media data feature;
the second feature extraction module 13 is configured to perform feature extraction on the media data feature by using a second channel attention parameter to obtain a second type object feature corresponding to the media data feature;
The first feature recognition module 14 is configured to perform feature recognition on the first type object feature to obtain location information of the first type object in the object media data, and input the second type object feature into the location recognition model to perform feature recognition to obtain location information of the second type object in the object media data.
The specific functional implementation manners of the object feature obtaining module 11, the first feature extracting module 12, the second feature extracting module 13, and the first feature identifying module 14 may be referred to step S101-step S104 in the corresponding embodiment of fig. 3a, and will not be described herein.
The data processing device 1 further comprises a semantic analysis module 15 and a direction adjustment module 16.
The semantic analysis module 15 is configured to obtain initial media data, perform semantic analysis on the initial media data, and obtain content direction information in the initial media data;
the direction adjustment module 16 is configured to perform direction adjustment on the initial media data based on the content direction information, so as to obtain target media data.
The specific functional implementation manner of the semantic parsing module 15 and the direction adjusting module 16 may refer to step S101 in the corresponding embodiment of fig. 3a, and will not be described herein.
Referring again to fig. 9a, the semantic parsing module 15 includes:
An activating unit 151, configured to perform convolution and activation processing on the initial media data, so as to obtain a preliminary activation feature corresponding to the initial media data;
a second converting unit 152, configured to perform feature conversion processing on the preliminary activation feature based on the hourglass layer, so as to obtain a media adjustment feature;
the pooling unit 153 is configured to pool the media adjustment feature to obtain a pooled feature, and perform full-connection prediction processing on the pooled feature to obtain content direction information in the initial media data.
The specific functional implementation manner of the activating unit 151, the second converting unit 152, and the pooling unit 153 may refer to step S101 in the corresponding embodiment of fig. 3a, and will not be described herein.
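Purely as an illustration of the semantic parsing flow above (convolution and activation, hourglass-based feature conversion, pooling, and fully connected prediction), the following hedged sketch approximates the hourglass layer with a simple down-sampling/up-sampling pair; the channel sizes and the four assumed direction classes (0°, 90°, 180°, 270°) are illustrative only and are not the application's exact structure.

```python
import torch
import torch.nn as nn

class ContentDirectionHead(nn.Module):
    """Predicts content direction information (e.g. 0/90/180/270 degrees, assumed classes)."""
    def __init__(self, num_directions: int = 4):
        super().__init__()
        # Convolution and activation -> preliminary activation features
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Hourglass-style feature conversion, approximated here by an encoder/decoder pair
        self.hourglass = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)          # pooling features
        self.fc = nn.Linear(32, num_directions)      # full-connection prediction

    def forward(self, initial_media_data):
        x = self.stem(initial_media_data)
        x = self.hourglass(x)                        # media adjustment features
        x = self.pool(x).flatten(1)
        return self.fc(x)                            # content direction logits

logits = ContentDirectionHead()(torch.randn(1, 3, 224, 224))
direction = logits.argmax(dim=1)                     # used to rotate the initial media data
```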
Referring to fig. 9a, the object feature obtaining module 11 includes:
An obtaining unit 111, configured to obtain object media data, and perform feature extraction processing on the object media data to obtain initial media features of the object media data;
The first conversion unit 112 is configured to perform feature conversion on the initial media feature to obtain a media data feature corresponding to the object media data.
The specific functional implementation manner of the obtaining unit 111 and the first converting unit 112 may refer to step S101 in the corresponding embodiment of fig. 3a, which is not described herein.
Referring again to fig. 9a, the first conversion unit 112 includes:
a convolution subunit 1121, configured to perform multi-layer convolution on the initial media feature to obtain N convolved media features;
an upsampling fusion subunit 1122, configured to perform upsampling feature fusion processing on the N convolutionally media features to obtain convolutionally fused features;
and the feature fusion subunit 1123 is configured to perform feature fusion processing on the N convolution media features and the convolution fusion features, so as to obtain a media data feature corresponding to the object media data.
The specific functional implementation manner of the convolution subunit 1121, the upsampling fusion subunit 1122, and the feature fusion subunit 1123 may be referred to step S101 in the corresponding embodiment of fig. 3a, which is not described herein.
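As a hedged sketch of the work of the convolution subunit 1121, the upsampling fusion subunit 1122, and the feature fusion subunit 1123, the following assumes an FPN-like structure with N = 3 convolution levels; the layer widths, the bilinear upsampling, and the concatenation-based fusion are assumptions rather than the exact operations of the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MediaFeatureFusion(nn.Module):
    """Multi-layer convolution (1121) -> up-sampling fusion (1122) -> feature fusion (1123)."""
    def __init__(self, in_ch: int = 64, n_levels: int = 3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1) for _ in range(n_levels)
        )

    def forward(self, initial_media_features):
        # N convolution media features at progressively smaller resolutions
        feats, x = [], initial_media_features
        for conv in self.convs:
            x = F.relu(conv(x))
            feats.append(x)
        # Up-sampling feature fusion: upsample deeper features to the first level and sum them
        target = feats[0].shape[-2:]
        fused = sum(F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in feats)
        # Feature fusion of the N convolution features with the fused feature (concatenation, assumed)
        resized = [F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in feats]
        return torch.cat(resized + [fused], dim=1)    # media data features

media_data_features = MediaFeatureFusion()(torch.randn(1, 64, 160, 160))
```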
Referring again to fig. 9a, the first feature recognition module 14 further includes:
a first preprocessing unit 141, configured to perform feature preprocessing on features of a first type of object, so as to obtain a first probability distribution map and a first threshold mapping map corresponding to the first type of object in the object media data;
a first mapping unit 142, configured to binarize the first probability distribution map and the first threshold map to obtain a first approximate binary map corresponding to the first type object in the object media data;
a first determining unit 143 for determining position information of the first type object in the object media data according to the first approximate binary image;
a second preprocessing unit 144, configured to perform feature preprocessing on the features of the second type object, so as to obtain a second probability distribution map and a second threshold mapping map corresponding to the second type object in the object media data;
A second mapping unit 145, configured to binarize the second probability distribution map and the second threshold map to obtain a second approximate binary map corresponding to the second type object in the object media data;
A second determining unit 146 for determining the position information of the second type object in the object media data based on the second approximate binary image.
The specific functional implementation manner of the first preprocessing unit 141, the first mapping unit 142, the first determining unit 143, the second preprocessing unit 144, the second mapping unit 145, and the second determining unit 146 may refer to step S104 in the corresponding embodiment of fig. 3a, and will not be described herein.
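The following sketch illustrates one possible way the binarization processing of the first mapping unit 142 and the second mapping unit 145 could turn a probability distribution map and a threshold mapping map into an approximate binary map, and then into coarse position information; the differentiable-binarization form with steepness k and the single bounding-box extraction are assumptions rather than the application's exact procedure.

```python
import torch

def approximate_binary_map(prob_map: torch.Tensor, thresh_map: torch.Tensor, k: float = 50.0) -> torch.Tensor:
    """Approximate (differentiable) binarization of the probability and threshold maps (assumed form)."""
    return torch.sigmoid(k * (prob_map - thresh_map))

def boxes_from_binary_map(binary_map: torch.Tensor, min_score: float = 0.5):
    """Very small stand-in for position extraction: one bounding box around all positive pixels."""
    ys, xs = torch.nonzero(binary_map > min_score, as_tuple=True)
    if ys.numel() == 0:
        return []
    return [(xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())]

prob = torch.rand(64, 64)
thresh = torch.full((64, 64), 0.5)
binary = approximate_binary_map(prob, thresh)
positions = boxes_from_binary_map(binary)      # coarse position information of the object in the media data
```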
In the embodiment of the application, a first sample loss function, a second sample loss function, and a sample equalization loss function are introduced, and position recognition training can be performed according to the first sample loss function, the second sample loss function, and the sample equalization loss function. The application can identify the different position information of the first type object and the second type object more accurately, and improve the accuracy of the position information of the separated object media data (such as medical bill data). The application can effectively learn the concentrated and stably distributed media data characteristics of printed characters and of machine-input characters for medical bill data from different areas. In actual system testing, a character separation model that has been trained on other preset scenes can be fine-tuned so that it is suitable for application in a new scene, which saves the early preparation time of the character separation system and allows the position area of the printed characters and the position area of the machine-input characters to be identified more efficiently and conveniently, so that the printed characters and the machine-input characters are separated.

Further, referring to fig. 9b, fig. 9b is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (comprising program code) running in a computer device, for example, the data processing apparatus is application software, and the apparatus is adapted to perform the corresponding steps of the method provided by the embodiments of the application. As shown in fig. 9b, the data processing apparatus 2 may include a sample tag acquisition module 21, a sample feature acquisition module 22, a third feature extraction module 23, a fourth feature extraction module 24, a second feature recognition module 25, and a model determination module 26.
A sample tag obtaining module 21, configured to obtain a first sample tag corresponding to a first type object in sample media data, and obtain a second sample tag corresponding to a second type object in sample media data;
the sample feature obtaining module 22 is configured to input sample media data into an initial position identification model, and obtain sample data features corresponding to the sample media data in the initial position identification model;
the third feature extraction module 23 is configured to perform feature extraction on the sample data features by using the first initial channel attention parameter in the initial position identification model, so as to obtain first type sample features corresponding to the sample data features;
A fourth feature extraction module 24, configured to perform feature extraction on the sample data features by using a second initial channel attention parameter in the initial position identification model, so as to obtain second type sample features corresponding to the sample data features;
The second feature recognition module 25 is configured to perform feature recognition on the first type sample feature in a first type recognition network of the initial position recognition model to obtain position information of the first type object in the sample media data, and to perform feature recognition on the second type sample feature in a second type recognition network of the initial position recognition model to obtain position information of the second type object in the sample media data;
The model determining module 26 is configured to perform parameter adjustment on the initial position identification model according to the position information of the first type object in the sample media data and the first sample tag, and the position information of the second type object in the sample media data and the second sample tag, so as to obtain a position identification model including the first channel attention parameter and the second channel attention parameter.
The specific functional implementation manners of the sample tag obtaining module 21, the sample feature obtaining module 22, the third feature extracting module 23, the fourth feature extracting module 24, the second feature identifying module 25, and the model determining module 26 may be referred to the step S201-step S206 in the corresponding embodiment of fig. 5, and will not be described herein.
Referring again to fig. 9b, the model determining module 26 includes:
A first function obtaining unit 261, configured to obtain a first sample loss function according to the position information of the first type object in the sample media data and the first sample tag;
A second function obtaining unit 262, configured to obtain a second sample loss function according to the position information of the second type object in the sample media data and the second sample tag;
An equalization function obtaining unit 263, configured to obtain a sample equalization loss function according to a correlation between the position information of the first type object in the sample media data and the position information of the second type object in the sample media data;
The model determining unit 264 is configured to perform parameter adjustment on the initial position identification model according to the first sample loss function, the second sample loss function, and the sample equalization loss function, so as to obtain a position identification model including a first channel attention parameter and a second channel attention parameter.
The specific function implementation manners of the first function obtaining unit 261, the second function obtaining unit 262, the equalization function obtaining unit 263 and the model determining unit 264 may refer to step S206 in the corresponding embodiment of fig. 5, and are not described herein.
In the embodiment of the application, a first sample loss function, a second sample loss function, and a sample equalization loss function are introduced, and position recognition training can be performed according to the first sample loss function, the second sample loss function, and the sample equalization loss function. By adopting the application, the different position information of the first type object and the second type object can be identified more accurately, and the accuracy of the position information of the separated object media data (such as medical bill data) is improved. According to different media data characteristics, the application can combine the media data characteristics, such as color and font, of the characters printed onto the medical bill data in needle or template format with those of the pre-printed characters, and separate the machine-input characters on the medical bill data from the printed characters, thereby reducing the character recognition difficulty caused by overlapping and occlusion between the machine-input characters and the printed characters, further reducing the difficulty of structured extraction of the medical bill data content information, and perfecting the complete medical bill data information.
Further, referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 10, the computer device 1000 may include at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 10, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 1000 shown in fig. 10, the network interface 1004 may provide network communication functions, while the user interface 1003 is mainly used as an interface for providing input to a user, and the processor 1001 may be used to invoke a device control application program stored in the memory 1005 to realize:
The method comprises the steps of obtaining media data characteristics corresponding to object media data, carrying out characteristic extraction on the media data characteristics by adopting a first channel attention parameter to obtain first type object characteristics corresponding to the media data characteristics, carrying out characteristic extraction on the media data characteristics by adopting a second channel attention parameter to obtain second type object characteristics corresponding to the media data characteristics, wherein the first channel attention parameter and the second channel attention parameter are obtained by training under different object types, carrying out characteristic identification on the first type object characteristics to obtain position information of the first type object in the object media data, and carrying out characteristic identification on the second type object characteristics to obtain position information of the second type object in the object media data.
In one embodiment, the processor 1001 further performs the steps of:
The method comprises the steps of obtaining initial media data, carrying out semantic analysis on the initial media data to obtain content direction information in the initial media data, and carrying out direction adjustment on the initial media data based on the content direction information to obtain object media data.
In one embodiment, the processor 1001 further performs the following steps when performing semantic parsing on the initial media data to obtain content direction information in the initial media data:
The method comprises the steps of carrying out convolution and activation processing on initial media data to obtain initial activation characteristics corresponding to the initial media data, carrying out characteristic conversion processing on the initial activation characteristics based on an hourglass layer to obtain media adjustment characteristics, carrying out pooling processing on the media adjustment characteristics to obtain pooling characteristics, and carrying out full-connection prediction processing on the pooling characteristics to obtain content direction information in the initial media data.
In one embodiment, the processor 1001 further performs the following steps when the media data feature corresponding to the object media data is to be acquired:
The method comprises the steps of obtaining object media data, carrying out feature extraction processing on the object media data to obtain initial media features of the object media data, and carrying out feature conversion on the initial media features to obtain media data features corresponding to the object media data.
In one embodiment, when the processor 1001 performs feature conversion on the initial media feature to obtain the media data feature corresponding to the object media data, the following steps are further performed:
The method comprises the steps of carrying out multi-layer convolution on initial media characteristics to obtain N convolution media characteristics, carrying out up-sampling characteristic fusion processing on the N convolution media characteristics to obtain convolution fusion characteristics, and carrying out characteristic fusion processing on the N convolution media characteristics and the convolution fusion characteristics to obtain media data characteristics corresponding to object media data.
In one embodiment, when the processor 1001 performs feature recognition on the first type of object feature to obtain the location information of the first type of object in the object media data, and performs feature recognition on the second type of object feature to obtain the location information of the second type of object in the object media data, the following steps are further performed:
The method comprises the steps of performing feature preprocessing on first type object features to obtain a first probability distribution map and a first threshold value mapping map corresponding to the first type object in object media data, performing binarization processing on the first probability distribution map and the first threshold value mapping map to obtain a first approximate binary map corresponding to the first type object in the object media data, determining position information of the first type object in the object media data according to the first approximate binary map, performing feature preprocessing on second type object features to obtain a second probability distribution map and a second threshold value mapping map corresponding to the second type object in the object media data, performing binarization processing on the second probability distribution map and the second threshold value mapping map to obtain a second approximate binary map corresponding to the second type object in the object media data, and determining position information of the second type object in the object media data according to the second approximate binary map.
In one embodiment, the processor 1001 further performs the following steps: obtaining a first sample tag corresponding to a first type object in sample media data, and obtaining a second sample tag corresponding to a second type object in the sample media data; inputting the sample media data into an initial position recognition model, and obtaining sample data features corresponding to the sample media data in the initial position recognition model; in the initial position recognition model, carrying out feature extraction on the sample data features by adopting a first initial channel attention parameter to obtain first type sample features corresponding to the sample data features; in the initial position recognition model, carrying out feature extraction on the sample data features by adopting a second initial channel attention parameter to obtain second type sample features corresponding to the sample data features; carrying out feature recognition on the first type sample features in a first type recognition network of the initial position recognition model to obtain position information of the first type object in the sample media data, and carrying out feature recognition on the second type sample features in a second type recognition network of the initial position recognition model to obtain position information of the second type object in the sample media data; and carrying out parameter adjustment on the initial position recognition model according to the position information of the first type object in the sample media data and the first sample tag, and the position information of the second type object in the sample media data and the second sample tag, so as to obtain a position recognition model comprising the first channel attention parameter and the second channel attention parameter.
In one embodiment, the processor 1001 performs the following steps when performing parameter adjustment on the initial position identification model according to the position information of the first type object in the sample media data and the first sample tag, and the position information of the second type object in the sample media data and the second sample tag, to obtain the position identification model including the first channel attention parameter and the second channel attention parameter:
Obtaining a first sample loss function according to the position information of the first type object in the sample media data and the first sample tag, obtaining a second sample loss function according to the position information of the second type object in the sample media data and the second sample tag, obtaining a sample equalization loss function according to the correlation between the position information of the first type object in the sample media data and the position information of the second type object in the sample media data, and carrying out parameter adjustment on the initial position recognition model according to the first sample loss function, the second sample loss function, and the sample equalization loss function to obtain a position recognition model comprising the first channel attention parameter and the second channel attention parameter.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the description of the data processing method in the embodiments corresponding to fig. 2, 3a, 3b, 4, 5, 6, 7, 8a, 8b and 8c, the description of the data processing apparatus 1 in the embodiments corresponding to fig. 9a, and the description of the data processing apparatus 2 in the embodiments corresponding to fig. 9b, which will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted.
Embodiments of the present application further provide a computer readable storage medium storing a computer program, where the computer program includes program instructions, where the program instructions implement, when executed by a processor, a data processing method provided by each step in fig. 2, fig. 3a, fig. 3b, fig. 4, fig. 5, fig. 6, fig. 7, fig. 8a, fig. 8b, and fig. 8c, and specifically refer to an implementation provided by each step in fig. 2, fig. 3a, fig. 3b, fig. 4, fig. 5, fig. 6, fig. 7, fig. 8a, fig. 8b, and fig. 8c, which will not be described herein. In addition, the description of the beneficial effects of the same method is omitted.
The computer readable storage medium may be the data processing apparatus provided in any one of the foregoing embodiments or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD), etc. that are provided on the computer device. Further, the computer-readable storage medium may also include both internal storage units and external storage devices of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device can execute the data processing method in the embodiments corresponding to fig. 2, 3a, 3b, 4, 5, 6, 7, 8a, 8b, and 8c, which are not described herein. In addition, the description of the beneficial effects of the same method is omitted.
The term "comprising" and any variations thereof in the description of embodiments of the application and in the claims and drawings is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowchart and/or schematic structural diagrams of the method provided in the embodiments of the present application, and each flow and/or block of the flowchart and/or schematic structural diagrams of the method may be implemented by computer program instructions, and combinations of flows and/or blocks in the flowchart and/or block diagrams. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or structural diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structures.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.