CN116469121A - Learning object recognition method, device, equipment and storage medium - Google Patents

Learning object recognition method, device, equipment and storage medium

Info

Publication number
CN116469121A
Authority
CN
China
Prior art keywords
type
target
feature vector
page image
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310604922.2A
Other languages
Chinese (zh)
Inventor
兴百桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xingtong Technology Co ltd
Original Assignee
Shenzhen Xingtong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xingtong Technology Co ltd filed Critical Shenzhen Xingtong Technology Co ltd
Priority to CN202310604922.2A
Publication of CN116469121A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/06 Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G09B5/065 Combinations of audio and video presentations, e.g. videotapes, videodiscs, television systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a learning object recognition method, apparatus, device, and storage medium. The method comprises the following steps: acquiring a target page image containing a learning object; performing object recognition processing on the target page image by using a pre-trained object recognition model to determine a first type of the learning object; performing feature extraction processing on the target page image by using a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object; and determining a target type of the learning object in the target page image based on the first type, the target feature vector and the second type. In this way, the recognition accuracy for learning objects is improved even for page images of poor quality and/or page images containing a plurality of different learning objects with small differences, so that the type of the learning object can be accurately recognized, an explanation video matched with the learning object can be loaded, and the learning interest of students is raised.

Description

Learning object recognition method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a learning object recognition method, device, equipment and storage medium.
Background
When a student learns new knowledge in a classroom or in their spare time, in order to raise the student's interest in learning, the student can photograph a page containing a learning object with a learning machine; the learning machine then identifies the learning object from the captured page image and loads an explanation video of the learning object to explain it to the student.
In order to identify a learning object from a page image, the related art adopts a multi-model fusion method. However, in many cases the multi-model fusion method cannot accurately identify the type of the learning object, especially when the quality of the page image captured by the learning machine is poor and/or when the differences between the learning objects contained in the page image are small, so the loaded explanation video may not match the learning object. It is therefore necessary to provide a learning object recognition method with high accuracy.
Disclosure of Invention
In order to solve the above technical problems, the present disclosure provides a learning object recognition method, device, equipment and storage medium.
In a first aspect, the present disclosure provides a learning object recognition method, the method including:
Acquiring a target page image containing a learning object;
performing object recognition processing on the target page image by utilizing a pre-trained object recognition model, and determining a first type of the learning object;
performing feature extraction processing on the target page image by utilizing a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object;
and determining a target type of the learning object in the target page image based on the first type, the target feature vector and the second type.
In a second aspect, the present disclosure provides a learning object recognition apparatus, the apparatus including:
the image acquisition module is used for acquiring a target page image containing a learning object;
the object recognition module is used for carrying out object recognition processing on the target page image by utilizing a pre-trained object recognition model and determining a first type of the learning object;
the feature extraction module is used for carrying out feature extraction processing on the target page image by utilizing a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object;
and the type determining module is used for determining the target type of the learning object in the target page image based on the first type, the target feature vector and the second type.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the method provided in the first aspect.
In a fourth aspect, embodiments of the present disclosure further provide a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, causes the processor to implement the method provided in the first aspect above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the embodiment of the disclosure relates to a learning object identification method, a learning object identification device, learning object identification equipment and a storage medium, wherein a target page image containing a learning object is acquired; performing object recognition processing on the target page image by utilizing a pre-trained object recognition model, and determining a first type of a learning object; performing feature extraction processing on the target page image by utilizing a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object; the target type of the learning object in the target page image is determined based on the first type, the target feature vector, and the second type of the learning object. Therefore, for the page images containing the learning objects, object recognition and feature extraction can be performed on the target page images by using two different models, and the final category of the learning objects is determined by combining the object recognition result and the feature extraction result, so that the recognition precision of the learning objects is improved, and for page images with poor quality and/or page images containing a plurality of different learning objects with small differences, the types of the learning objects can be accurately recognized, and further the explanation videos matched with the learning objects are loaded, so that the learning interest of students is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a learning object recognition method according to an embodiment of the disclosure;
fig. 2 is a schematic flow chart of S140 in fig. 1 according to an embodiment of the disclosure;
fig. 3 is a schematic flow chart of S220 in fig. 2 according to an embodiment of the disclosure;
fig. 4 is a flowchart of another learning object recognition method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a learning object recognition device according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "an" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
In order to improve recognition accuracy of a learning object, a learning object recognition method provided by an embodiment of the present disclosure is described below with reference to fig. 1 to 4. In the embodiment of the disclosure, the learning object recognition method may be performed by an electronic device or a server. The electronic device may include a mobile phone, a tablet computer, a desktop computer, a notebook computer, and other devices with communication functions. The server may be a cloud server or a server cluster, or other devices with storage and computing functions. Note that the following embodiments are exemplarily explained with the electronic device as an execution subject.
Fig. 1 shows a flowchart of a learning object recognition method according to an embodiment of the present disclosure.
As shown in fig. 1, the learning object recognition method may include the following steps.
S110, acquiring a target page image containing a learning object.
In this embodiment, when a student photographs a page containing a learning object with an electronic device (e.g., a learning machine), the captured page image is taken as the target page image, or a target page image is generated after the page image is preprocessed and subjected to page detection, so that the type of the learning object can be identified from the target page image.
In some embodiments, the resolution corresponding to the target page image is greater than or equal to a preset resolution threshold, that is, the target page image is a high-resolution image, and for such an image, the recognition difficulty of the learning object class is low, and the type recognition may be performed by using a multi-model fusion method or a recognition method described in the following embodiments.
In other embodiments, the resolution corresponding to the target page image is smaller than the preset resolution threshold, that is, the target page image is an image with lower resolution, and for such an image, the recognition difficulty of the category of the learning object is higher, and it is difficult to accurately recognize the category of the learning object from the target page image by adopting the multi-model fusion method, and the recognition method described in the following embodiment needs to be adopted for type recognition.
In still other embodiments, the target page image contains a plurality of learning objects, and the similarity between any two learning objects is less than or equal to a preset similarity threshold; that is, the differences between the learning objects contained in the target page image are large. For such images, the recognition difficulty of the learning object class is low, and type recognition may be performed by using a multi-model fusion method or the recognition method described in the following embodiments.
In still other embodiments, the target page image contains a plurality of learning objects, and the similarity between any two learning objects is greater than the preset similarity threshold; that is, the differences between the learning objects contained in the target page image are small. For such images, the recognition difficulty of the learning object class is high, and it is necessary to perform type recognition with the recognition method described in the following embodiments.
In other embodiments, the target page image may be both a low-resolution image and one containing a plurality of learning objects whose pairwise similarity is greater than the preset similarity threshold. The recognition difficulty of such images is higher still, and type recognition needs to be performed with the recognition method described below in this embodiment.
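For illustration only, the gating just described can be sketched in Python as follows; the concrete threshold values and the helper name is_hard_case are assumptions of the sketch, since the disclosure does not fix numeric thresholds.

    import numpy as np

    # Illustrative values only; the disclosure leaves both thresholds unspecified.
    RESOLUTION_THRESHOLD = 1280 * 720   # "preset resolution threshold" (pixels)
    SIMILARITY_THRESHOLD = 0.8          # "preset similarity threshold"

    def is_hard_case(image_pixels: int, object_embeddings: np.ndarray) -> bool:
        """Decide whether a page image needs the recognition method of this
        embodiment (True) or can be handled by multi-model fusion (False)."""
        low_resolution = image_pixels < RESOLUTION_THRESHOLD

        similar_objects = False
        if len(object_embeddings) > 1:
            # Cosine similarity between every pair of learning-object embeddings.
            normed = object_embeddings / np.linalg.norm(
                object_embeddings, axis=1, keepdims=True)
            sims = normed @ normed.T
            np.fill_diagonal(sims, 0.0)
            similar_objects = sims.max() > SIMILARITY_THRESHOLD

        return low_resolution or similar_objects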
Alternatively, the learning objects in the target page image may include, but are not limited to, celestial objects, flowers, trees, and other objects, without limitation.
In this embodiment, optionally, S110 specifically includes: acquiring an initial page image; performing four-point frame detection processing on the initial page image by using a pre-trained page detection model to obtain page position information of the initial page image; cutting out candidate page images from the initial page images according to page position information; correcting the initial shape of the pages in the candidate page images to be a target shape to obtain target page images.
Specifically, scaling an initial page image to a fixed size and performing image data standardization processing to obtain a preprocessed initial page image, detecting position coordinate information of page content in the image from the preprocessed initial page image by using a page detection model to obtain page position information, further cutting an image containing page content from the preprocessed initial page image based on the page position information to serve as a candidate page image, and finally correcting the page shape in the candidate page image to be a normal shape to obtain a target page image.
The initial page image may be a page image photographed by an electronic device such as a learning machine. The page detection model may include, but is not limited to, a Mask R-CNN network or the like. The page position information refers to the position coordinate information of the page content in the image. In general, the normal shape of a page is rectangular, but photographing distorts the page into a trapezoid, so the initial shape of the page in the image is trapezoidal while the target shape is rectangular. In order to improve the recognition accuracy of the learning object, the trapezoidal page therefore needs to be corrected into a rectangular one.
Therefore, by performing page detection processing and page correction processing on the captured image, a target page image containing a normally shaped page is obtained, which further improves the recognition accuracy of the learning object.
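As an illustration of the correction step, the following sketch warps a trapezoidal page back to a rectangle with OpenCV, assuming the page detection model has already produced the four corner points (the page position information); the helper name rectify_page and the corner ordering are assumptions of the sketch.

    import cv2
    import numpy as np

    def rectify_page(initial_image: np.ndarray, corners: np.ndarray) -> np.ndarray:
        """Warp a trapezoidal page, given by four corner points ordered
        top-left, top-right, bottom-right, bottom-left, back to a rectangle."""
        tl, tr, br, bl = corners.astype(np.float32)
        width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
        height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
        target = np.array([[0, 0], [width - 1, 0],
                           [width - 1, height - 1], [0, height - 1]],
                          dtype=np.float32)
        matrix = cv2.getPerspectiveTransform(corners.astype(np.float32), target)
        return cv2.warpPerspective(initial_image, matrix, (width, height))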
S120, performing object recognition processing on the target page image by utilizing a pre-trained object recognition model, and determining a first type of a learning object.
In some embodiments, for a target page image containing multiple learning objects with small differences, and/or for a target page image with low resolution, the type recognition difficulty is greater, and the object recognition model needs to pay attention to strong features (e.g., difference features) and weaken weak features (e.g., the same or similar features) of different objects to improve the recognition accuracy of the category. Correspondingly, S120 specifically includes: based on a self-attention network in the object recognition model, performing self-attention processing on the target page image according to the target coefficient to obtain a first self-attention characteristic; and classifying the first self-attention characteristic based on a classification network in the object recognition model to obtain a first type.
Specifically, focusing on the strong features of different objects can be understood as amplifying them, with the target coefficient acting as the weight of the strong features; weakening the weak features of different objects can be understood as suppressing them, with the target coefficient acting as the weight of the weak features. In this case the target coefficient satisfies Y = f((W / max(SUM(W) - W, W)) · X + b), where Y is the output value of a neuron of the self-attention network, f(·) is the activation function, W is the weight of the connection between a neuron of the self-attention network and a neuron of the classification network, X is the input value of the neuron of the self-attention network, and b is a bias value. Therefore, for a target page image whose type is difficult to recognize, the target coefficient amplifies strong features and shrinks weak features, thereby improving the recognition accuracy of the learning object.
In other embodiments, for a target page image including a plurality of learning objects with larger differences, and/or for a target page image with higher resolution, the type recognition difficulty is smaller, and the object recognition model can ensure the recognition accuracy of the category without focusing on strong features and weakening weak features of different objects. Correspondingly, S120 specifically includes: based on a self-attention network in the object recognition model, performing self-attention processing on the target page image according to the target coefficient to obtain a first self-attention characteristic; and classifying the first self-attention characteristic based on a classification network in the object recognition model to obtain a first type.
In this case the target coefficient satisfies Y = f(W · X + b), where Y is the output value of a neuron of the self-attention network, f(·) is the activation function, W is the weight of the connection between a neuron of the self-attention network and a neuron of the classification network, X is the input value of the neuron of the self-attention network, and b is a bias value. Therefore, for a target page image whose type is easy to recognize, the recognition accuracy of the learning object can be ensured with this target coefficient, without amplifying strong features or shrinking weak features.
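A minimal PyTorch sketch of the two coefficient variants is given below. It is one possible reading of the formulas, taking max(SUM(W) - W, W) elementwise and ReLU as the activation f; it is not asserted to be the exact network of the disclosure.

    import torch
    import torch.nn.functional as F

    def self_attention_output(x: torch.Tensor,
                              weight: torch.Tensor,
                              bias: torch.Tensor,
                              hard_case: bool) -> torch.Tensor:
        """Compute Y = f(W' · X + b), with W' chosen by recognition difficulty.

        hard_case=True:  W' = W / max(SUM(W) - W, W), read here as an elementwise
                         rescaling that amplifies strong (distinctive) weights
                         and suppresses weak (shared) ones.
        hard_case=False: W' = W, the ordinary affine map.
        """
        if hard_case:
            w_sum = weight.sum()                               # SUM(W), a scalar
            weight = weight / torch.maximum(w_sum - weight, weight)
        return F.relu(F.linear(x, weight, bias))               # f(.) taken as ReLU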
Alternatively, the object recognition model may include, but is not limited to, ResNet50, VGG, ViT, or similar models. The self-attention network may specifically be the layer preceding the first classification network.
In order to improve the recognition effect of the model, before the target page image is input into the object recognition model, preprocessing operations such as scaling to a standard size, image data standardization and the like can be performed on the target page image, so that the object recognition model performs self-attention processing on the preprocessed target page image, and further, a first classification network in the object recognition model is utilized to determine a classification label of a learning object, and the meaning represented by the classification label is used as a first type.
In conclusion, for target page images with different recognition difficulties, object recognition models with different weights are used for type recognition, adapting the method to type recognition of images in different scenes.
And S130, performing feature extraction processing on the target page image by utilizing a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object.
In some embodiments, for a target page image including a plurality of learning objects with smaller differences, and/or for a target page image with lower resolution, the type recognition difficulty is greater, so that the object recognition model needs to pay attention to strong features and weaken weak features of different objects, so as to improve the recognition accuracy of the category. Correspondingly, S130 specifically includes: based on the self-attention network in the feature vector extraction model, performing self-attention processing on the target page image according to the target coefficient to obtain a second self-attention feature; and classifying the second self-attention feature based on a classification network in the feature vector extraction model to obtain a target feature vector and a second type of the learning object.
In this case the target coefficient satisfies Y = f((W / max(SUM(W) - W, W)) · X + b), where Y is the output value of a neuron of the self-attention network, f(·) is the activation function, W is the weight of the connection between a neuron of the self-attention network and a neuron of the classification network, X is the input value of the neuron of the self-attention network, and b is a bias value. Specifically, the target coefficient is used as the weight of the strong features to amplify them, and as the weight of the weak features to suppress them. Therefore, for a target page image whose type is difficult to recognize, the target coefficient amplifies strong features and shrinks weak features, thereby improving the recognition accuracy of the learning object.
In other embodiments, for a target page image including a plurality of learning objects with larger differences, and/or for a target page image with higher resolution, the type recognition difficulty is smaller, and the object recognition model can ensure the recognition accuracy of the category without focusing on strong features and weakening weak features of different objects. Correspondingly, S130 specifically includes: based on the self-attention network in the feature vector extraction model, performing self-attention processing on the target page image according to the target coefficient to obtain a second self-attention feature; and classifying the second self-attention feature based on a classification network in the feature vector extraction model to obtain a target feature vector and a second type of the learning object.
In this case the target coefficient satisfies Y = f(W · X + b), where Y is the output value of a neuron of the self-attention network, f(·) is the activation function, W is the weight of the connection between a neuron of the self-attention network and a neuron of the classification network, X is the input value of the neuron of the self-attention network, and b is a bias value. Therefore, for a target page image whose type is easy to recognize, the recognition accuracy of the learning object can be ensured with this target coefficient, without amplifying strong features or shrinking weak features.
Alternatively, the feature vector extraction model may include, but is not limited to, an EfficientNet model, and the feature vector extraction model is different from the object recognition model. The self-attention network may specifically be the layer preceding the second classification network.
In order to improve the recognition effect of the model, before the target page image is input into the feature vector extraction model, preprocessing operations such as scaling to a standard size and image data standardization can be performed on the target page image, so that the feature vector extraction model performs self-attention processing on the preprocessed target page image, and further, a second classification network in the feature vector extraction model is utilized to determine classification labels and target feature vectors of learning objects, and meanings represented by the classification labels are used as a second type.
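As a hedged illustration, a backbone with both outputs (the target feature vector and the second-type classification) can be sketched as follows; the torchvision EfficientNet-B0 trunk, the pooling layer and the embedding dimension 1280 are assumptions of the sketch, since the disclosure only names EfficientNet as one possible model.

    import torch
    import torch.nn as nn
    from torchvision.models import efficientnet_b0

    class FeatureVectorExtractor(nn.Module):
        """Backbone with two outputs: the target feature vector (embedding)
        and the logits whose argmax gives the second type."""

        def __init__(self, num_classes: int):
            super().__init__()
            backbone = efficientnet_b0(weights=None)   # untrained, for illustration
            self.trunk = backbone.features             # convolutional feature maps
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.classifier = nn.Linear(1280, num_classes)  # 1280 = B0 feature dim

        def forward(self, page_image: torch.Tensor):
            vec = self.pool(self.trunk(page_image)).flatten(1)  # target feature vector
            logits = self.classifier(vec)                       # second-type scores
            return vec, logits

Given a preprocessed batch img of shape (N, 3, 224, 224), calling vec, logits = FeatureVectorExtractor(num_classes)(img) yields the target feature vector, and logits.argmax(dim=1) gives the second type.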
In conclusion, for target page images with different recognition difficulties, feature vector extraction models with different weights are used for type recognition, adapting the method to type recognition of images in different scenes.
And S140, determining the target type of the learning object in the target page image based on the first type, the target feature vector and the second type.
In some embodiments, the target type is determined by comparing whether the first type and the second type are the same.
In other embodiments, the target type is determined by comparing the tag type carried by the target feature vector to the first type.
In still other embodiments, the target type is determined by comparing the tag type carried by the target feature vector with the second type.
In still other embodiments, the target type is determined by comparing the tag type, the first type, and the second type carried by the target feature vector.
The embodiment of the disclosure provides a learning object recognition method: acquiring a target page image containing a learning object; performing object recognition processing on the target page image by using a pre-trained object recognition model to determine a first type of the learning object; performing feature extraction processing on the target page image by using a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object; and determining a target type of the learning object in the target page image based on the first type, the target feature vector and the second type. Therefore, for a page image containing a learning object, object recognition and feature extraction can be performed on the target page image by two different models, and the final category of the learning object is determined by combining the object recognition result and the feature extraction result. This improves the recognition accuracy for learning objects: even for page images of poor quality and/or page images containing a plurality of different learning objects with small differences, the type of the learning object can be accurately recognized, so that an explanation video matched with the learning object is loaded and the learning effect of students is ensured.
Further, after determining the target type, the method further comprises:
and S150, loading the explanation video of the learning object from a video library containing a plurality of learning objects based on the target type and playing the explanation video.
Specifically, the target type is matched against the video type of each explanation video in the video library, the explanation video of the learning object is loaded from the video library covering a plurality of learning objects, and the explanation video is played, so that students can grasp and understand the learning object by watching it, which raises their interest in learning.
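A minimal sketch of this matching step follows; VIDEO_LIBRARY and its entries are purely hypothetical data, since the disclosure does not specify the library's structure.

    from typing import Optional

    # Hypothetical video library keyed by learning-object type; paths illustrative.
    VIDEO_LIBRARY = {
        "sunflower": "videos/sunflower_lesson.mp4",
        "mars": "videos/mars_lesson.mp4",
    }

    def lookup_explanation_video(target_type: str) -> Optional[str]:
        """Return the explanation video matching the target type, if any."""
        return VIDEO_LIBRARY.get(target_type)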
In another embodiment of the present disclosure, in order to improve the efficiency of determining the target type, a plurality of vectors to be traversed having a high similarity with the target feature vector are first determined, and the target type is then determined by combining the tag types carried by these vectors with the first type and the second type.
For ease of understanding, fig. 2 shows a specific implementation procedure of S140 described above, and referring to fig. 2, the method for determining the target type includes the following steps.
S210, searching a plurality of vectors to be traversed, of which the similarity with the target feature vector meets a first similarity condition, from a pre-constructed feature vector search library.
In this embodiment, the target feature vector is sent to the feature vector search library for query, so as to search out a plurality of feature vectors most similar to the target feature vector from the feature vector search library, and the feature vectors are used as the vectors to be traversed.
The first similarity condition indicates that the similarity between the target feature vector and a feature vector in the feature vector search library is greater than a preset threshold, or that this similarity ranks within preset sorting positions.
Specifically, the similarity between the target feature vector and each feature vector in the feature vector search library is calculated, the similarities are sorted, the top-N similarities are selected (or the M similarities greater than a preset threshold), and the feature vectors corresponding to these N or M similarities are taken as the plurality of vectors to be traversed. Alternatively, N and M may be 5 or other values.
Alternatively, the feature vector search library may include, but is not limited to, an Elasticsearch database (ES database) or a cloud-native vector database (Milvus).
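For illustration, the retrieval step can be sketched against a small in-memory library, with cosine similarity standing in for the ES or Milvus backend; the function name and the default N = 5 are assumptions of the sketch.

    import numpy as np

    def search_vectors_to_traverse(target_vec: np.ndarray,
                                   library_vecs: np.ndarray,
                                   library_tags: list,
                                   top_n: int = 5):
        """Return the top-N library entries most similar to the target feature
        vector as (index, tag_type, similarity) tuples."""
        lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
        tgt = target_vec / np.linalg.norm(target_vec)
        sims = lib @ tgt                        # cosine similarity to every entry
        order = np.argsort(sims)[::-1][:top_n]  # indices of the N best matches
        return [(int(i), library_tags[i], float(sims[i])) for i in order]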
S220, determining a target type based on the tag type, the first type and the second type carried by the plurality of vectors to be traversed.
For ease of understanding, fig. 3 shows a specific implementation procedure of S220 described above, and referring to fig. 3, the method for determining the target type includes the following steps.
S310, judging whether the first type is consistent with the second type.
In this embodiment, the target type is first determined directly from the first type and the second type output by the two models. Specifically, it is judged whether the first type is consistent with the second type; if so, S320 is executed, otherwise S330 is executed.
S320, taking the first type as a target type.
In other cases, if the first type is consistent with the second type, the second type may also be directly targeted.
S330, selecting the vector with the highest similarity from the plurality of vectors to be traversed as the target similarity vector of the target feature vector.
Specifically, the similarity between each vector to be traversed and the target feature vector is calculated, the similarities corresponding to the plurality of vectors to be traversed are sorted, and the feature vector with the highest similarity is selected as the target similarity vector.
S340, judging whether the first type is consistent with the tag type carried by the target similarity vector.
In this embodiment, the target similarity vector carries a corresponding tag type, which may also be used to determine the target type. Specifically, if the first type is inconsistent with the second type, it is further judged whether the first type is consistent with the tag type carried by the target similarity vector; if so, S350 is executed, otherwise S360 is executed.
S350, taking the first type as a target type.
And S360, searching a plurality of first candidate feature vectors consistent with the tag type carried by the target feature vector from the plurality of vectors to be traversed, and searching a plurality of second candidate feature vectors consistent with the first type from the plurality of vectors to be traversed.
It can be understood that when the first type is inconsistent with the tag type carried by the target similarity vector, determining the target type from the target similarity vector alone is unreliable, and other reliable feature vectors need to be selected from the plurality of vectors to be traversed. Specifically, the plurality of vectors to be traversed are traversed; if the tag type carried by the currently traversed vector is consistent with the tag type carried by the target feature vector, the currently traversed vector is taken as a first candidate feature vector, and if the tag type carried by the currently traversed vector is consistent with the first type, it is taken as a second candidate feature vector, thereby obtaining a plurality of first candidate feature vectors and a plurality of second candidate feature vectors.
S370, the similarity corresponding to each first candidate feature vector is increased by a first value, and the similarity corresponding to each second candidate feature vector is increased by a second value.
In order to improve the reliability of determining the target type based on the plurality of first candidate feature vectors and the plurality of second candidate feature vectors, predetermined empirical values are obtained, comprising a first value S1 and a second value S2; S1 is added to the similarity corresponding to each first candidate feature vector, and S2 is added to the similarity corresponding to each second candidate feature vector, so that the boosted similarities are used in the subsequent decision.
S380, selecting a vector to be traversed with the highest similarity from the similarity corresponding to the first candidate feature vector after the first numerical value is increased and the second candidate feature vector after the second numerical value is increased, and taking the label type carried by the vector to be traversed with the highest similarity as the target type.
Specifically, the similarities of the first candidate feature vectors after adding the first value and of the second candidate feature vectors after adding the second value are sorted in descending or ascending order, the vector to be traversed with the highest similarity is selected according to the sorting result, and the tag type carried by that vector is taken as the target type.
Therefore, the target type can be determined comprehensively by combining the tag type carried by the target similarity vector with the first type and the second type, which improves the accuracy of determining the type of the learning object; and when determining the target type from the target similarity vector alone would be unreliable, boosting the similarities of the first and second candidate feature vectors repairs what would otherwise be an inaccurate determination of the target type.
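The decision flow of fig. 3 can be condensed into one function, sketched below under the assumption that the retrieved vectors are given as (tag type, similarity) pairs; the default boost values s1 and s2 and the final fallback are illustrative, since the disclosure leaves the empirical values unspecified.

    def determine_target_type(first_type: str,
                              second_type: str,
                              target_tag: str,
                              to_traverse: list,
                              s1: float = 0.05,
                              s2: float = 0.03) -> str:
        """to_traverse holds (tag_type, similarity) pairs for the retrieved
        vectors; target_tag is the tag type carried by the target feature
        vector; s1 and s2 stand in for the unspecified empirical boost values."""
        # S310/S320: the two models agree.
        if first_type == second_type:
            return first_type

        # S330-S350: fall back on the most similar retrieved vector.
        best_tag, _ = max(to_traverse, key=lambda pair: pair[1])
        if first_type == best_tag:
            return first_type

        # S360-S380: boost candidates agreeing with the target tag (by s1)
        # or with the first type (by s2), then take the highest-scoring tag.
        boosted = [(tag, sim + s1) for tag, sim in to_traverse if tag == target_tag]
        boosted += [(tag, sim + s2) for tag, sim in to_traverse if tag == first_type]
        if boosted:
            return max(boosted, key=lambda pair: pair[1])[0]
        return first_type   # assumed fallback; this case is not specified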
In yet another embodiment of the present disclosure, the first type may also be directly used as the target type, or the target type may be determined based on the target feature vector and the second type only, or the target type may be determined in combination with the first type, the second type, and the target feature vector.
Fig. 4 shows a logic diagram of another learning object recognition method provided by an embodiment of the present disclosure.
As shown in fig. 4, the learning object recognition method may include the following steps.
S410, acquiring a target page image containing a learning object.
S420, performing object recognition processing on the target page image by utilizing a pre-trained object recognition model, and determining a first type of a learning object.
Wherein, S410-S420 are similar to S110-S120, and are not described herein.
S430, taking the first type as the target type.
In this embodiment, the first type is directly taken as the target type once it is acquired, which improves the efficiency of type recognition.
S440, performing feature extraction processing on the target page image by utilizing a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object.
Wherein S440 is similar to S130, and will not be described herein.
S450, searching a plurality of vectors to be traversed, of which the similarity with the target feature vector meets a first similarity condition, from a pre-constructed feature vector search library.
Wherein S450 is similar to S210, and will not be described herein.
S460, selecting the vector to be traversed with the highest similarity from the plurality of vectors to be traversed as the target similarity vector of the target feature vector.
Wherein S460 is similar to S330, and will not be described herein.
S470, judging whether the tag type carried by the target similarity vector is consistent with the second type.
In this embodiment, the target similarity vector carries a corresponding tag type, which may also be used to determine the target type. Specifically, it is judged whether the tag type carried by the target similarity vector is consistent with the second type; if so, S480 is executed, otherwise S490 is executed.
S480, taking the tag type carried by the target similarity vector as the target type.
And S490, searching a plurality of first candidate feature vectors consistent with the tag type carried by the target feature vector from the plurality of vectors to be traversed.
It can be understood that when the tag type carried by the target similarity vector is inconsistent with the second type, determining the target type from the target similarity vector alone is unreliable, and other reliable feature vectors need to be selected from the plurality of vectors to be traversed. Specifically, the plurality of vectors to be traversed are traversed, and whenever the tag type carried by the currently traversed vector is consistent with the tag type carried by the target feature vector, the currently traversed vector is taken as a first candidate feature vector.
S491, the similarity corresponding to each first candidate feature vector is increased by a first numerical value.
In order to improve the reliability of determining the target type based on the plurality of first candidate feature vectors, a predetermined empirical value S1 is obtained, and S1 is added to the similarity corresponding to each first candidate feature vector, so that the boosted similarities are used in the subsequent decision.
S492, selecting a vector to be traversed with the highest similarity from the similarity corresponding to the first candidate feature vector after the first numerical value is increased, and taking the label type carried by the vector to be traversed with the highest similarity as the target type.
Specifically, the similarities of the first candidate feature vectors after adding the first value are sorted in descending or ascending order, the vector to be traversed with the highest similarity is selected according to the sorting result, and the tag type carried by that vector is taken as the target type.
Therefore, when determining the target type from the target similarity vector alone would be unreliable, boosting the similarities of the first candidate feature vectors repairs what would otherwise be an inaccurate determination of the target type.
S493, determining the target type of the learning object in the target page image based on the first type, the target feature vector and the second type.
Wherein S493 is similar to S140, see the description in fig. 2 and 3 for details.
In summary, the first type can be taken directly as the target type, the target type can be determined from the target feature vector and the second type, or the target type can be determined by combining the first type, the second type and the target feature vector.
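Similarly, the verification branch S460-S492 of fig. 4 can be sketched as follows, under the same assumptions as the sketch for fig. 3; the boost value and the final fallback are illustrative.

    def verify_type_via_retrieval(second_type: str,
                                  target_tag: str,
                                  to_traverse: list,
                                  s1: float = 0.05) -> str:
        """Fig. 4 branch S460-S492: confirm or repair the type using the
        retrieved vectors. to_traverse holds (tag_type, similarity) pairs;
        s1 stands in for the unspecified empirical boost value."""
        # S460-S480: trust the most similar vector when its tag matches type two.
        best_tag, _ = max(to_traverse, key=lambda pair: pair[1])
        if best_tag == second_type:
            return best_tag

        # S490-S492: otherwise boost the candidates whose tag matches the tag
        # carried by the target feature vector, and take the highest-scoring one.
        boosted = [(tag, sim + s1) for tag, sim in to_traverse if tag == target_tag]
        if boosted:
            return max(boosted, key=lambda pair: pair[1])[0]
        return second_type  # assumed fallback; this case is not specified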
The embodiment of the disclosure further provides a learning object recognition device for implementing the learning object recognition method, and the learning object recognition device is described below with reference to fig. 5. In the embodiment of the disclosure, the learning object recognition apparatus may be an electronic device or a server. The electronic device may include a mobile phone, a tablet computer, a desktop computer, a notebook computer, and other devices with communication functions. The server may be a cloud server or a server cluster, or other devices with storage and computing functions.
Fig. 5 shows a schematic structural diagram of a learning object recognition apparatus provided in an embodiment of the present disclosure.
As shown in fig. 5, the learning object recognition apparatus 500 may include:
an image acquisition module 510 for acquiring a target page image containing a learning object;
the object recognition module 520 is configured to perform object recognition processing on the target page image by using a pre-trained object recognition model, and determine a first type of a learning object;
The feature extraction module 530 is configured to perform feature extraction processing on the target page image by using a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object;
the type determining module 540 is configured to determine a target type of the learning object in the target page image based on the first type, the target feature vector, and the second type.
The embodiment of the disclosure provides a learning object recognition device, which acquires a target page image containing a learning object; performs object recognition processing on the target page image by using a pre-trained object recognition model to determine a first type of the learning object; performs feature extraction processing on the target page image by using a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object; and determines a target type of the learning object in the target page image based on the first type, the target feature vector and the second type. Therefore, for a page image containing a plurality of learning objects with small differences, object recognition and feature extraction can be performed on the target page image by two different models, and the final category of the learning object is determined by combining the object recognition result and the feature extraction result. This improves the recognition accuracy for learning objects: even for page images of poor quality and/or page images containing a plurality of different learning objects with small differences, the type of the learning object can be accurately recognized, so that an explanation video matched with the learning object is loaded and the learning interest of students is raised.
In some alternative embodiments, the object recognition module 520 includes:
the first self-attention processing unit is used for carrying out self-attention processing on the target page image according to a target coefficient based on a self-attention network in the object recognition model to obtain a first self-attention characteristic;
and the first classification unit is used for classifying the first self-attention characteristic based on a classification network in the object recognition model to obtain the first type.
In some alternative embodiments, feature extraction module 530 includes:
the second self-attention processing unit is used for carrying out self-attention processing on the target page image according to the target coefficient based on the self-attention network in the feature vector extraction model to obtain a second self-attention feature;
and the second classification unit is used for classifying the second self-attention feature based on a classification network in the feature vector extraction model to obtain the target feature vector and the second type.
In some alternative embodiments, the target coefficient satisfies Y = f((W / max(SUM(W) - W, W)) · X + b), where Y is the output value of a neuron of the self-attention network, f(·) is the activation function, W is the weight of the connection between a neuron of the self-attention network and a neuron of the classification network, X is the input value of the neuron of the self-attention network, and b is a bias value.
In some alternative embodiments, the type determination module 540 includes:
the searching unit is used for searching a plurality of vectors to be traversed, the similarity between the vectors and the target feature vector of which meets a first similarity condition, from a pre-constructed feature vector search library;
the first determining unit is used for determining the target type based on the tag type, the first type and the second type carried by the plurality of vectors to be traversed respectively.
In some alternative embodiments, the first determining unit is specifically configured to:
judging whether the first type is consistent with the second type;
if the first type is consistent with the second type, the first type is taken as the target type;
if the first type is inconsistent with the second type, selecting the vector with the highest similarity from the plurality of vectors to be traversed as the target similarity vector of the target feature vector;
judging whether the first type is consistent with the tag type carried by the target similarity vector;
and if the first type is consistent with the tag type carried by the target similarity vector, taking the first type as the target type.
In some alternative embodiments, the first determining unit is further configured to:
if the first type is inconsistent with the tag type carried by the target similarity vector, judging whether the tag type carried by the target similarity vector is consistent with the second type;
and if the tag type carried by the target similarity vector is consistent with the second type, taking the tag type carried by the target similarity vector as the target type.
In some alternative embodiments, the first determining unit is further configured to:
if the tag type carried by the target similarity vector is inconsistent with the second type, searching the plurality of vectors to be traversed for a plurality of first candidate feature vectors consistent with the tag type carried by the target feature vector, and searching the plurality of vectors to be traversed for a plurality of second candidate feature vectors consistent with the first type;
the similarity corresponding to each first candidate feature vector is increased by a first numerical value, and the similarity corresponding to each second candidate feature vector is increased by a second numerical value;
selecting a vector to be traversed with highest similarity from the similarity corresponding to the first candidate feature vector after the first numerical value is increased and the second candidate feature vector after the second numerical value is increased, and taking the label type carried by the vector to be traversed with the highest similarity as the target type.
In some alternative embodiments, the apparatus further comprises:
and the first determining module is used for taking the first type as the target type.
In some alternative embodiments, the apparatus further comprises:
the first search module is used for searching a plurality of vectors to be traversed, of which the similarity with the target feature vector meets a first similarity condition, from a pre-constructed feature vector search library;
the first selection module is used for selecting the vector to be traversed with the highest similarity from the plurality of vectors to be traversed as the target similarity vector of the target feature vector;
the judging module is used for judging whether the tag type carried by the target similarity vector is consistent with the second type;
and the second determining module is used for taking the tag type carried by the target similarity vector as the target type if the tag type carried by the target similarity vector is consistent with the second type.
In some alternative embodiments, the apparatus further comprises:
the second searching module is used for searching the plurality of vectors to be traversed for a plurality of first candidate feature vectors consistent with the tag type carried by the target feature vector if the tag type carried by the target similarity vector is inconsistent with the second type;
The heightening module is used for heightening the similarity corresponding to each first candidate feature vector by a first numerical value;
and the second selection module is used for selecting the vector to be traversed with the highest similarity from the similarity corresponding to the first candidate feature vector after the first numerical value is increased, and taking the label type carried by the vector to be traversed with the highest similarity as the target type.
In some alternative embodiments, the image acquisition module 510 includes:
an image acquisition unit for acquiring an initial page image;
the detection unit is used for carrying out four-point frame detection processing on the initial page image by utilizing a pre-trained page detection model to obtain page position information of the initial page image;
the cropping unit is used for cropping a candidate page image from the initial page image according to the page position information;
and the correction unit is used for correcting the initial shape of the page in the candidate page image into a target shape to obtain the target page image; a sketch of this acquisition pipeline follows.
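A sketch of this acquisition pipeline under stated assumptions: detect_page_corners stands in for the pre-trained page detection model (hypothetical), the corner order and output size are illustrative, and OpenCV's perspective warp performs both the cropping and the shape correction in one step.

```python
import cv2
import numpy as np

def acquire_target_page(initial_image, detect_page_corners,
                        out_w=960, out_h=1280):
    # four-point frame detection: corners as (tl, tr, br, bl), shape (4, 2)
    corners = np.asarray(detect_page_corners(initial_image), dtype=np.float32)
    target = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    # the perspective transform crops the candidate page out of the initial
    # image and corrects its initial (possibly skewed) shape to a rectangle
    m = cv2.getPerspectiveTransform(corners, target)
    return cv2.warpPerspective(initial_image, m, (out_w, out_h))
```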
In some optional embodiments, the target page image contains a plurality of learning objects, and the similarity between any two learning objects is greater than a preset similarity threshold; and/or the resolution corresponding to the target page image is smaller than a preset resolution threshold.
It should be noted that the learning object recognition device 500 shown in fig. 5 may perform the steps in the method embodiments shown in fig. 1 to 4 and achieve the processes and effects of those embodiments, which are not repeated here.
The exemplary embodiments of the present disclosure also provide an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor; when executed by the at least one processor, the computer program causes the electronic device to perform a method according to an embodiment of the present disclosure.
The present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 6, a block diagram of an electronic device 600 will now be described. The electronic device 600 may be a server or a client of the present disclosure, may be the electronic device described above, and is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store various programs and data required for the operation of the electronic device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the electronic device 600; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, Wi-Fi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above. For example, in some embodiments, the learning object recognition method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. In some embodiments, the computing unit 601 may be configured to perform the learning object recognition method by any other suitable means (e.g., by means of firmware).
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A learning object recognition method, characterized by comprising:
acquiring a target page image containing a learning object;
performing object recognition processing on the target page image by utilizing a pre-trained object recognition model, and determining a first type of the learning object;
performing feature extraction processing on the target page image by utilizing a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object;
and determining a target type of the learning object in the target page image based on the first type, the target feature vector and the second type.
2. The method of claim 1, wherein the performing object recognition processing on the target page image using a pre-trained object recognition model and determining a first type of the learning object comprises:
based on a self-attention network in the object recognition model, performing self-attention processing on the target page image according to a target coefficient to obtain a first self-attention characteristic;
and classifying the first self-attention characteristic based on a classification network in the object recognition model to obtain the first type.
3. The method according to claim 1, wherein the performing feature extraction processing on the target page image using a feature vector extraction model trained in advance to obtain a target feature vector and a second type of the learning object includes:
based on the self-attention network in the feature vector extraction model, performing self-attention processing on the target page image according to a target coefficient to obtain a second self-attention feature;
and classifying the second self-attention feature based on a classification network in the feature vector extraction model to obtain the target feature vector and the second type.
4. A method according to claim 2 or 3, characterized in that the target coefficient satisfies Y = f((W / max(max(W) - W, W)) · X + b), wherein Y is the output value of a neuron of the self-attention network, f(·) is an activation function, W is the weight of the connection between the neuron of the self-attention network and a neuron of the classification network, X is the input value of the neuron of the self-attention network, and b is a bias value (a worked sketch of this coefficient follows the claims).
5. The method of claim 1, wherein the determining the target type of the learning object in the target page image based on the first type, the target feature vector, and the second type comprises:
searching, from a pre-constructed feature vector search library, a plurality of vectors to be traversed whose similarity with the target feature vector meets a first similarity condition;
and determining the target type based on the tag types respectively carried by the plurality of vectors to be traversed, the first type, and the second type.
6. The method of claim 5, wherein the determining the target type based on the tag types respectively carried by the plurality of vectors to be traversed, the first type, and the second type comprises:
judging whether the first type is consistent with the second type;
if the first type is consistent with the second type, taking the first type as the target type;
if the first type is inconsistent with the second type, selecting the vector with the highest similarity from the plurality of vectors to be traversed as a target similarity vector of the target feature vector;
judging whether the first type is consistent with the tag type carried by the target similarity vector;
and if the first type is consistent with the tag type carried by the target similarity vector, taking the first type as the target type.
7. The method as recited in claim 6, further comprising:
if the first type is inconsistent with the tag type carried by the target similarity vector, judging whether the tag type carried by the target similarity vector is consistent with the second type;
and if the tag type carried by the target similarity vector is consistent with the second type, taking the tag type carried by the target similarity vector as the target type.
8. The method as recited in claim 7, further comprising:
if the tag type carried by the target similarity vector is inconsistent with the second type, searching, from the plurality of vectors to be traversed, a plurality of first candidate feature vectors whose tag types are consistent with the tag type carried by the target feature vector, and searching, from the plurality of vectors to be traversed, a plurality of second candidate feature vectors whose tag types are consistent with the first type;
increasing the similarity corresponding to each first candidate feature vector by a first numerical value, and increasing the similarity corresponding to each second candidate feature vector by a second numerical value;
and selecting, from the first candidate feature vectors and second candidate feature vectors with the increased similarities, the vector to be traversed with the highest similarity, and taking the tag type carried by that vector as the target type.
9. The method of claim 1, wherein after the performing object recognition processing on the target page image using a pre-trained object recognition model and determining a first type of the learning object, the method further comprises:
the first type is taken as the target type.
10. The method of claim 1, wherein after performing feature extraction processing on the target page image using a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object, the method further comprises:
searching, from a pre-constructed feature vector search library, a plurality of vectors to be traversed whose similarity with the target feature vector meets a first similarity condition;
selecting the vector to be traversed with the highest similarity from the plurality of vectors to be traversed as a target similarity vector of the target feature vector;
judging whether the tag type carried by the target similarity vector is consistent with the second type;
and if the tag type carried by the target similarity vector is consistent with the second type, taking the tag type carried by the target similarity vector as the target type.
11. The method as recited in claim 10, further comprising:
if the tag type carried by the target similarity vector is inconsistent with the second type, searching, from the plurality of vectors to be traversed, a plurality of first candidate feature vectors whose tag types are consistent with the tag type carried by the target feature vector;
increasing the similarity corresponding to each first candidate feature vector by a first numerical value;
and selecting, from the first candidate feature vectors with the increased similarities, the vector to be traversed with the highest similarity, and taking the tag type carried by that vector as the target type.
12. The method of claim 1, wherein the acquiring the target page image containing the learning object comprises:
acquiring an initial page image;
performing four-point frame detection processing on the initial page image by using a pre-trained page detection model to obtain page position information of the initial page image;
cropping a candidate page image from the initial page image according to the page position information;
and correcting the initial shape of the page in the candidate page image into a target shape to obtain the target page image.
13. The method according to any one of claims 1 to 12, wherein
the target page image contains a plurality of learning objects, and the similarity between any two learning objects is greater than a preset similarity threshold; and/or
the resolution corresponding to the target page image is smaller than a preset resolution threshold.
14. A learning object recognition apparatus, characterized by comprising:
the image acquisition module is used for acquiring a target page image containing a learning object;
the object recognition module is used for carrying out object recognition processing on the target page image by utilizing a pre-trained object recognition model and determining a first type of the learning object;
the feature extraction module is used for carrying out feature extraction processing on the target page image by utilizing a pre-trained feature vector extraction model to obtain a target feature vector and a second type of the learning object;
and the type determining module is used for determining the target type of the learning object in the target page image based on the first type, the target feature vector and the second type.
15. An electronic device, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the method of any one of claims 1 to 13.
16. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, causes the processor to implement the method of any one of claims 1 to 13.
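The target coefficient of claim 4 arrives garbled in this text; the sketch below adopts the balanced reading Y = f((W / max(max(W) - W, W)) · X + b) used above, with a sigmoid standing in for the unspecified activation f. This is an assumed reconstruction, not the patent's definitive formula.

```python
import numpy as np

def neuron_output(W, X, b):
    # assumed reading: each weight is rescaled element-wise by
    # W / max(max(W) - W, W) before the usual affine map and activation
    scale = W / (np.maximum(np.max(W) - W, W) + 1e-12)
    z = float(np.dot(scale, X) + b)
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid standing in for f

# e.g. neuron_output(np.array([0.2, 0.5, 0.3]), np.array([1.0, -1.0, 0.5]), 0.1)
```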
CN202310604922.2A 2023-05-25 2023-05-25 Learning object recognition method, device, equipment and storage medium Pending CN116469121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310604922.2A CN116469121A (en) 2023-05-25 2023-05-25 Learning object recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310604922.2A CN116469121A (en) 2023-05-25 2023-05-25 Learning object recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116469121A true CN116469121A (en) 2023-07-21

Family

ID=87182738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310604922.2A Pending CN116469121A (en) 2023-05-25 2023-05-25 Learning object recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116469121A (en)

Similar Documents

Publication Publication Date Title
US10936911B2 (en) Logo detection
CN112101437B (en) Fine granularity classification model processing method based on image detection and related equipment thereof
US11069052B2 (en) Damage identification result optimization method and apparatus
CN108205581B (en) Generating compact video feature representations in a digital media environment
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN113436100B (en) Method, apparatus, device, medium, and article for repairing video
US11875241B2 (en) Aspect pre-selection using machine learning
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
CN112101360B (en) Target detection method and device and computer readable storage medium
CN113837257B (en) Target detection method and device
CN109583389B (en) Drawing recognition method and device
CN113434716B (en) Cross-modal information retrieval method and device
CN111931859B (en) Multi-label image recognition method and device
CN112364204A (en) Video searching method and device, computer equipment and storage medium
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN113255501B (en) Method, apparatus, medium and program product for generating form recognition model
CN107291774B (en) Error sample identification method and device
CN112329762A (en) Image processing method, model training method, device, computer device and medium
CN111666816A (en) Method, device and equipment for detecting state of logistics piece
CN112766284A (en) Image recognition method and device, storage medium and electronic equipment
US8498978B2 (en) Slideshow video file detection
CN112348615B (en) Method and device for auditing information
CN116469121A (en) Learning object recognition method, device, equipment and storage medium
CN111709283A (en) Method and device for detecting state of logistics piece
CN114677691B (en) Text recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination